How to count how many times words from one column appear in another column

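I am looking for a solution in R to count how many times the words in one or more columns of a data frame appear in another column of the same data frame.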

I have a data frame with 4 columns (page, text, wildanimals and animals).

df <- tibble::tibble(page=c(12,6,9,18,2),text=c("Dogs are related to wolves,but dogs are friendly","I love pets","I like goat and deer. Deer and goat","Zebra have stripes on their body","Lizards are Crocodiles have tails"
                        ))
wildanimals <- c("wolves","tiger","deer","zebra","crocodile")
animals <- c("dogs","cats","goat","horse","lizard")

# keep the combined data frame as df1, the name used in the solution below
df1 <- cbind(df,animals,wildanimals)

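I want to check whether the words in the animals and wildanimals columns occur in the text column of the same row, and how many times. Something like this: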

frequency <- c("3","0","4","1","2")
cbind(df,wildanimals,frequency)

I asked a similar question here: Link to the Question, but the answer there only tells you whether the word is present, not how often.

Solution

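We can use str_count to count the occurrences of the 'animals' word and of the 'wildanimals' word separately in each row, and then add (+) the two counts together. Wrapping the pattern in coll with ignore_case = TRUE makes the match case-insensitive, so "Dogs" and "dogs" are both counted.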

library(dplyr)
library(stringr)
df1 <- df1 %>% 
   mutate(frequency = str_count(text,coll(animals,ignore_case = TRUE)) + 
                      str_count(text,coll(wildanimals,ignore_case = TRUE)))
df1
#  page                                              text animals wildanimals frequency
#1   12 Dogs are related to wolves,but dogs are friendly    dogs      wolves         3
#2    6                                       I love pets    cats       tiger         0
#3    9               I like goat and deer. Deer and goat    goat        deer         4
#4   18                  Zebra have stripes on their body   horse       zebra         1
#5    2                 Lizards are Crocodiles have tails  lizard   crocodile         2

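Or, more compactly, paste the 'animals' and 'wildanimals' columns together with str_c using sep = "|", and wrap the result in regex instead of coll inside str_count, so that each row is matched against a single alternation pattern.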

df1 <- df1 %>%
   mutate(frequency = str_count(text,regex(str_c(animals,wildanimals,sep="|"),ignore_case = TRUE)))
df1
#   page                                              text animals wildanimals frequency
#1   12 Dogs are related to wolves,but dogs are friendly    dogs      wolves         3
#2    6                                       I love pets    cats       tiger         0
#3    9               I like goat and deer. Deer and goat    goat        deer         4
#4   18                  Zebra have stripes on their body   horse       zebra         1
#5    2                 Lizards are Crocodiles have tails  lizard   crocodile         2
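
To make the compact version easier to follow, here is a minimal sketch of the intermediate patterns that str_c builds. It reuses the two word vectors from the question; the object name `patterns` is introduced here only for illustration.

```r
library(stringr)

animals     <- c("dogs", "cats", "goat", "horse", "lizard")
wildanimals <- c("wolves", "tiger", "deer", "zebra", "crocodile")

# One alternation pattern per row: "dogs|wolves", "cats|tiger", ...
patterns <- str_c(animals, wildanimals, sep = "|")
patterns

# The third text is counted against the third pattern ("goat|deer")
n_third <- str_count("I like goat and deer. Deer and goat",
                     regex(patterns[3], ignore_case = TRUE))
n_third
```

Because str_count is vectorised over both the string and the pattern, passing the whole text column and the whole pattern vector pairs them up position by position, which is what the mutate call above relies on.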

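Note: %>% does not modify the object in place, so the result has to be assigned (<-) back to the original object. To create the column in place in the original object 'df1', use the compound assignment pipe (%<>%) from magrittr.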
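
A minimal sketch of the in-place variant, assuming the magrittr package is installed. The two-row data frame `demo` is made up for illustration and is not the question's data.

```r
library(dplyr)
library(stringr)
library(magrittr)  # provides the compound assignment pipe %<>%

# Hypothetical two-row data frame, for illustration only
demo <- tibble::tibble(
  text    = c("Dogs chase dogs", "I love pets"),
  animals = c("dogs", "cats")
)

# %<>% pipes demo into mutate() and assigns the result back to demo
demo %<>%
  mutate(frequency = str_count(text, coll(animals, ignore_case = TRUE)))

demo$frequency
```

Note that dplyr re-exports only %>%, so magrittr has to be attached explicitly for %<>% to be available.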

Another way to use str_count is to combine animals and wildanimals into a single pattern with glue::glue(), and then match it against a lower-cased version of text:

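library(tidyverse)

# glue is installed with the tidyverse but not attached by library(tidyverse),
# so it is called with its namespace prefix
df %>% mutate(frequency = str_count(tolower(text),glue::glue("{animals}|{wildanimals}")))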
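
All of the approaches above pair each text with the words on the same row. If the intent is instead to count any word from either list in every text, one possible variation (an assumption about the requirement, not something stated in the question) is to collapse both vectors into a single pattern. The name `any_word` is introduced here for illustration.

```r
library(dplyr)
library(stringr)

df <- tibble::tibble(
  page = c(12, 6, 9, 18, 2),
  text = c("Dogs are related to wolves,but dogs are friendly",
           "I love pets",
           "I like goat and deer. Deer and goat",
           "Zebra have stripes on their body",
           "Lizards are Crocodiles have tails")
)
wildanimals <- c("wolves", "tiger", "deer", "zebra", "crocodile")
animals     <- c("dogs", "cats", "goat", "horse", "lizard")

# A single alternation over all ten words, applied to every row
any_word <- str_c(c(animals, wildanimals), collapse = "|")

res <- df %>%
  mutate(frequency = str_count(text, regex(any_word, ignore_case = TRUE)))
res$frequency
```

On this particular data the counts happen to match the row-wise result, but they would differ as soon as a text mentions a word listed on another row. As with the other versions, the match is on substrings, so "Lizards" counts as an occurrence of "lizard".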
