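How to count how many times the words in one column appear in another column
I am looking for an R solution to count how many times the words in one or more columns of a data frame appear in another column.
I have a DF with 4 columns (page, text, wildanimals and animals).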
df <- tibble::tibble(
  page = c(12, 6, 9, 18, 2),
  text = c("Dogs are related to wolves,but dogs are friendly",
           "I love pets",
           "I like goat and deer. Deer and goat",
           "Zebra have stripes on their body",
           "Lizards are Crocodiles have tails")
)
wildanimals <- c("wolves", "tiger", "deer", "zebra", "crocodile")
animals <- c("dogs", "cats", "goat", "horse", "lizard")
cbind(df, animals, wildanimals)
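I want to check whether the words in the animals and wildanimals columns occur in the text column, and how many times. Something like this: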
frequency <- c("3","0","4","1","2")
cbind(df,wildanimals,frequency)
I asked a similar question here: Link to the Question, but the answer there only tells you whether the word is present, not how often.
Solution
We can use str_count to count the words from 'animals' and from 'wildanimals' separately, then add (+) the two counts together.
library(dplyr)
library(stringr)
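# df1 is assumed to be the question's data with both word columns attached
df1 <- cbind(df, animals, wildanimals)
df1 <- df1 %>%
  mutate(frequency = str_count(text, coll(animals, ignore_case = TRUE)) +
           str_count(text, coll(wildanimals, ignore_case = TRUE)))
df1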
#  page                                             text animals wildanimals frequency
#1   12 Dogs are related to wolves,but dogs are friendly    dogs      wolves         3
#2    6                                      I love pets    cats       tiger         0
#3    9              I like goat and deer. Deer and goat    goat        deer         4
#4   18                 Zebra have stripes on their body   horse       zebra         1
#5    2                Lizards are Crocodiles have tails  lizard   crocodile         2
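The counts line up per row because str_count is vectorised over both the string and the pattern: element i of text is matched only against element i of the word column. A minimal sketch of that pairing, using made-up vectors `txt` and `words` that are not part of the question:

```r
library(stringr)

# Hypothetical toy inputs, not taken from the question
txt   <- c("Dogs chase dogs", "Cats nap")
words <- c("dogs", "dogs")

# Element i of txt is matched only against element i of words;
# coll(..., ignore_case = TRUE) makes the comparison case-insensitive
counts <- str_count(txt, coll(words, ignore_case = TRUE))
counts
# [1] 2 0
```

If every row should instead be checked against the full word list, the pattern would have to be a single string rather than a per-row vector.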
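Or, more compactly, paste the 'animals' and 'wildanimals' columns together (here with str_c) using | as the sep, and wrap the result in regex instead of coll inside str_count: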
df1 <- df1 %>%
  mutate(frequency = str_count(text, regex(str_c(animals, wildanimals, sep = "|"), ignore_case = TRUE)))
df1
#  page                                             text animals wildanimals frequency
#1   12 Dogs are related to wolves,but dogs are friendly    dogs      wolves         3
#2    6                                      I love pets    cats       tiger         0
#3    9              I like goat and deer. Deer and goat    goat        deer         4
#4   18                 Zebra have stripes on their body   horse       zebra         1
#5    2                Lizards are Crocodiles have tails  lizard   crocodile         2
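Note: %>% does not modify the object in place, so we need to assign (<-) the result back to the original object. To update the original object 'df1' in place, use the magrittr compound assignment operator (%<>%).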
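As an illustration of the compound assignment operator, here is a small sketch on a made-up data frame `toy` (hypothetical, not the question's data). Note that %<>% needs magrittr attached explicitly, since dplyr only re-exports %>%.

```r
library(dplyr)
library(stringr)
library(magrittr)  # provides %<>%

# Hypothetical toy data frame, not the question's data
toy <- data.frame(
  text    = c("Goat and goat", "No match"),
  animals = c("goat", "horse")
)

# Equivalent to: toy <- toy %>% mutate(...)
toy %<>% mutate(frequency = str_count(text, coll(animals, ignore_case = TRUE)))
toy$frequency
# [1] 2 0
```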
Another approach with str_count is to combine animals and wildanimals with glue::glue(), then match against a lowercased version of text:
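library(tidyverse)
# glue is installed with the tidyverse but not attached by library(tidyverse),
# so it is called with its namespace prefix here
df %>% mutate(frequency = str_count(tolower(text), glue::glue("{animals}|{wildanimals}")))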
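One thing to keep in mind: all of the patterns above match substrings, which is why "lizard" is counted inside "Lizards" and the expected frequency for the last row is 2. If whole-word matches were wanted instead (an assumption, since the question's expected output relies on substring matching), one option is to wrap the alternation in \\b word boundaries. A minimal sketch comparing the two on the last row's text:

```r
library(stringr)

txt <- "Lizards are Crocodiles have tails"

# Substring matching, as in the answers above
pat_sub  <- regex("lizard|crocodile", ignore_case = TRUE)
# Whole-word matching: \\b requires a word boundary on both sides
pat_word <- regex("\\b(lizard|crocodile)\\b", ignore_case = TRUE)

n_sub  <- str_count(txt, pat_sub)
n_word <- str_count(txt, pat_word)
n_sub   # 2: matches inside "Lizards" and "Crocodiles"
n_word  # 0: neither plural is an exact whole word
```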