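How to subset/keep all rows containing at least two specific text strings in R
I have a data frame with different text excerpts.
I want to subset all observations that contain at least 2 terms from my small dictionary ("poverty|report|alarming|inflation"), or the same term twice (e.g. "report" appearing twice in the text).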
texts <- data.frame(text = c("report highlights that poverty is widespread","there is inflation","alarming reports","thanks for listening"),id = 1:4,group = 4:7)
texts[grepl("poverty|report|alarming|inflation",texts$text,ignore.case=T),]
# I don't want this:
#                                           text id group
#1 report highlights that poverty is widespread  1     4
#2                           there is inflation  2     5
#3                             alarming reports  3     6
But I want this:
#                                           text id group
#1 report highlights that poverty is widespread  1     4
#3                             alarming reports  3     6
Solution
This works:
library(stringr)
library(dplyr)
# Keep rows where the pattern matches more than once
texts %>% filter(str_count(text, pattern = "poverty|report|alarming|inflation") > 1)
#                                           text id group
#1 report highlights that poverty is widespread  1     4
#2                             alarming reports  3     6
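The grepl() call in the question used ignore.case, whereas the str_count() filter above matches case-sensitively. A minimal sketch, assuming capitalized terms should also count, wraps the pattern in stringr's regex() with ignore_case = TRUE. The names texts_ci, pattern_ci, counts_ci and result_ci, and the capitalized sample row, are made up for illustration and are not from the original post.

```r
library(stringr)

# Hypothetical sample data; the capitalized terms are an assumption for illustration
texts_ci <- data.frame(
  text = c("Report highlights that Poverty is widespread",
           "there is inflation"),
  id = 1:2,
  stringsAsFactors = FALSE
)

# regex(..., ignore_case = TRUE) makes the match case-insensitive
pattern_ci <- regex("poverty|report|alarming|inflation", ignore_case = TRUE)

# Count matches per row and keep rows with at least two
counts_ci <- str_count(texts_ci$text, pattern_ci)
result_ci <- texts_ci[counts_ci >= 2, ]
```

With a case-sensitive pattern, the first row here would count zero matches and be dropped.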
Try this base R approach:
# Data
texts <- data.frame(text = c("report highlights that poverty is widespread","there is inflation","alarming reports","thanks for listening"),id = 1:4,group = 4:7,stringsAsFactors = FALSE)
# Index: for each row (MARGIN = 1), split the text into words and count the words that match
Index <- apply(texts[,1,drop=FALSE],1,function(x)sum(grepl("poverty|report|alarming|inflation",unlist(strsplit(x,split =' ')),ignore.case=TRUE)))
# Subset: keep rows with at least two matching words
texts[which(Index>=2),]
Output:
                                          text id group
1 report highlights that poverty is widespread  1     4
3                             alarming reports  3     6
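The approach above counts matching words, so a term that appears twice inside one space-delimited token is counted once. As an alternative sketch in base R only, gregexpr() with regmatches() counts every occurrence of the pattern in each string. The helper name count_matches and the variables n_hits and result are hypothetical, introduced here for illustration.

```r
# Hypothetical helper: number of regex matches in each element of x (0 if none)
count_matches <- function(x, pattern) {
  m <- gregexpr(pattern, x, ignore.case = TRUE)
  # regmatches() returns character(0) for strings without a match
  lengths(regmatches(x, m))
}

texts <- data.frame(
  text = c("report highlights that poverty is widespread",
           "there is inflation",
           "alarming reports",
           "thanks for listening"),
  id = 1:4,
  group = 4:7,
  stringsAsFactors = FALSE
)

# Count occurrences per row and keep rows with at least two
n_hits <- count_matches(texts$text, "poverty|report|alarming|inflation")
result <- texts[n_hits >= 2, ]
```

On this sample data both approaches keep the same rows; they can differ when the text contains punctuation-joined or repeated terms within a single word.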