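How to search for words in text passages and flag them in R
I have a text dataset in which I want to search for various words and flag each one when it is found. Here is some sample data: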
library(data.table)

df <- data.table(
  "id" = c(1:3),
  "report" = c(
    "Travel opens our eyes to art,history,and culture – but it also introduces us to culinary adventures we may have never imagined otherwise.",
    "We quickly observed that no one in Sicily cooks with recipes (just with the heart),so we now do the same.",
    "We quickly observed that no one in Sicily cooks with recipes so we now do the same."
  ),
  "summary" = c(
    "On our first trip to Sicily to discover our family roots,",
    "If you’re not a gardener,an Internet search for where to find zucchini flowers results.",
    "add some fresh cream to make the mixture a bit more liquid,"
  )
)
So far I have been using SQL to handle this, but it becomes unwieldy when the list of words to look for is long.
library(sqldf)

# One hand-written "case when" per word; this is the part that does not scale
dfOne <- sqldf("select id,
  case when lower(report) like '%opens%' then 1 else 0 end as opens,
  case when lower(report) like '%cooks%' then 1 else 0 end as cooks,
  case when lower(report) like '%internet%' then 1 else 0 end as internet,
  case when lower(report) like '%zucchini%' then 1 else 0 end as zucchini,
  case when lower(report) like '%fresh%' then 1 else 0 end as fresh
from df
")
I am looking for ideas on how to do this more efficiently. With a long list of target words, this code would become needlessly long.
Thanks,
SM.
Solution
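1) sqldf
Define a vector of words and then convert it to SQL. Note that case when is not needed, because like already produces a 0/1 result. Prefixing sqldf with fn$ enables string interpolation, so that $like substitutes the R character string like into the SQL statement. Use the verbose = TRUE argument of sqldf to view the generated SQL statement. This is only two lines of code no matter how long words is.
words <- c("opens","cooks","internet","zucchini","fresh","test me")
like <- toString(sprintf("\nlower(report) like '%%%s%%' as '%s'",words,words))
fn$sqldf("select id,$like from df",verbose = TRUE)
giving:
  id opens cooks internet zucchini fresh test me
1  1     1     0        0        0     0       0
2  2     0     1        0        0     0       0
3  3     0     1        0        0     0       0
2) outer
Using words from above, outer can be used as follows. Note that the function passed to outer (its third argument) must be vectorized; grepl can be made vectorized with Vectorize, as shown. Omit check.names = FALSE if you do not mind the column names for words containing spaces or punctuation being converted into syntactic R variable names. The output is the same as in (1).
with(df,data.frame(
id,+t(outer(setNames(words,words),report,Vectorize(grepl))),check.names = FALSE
))
3) sapply
Using sapply, we can get a slightly shorter solution along the same lines as (2). The output is the same as in (1) and (2).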
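As a rough sketch of what the sapply variant in (3) might look like: the data below is a shortened stand-in for the question's df, built with base data.frame so no packages are needed. Lower-casing report and passing fixed = TRUE are my own assumptions here, meant to mimic the case-insensitive, literal matching of the SQL like version; they are not taken from the answer itself.

```r
# Shortened stand-in for the question's data (assumption: abbreviated sentences)
df <- data.frame(
  id = 1:3,
  report = c(
    "Travel opens our eyes to art, history, and culture.",
    "No one in Sicily cooks with recipes (just with the heart).",
    "No one in Sicily cooks with recipes, so we now do the same."
  ),
  stringsAsFactors = FALSE
)
words <- c("opens", "cooks", "internet", "zucchini", "fresh", "test me")

# sapply() calls grepl(word, tolower(report), fixed = TRUE) once per word and
# binds the logical vectors into a matrix with one column per word.
# Unary + turns TRUE/FALSE into 1/0; check.names = FALSE keeps "test me" as is.
flags <- with(df, data.frame(
  id,
  +sapply(words, grepl, tolower(report), fixed = TRUE),
  check.names = FALSE
))
print(flags)
```

Because the word list is only ever referenced through words, adding more search terms means extending that one vector and nothing else.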
Here is a tidy way. It assumes you want to search two separate columns.
library(tidyverse)
df <- tibble(id = c(1:3),report = c("Travel opens our eyes to art,history,and culture – but it also introduces us to culinary adventures we may have never imagined otherwise.","We quickly observed that no one in Sicily cooks with recipes (just with the heart),so we now do the same.","We quickly observed that no one in Sicily cooks with recipes so we now do the same."),summary = c("On our first trip to Sicily to discover our family roots,","If you’re not a gardener,an Internet search for where to find zucchini flowers results.","add some fresh cream to make the mixture a bit more liquid,"))
# Vector of words
vec <- c('eyes','art','gardener','mixture','trip')
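# Flag a row when any word in vec occurs in the given column
df %>%
  mutate(reportFlag = case_when(
    str_detect(report, paste(vec, collapse = '|')) ~ TRUE,
    TRUE ~ FALSE
  )) %>%
  mutate(summaryFlag = case_when(
    str_detect(summary, paste(vec, collapse = '|')) ~ TRUE,
    TRUE ~ FALSE
  ))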
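One thing worth checking with the alternation pattern built by paste(vec, collapse = '|') is that it matches substrings, so "art" also hits "heart". The sketch below uses base grepl instead of str_detect so that it runs without the tidyverse, on shortened stand-in sentences. Wrapping the pattern in \b word boundaries is my own suggested refinement and an assumption, not part of the answer above.

```r
vec <- c("eyes", "art", "gardener", "mixture", "trip")

# Shortened stand-in for the question's report column (assumption)
report <- c(
  "Travel opens our eyes to art, history, and culture.",
  "No one in Sicily cooks with recipes (just with the heart).",
  "No one in Sicily cooks with recipes, so we now do the same."
)

# Plain alternation: "eyes|art|gardener|mixture|trip"
plain <- paste(vec, collapse = "|")

# Assumed refinement: \b...\b so that only whole words match
whole <- paste0("\\b(", plain, ")\\b")

plain_flag <- grepl(plain, report, perl = TRUE)  # row 2 matches via "heart"
whole_flag <- grepl(whole, report, perl = TRUE)  # row 2 no longer matches

print(data.frame(plain_flag, whole_flag))
```

The same whole pattern string could be handed to str_detect in the pipeline above if whole-word matching is what you want.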