Searching for words in text passages and flagging them in R

I have a text dataset in which I want to search for various words and flag each one when it is found. Here is some sample data:

library(data.table)
library(sqldf)

df <- data.table(
  id = 1:3,
  report = c(
    "Travel opens our eyes to art,history,and culture – but it also introduces us to culinary adventures we may have never imagined otherwise.",
    "We quickly observed that no one in Sicily cooks with recipes (just with the heart),so we now do the same.",
    "We quickly observed that no one in Sicily cooks with recipes so we now do the same."
  ),
  summary = c(
    "On our first trip to Sicily to discover our family roots,",
    "If you’re not a gardener,an Internet search for where to find zucchini flowers results.",
    "add some fresh cream to make the mixture a bit more liquid,"
  )
)

So far I have been using SQL to handle this, but it becomes challenging when the list of words to look for is long.

dfOne <- sqldf("
  select id,
         case when lower(report) like '%opens%' then 1 else 0 end as opens,
         case when lower(report) like '%cooks%' then 1 else 0 end as cooks,
         case when lower(report) like '%internet%' then 1 else 0 end as internet,
         case when lower(report) like '%zucchini%' then 1 else 0 end as zucchini,
         case when lower(report) like '%fresh%' then 1 else 0 end as fresh
  from df
")

I am looking for ideas on how to achieve this more efficiently. With a long list of target words, this code would become unnecessarily long.

Thanks

SM.

Solution

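1) sqldf

Define the vector of words and then convert it to SQL. Note that case when is not needed, because like already produces a 0/1 result. Prefacing sqldf with fn$ enables string interpolation, so $like substitutes the R character string like into the SQL statement. Use the verbose = TRUE argument of sqldf to see the generated SQL statement. However long words is, this is only two lines of code.

words <- c("opens", "cooks", "internet", "zucchini", "fresh", "test me")

like <- toString(sprintf("\nlower(report) like '%%%s%%' as '%s'", words, words))
fn$sqldf("select id, $like from df", verbose = TRUE)

giving:

  id opens cooks internet zucchini fresh test me
1  1     1     0        0        0     0       0
2  2     0     1        0        0     0       0
3  3     0     1        0        0     0       0

2) outer

Using words from above, outer can be used as follows. Note that the function passed to outer (its third argument) must be vectorized; grepl can be made vectorized with Vectorize, as shown. Omit check.names = FALSE if you do not mind column names for words containing spaces or other characters that are not valid in R names being converted to syntactic R variable names. The output is the same as in (1).

with(df, data.frame(
  id,
  +t(outer(setNames(words, words), report, Vectorize(grepl))),
  check.names = FALSE
))

3) sapply

Using sapply gives a slightly shorter solution along the same lines as (2). The output is the same as in (1) and (2).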
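The sapply variant described in (3) might look like the following sketch. It assumes the call mirrors (2), with sapply applying grepl to each word in turn; the small df is a shortened stand-in for the question's data (a plain data.frame with abbreviated text), not the original.

```r
# Shortened stand-in for the question's data (assumption: abbreviated text,
# base data.frame instead of data.table)
df <- data.frame(
  id = 1:3,
  report = c(
    "Travel opens our eyes to art",
    "no one in Sicily cooks with recipes (just with the heart)",
    "no one in Sicily cooks with recipes"
  ),
  stringsAsFactors = FALSE
)
words <- c("opens", "cooks", "internet", "zucchini", "fresh", "test me")

# sapply() calls grepl(word, report) once per word and binds the logical
# results into a matrix whose columns are named after the words.
# The unary + converts TRUE/FALSE to 1/0.
flags <- with(df, data.frame(id, +sapply(words, grepl, report), check.names = FALSE))
print(flags)
```

As in (2), check.names = FALSE keeps the column name "test me" intact instead of turning it into test.me.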

Here is a tidyverse way, assuming that you want to search two separate columns.

library(tidyverse)

df <- tibble(id = c(1:3),report = c("Travel opens our eyes to art,history,and culture – but it also introduces us to culinary adventures we may have never imagined otherwise.","We quickly observed that no one in Sicily cooks with recipes (just with the heart),so we now do the same.","We quickly observed that no one in Sicily cooks with recipes so we now do the same."),summary = c("On our first trip to Sicily to discover our family roots,","If you’re not a gardener,an Internet search for where to find zucchini flowers results.","add some fresh cream to make the mixture a bit more liquid,"))


# Vector of words
vec <- c('eyes','art','gardener','mixture','trip')

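df %>%
  mutate(reportFlag = case_when(
    # TRUE when any word in vec occurs in report
    str_detect(report, paste(vec, collapse = '|')) ~ TRUE,
    TRUE ~ FALSE
  )) %>%
  mutate(summaryFlag = case_when(
    # same check against the summary column
    str_detect(summary, paste(vec, collapse = '|')) ~ TRUE,
    TRUE ~ FALSE
  ))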
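One thing worth checking with an alternation pattern built by paste(vec, collapse = '|') is that it matches substrings, so 'art' also matches 'heart' in the second report. The base R sketch below assumes whole-word matching is what is wanted and wraps the alternation in \\b word boundaries. The sample strings are shortened stand-ins for the report column, and the names loose and strict are only illustrative.

```r
vec <- c("eyes", "art", "gardener", "mixture", "trip")

# Shortened stand-ins for the report column (assumption: abbreviated text)
report <- c(
  "Travel opens our eyes to art,history,and culture",
  "cooks with recipes (just with the heart)",
  "cooks with recipes so we now do the same"
)

# Plain alternation: matches anywhere, including inside longer words
loose <- paste(vec, collapse = "|")
loose_hits <- grepl(loose, report)
print(loose_hits)   # TRUE TRUE FALSE ("art" is found inside "heart")

# Whole words only: \\b marks a word boundary on each side of the group
strict <- paste0("\\b(", paste(vec, collapse = "|"), ")\\b")
strict_hits <- grepl(strict, report, perl = TRUE)
print(strict_hits)  # TRUE FALSE FALSE
```

The same strict pattern can be passed to str_detect in the pipeline above, since stringr's regular expressions also support \\b.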
