How can I unlist a string column to count matches?
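I want to count matching strings between two datasets. Each dataset has one column of genes and another column listing the genes that interact with them.
For example: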
#dataset1
Gene Interactors
ACE BRCA2,NOS2,SEPT9
HER2 AGT,TGRF
YUO SEPT9,TET2
I also have a second dataset with the same layout of genes and interacting genes. For example:
#dataset2
Gene Interactors
RTY ADFD,NOS3,SEPT9
TERT ADAM2,GERP
GHJ TET2,NOS2
I want to count, for each gene in dataset 1, how many of its Interactors also appear among the Interactors in dataset 2.
Example output:
Gene Interactors Secondary_interaction_count
ACE BRCA2,SEPT9 2 #SEPT9 and NOS2 are in the 2nd dataset under interacting genes
HER2 AGT,TGRF 0
YUO SEPT9,ADAM2,TET2 3 #all 3 are in dataset 2
So far I have two attempts. The first only returns TRUE or FALSE, and I don't know how to change it into a count:
temp <- unlist(strsplit(df2$interactors,','))
df1$secondary_count <- sapply(strsplit(df1$interactors,','),function(x) any(x %in% temp))
The other attempt, I think, does not split the strings, and I'm not sure how to modify it:
df1 %>%
mutate(secondary_count = str_count(interactors,str_c(df2$interactors,collapse = '|')))
Can either of these attempts be modified to return a count, or should I try a different approach?
Input data:
#df1:
structure(list(Gene = c("ACE","HER2","YUO"),Interactors = c("BRCA2,SEPT9","AGT,TGRF","SEPT9,TET2")),row.names = c(NA,-3L),class = c("data.table","data.frame"))
#df2:
structure(list(Gene = c("RTY","TERT","GHJ"),Interactors = c("ADFD,NOS3,SEPT9","ADAM2,GERP","TET2,NOS2")),row.names = c(NA,-3L),class = c("data.table","data.frame"))
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252
[2] LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sqldf_0.4-11 RSQLite_2.2.1 gsubfn_0.7 proto_1.0.0
[5] forcats_0.5.0 stringr_1.4.0 purrr_0.3.4 readr_1.4.0
[9] tidyr_1.1.2 tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0
[13] plyr_1.8.6 dplyr_1.0.2 data.table_1.13.2
loaded via a namespace (and not attached):
[1] gtools_3.8.2 tidyselect_1.1.0 haven_2.3.1 tcltk_4.0.2
[5] colorspace_1.4-1 vctrs_0.3.4 generics_0.0.2 chron_2.3-56
[9] blob_1.2.1 rlang_0.4.8 pillar_1.4.6 glue_1.4.1
[13] withr_2.3.0 DBI_1.1.0 bit64_4.0.5 dbplyr_1.4.4
[17] modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 munsell_0.5.0
[21] gtable_0.3.0 cellranger_1.1.0 rvest_0.3.6 memoise_1.1.0
[25] fansi_0.4.1 broom_0.7.2 Rcpp_1.0.5 scales_1.1.1
[29] backports_1.1.10 jsonlite_1.7.1 fs_1.5.0 bit_4.0.4
[33] hms_0.5.3 digest_0.6.27 stringi_1.5.3 grid_4.0.2
[37] cli_2.1.0 tools_4.0.2 magrittr_1.5 crayon_1.3.4
[41] pkgconfig_2.0.3 ellipsis_0.3.1 xml2_1.3.2 reprex_0.3.0
[45] lubridate_1.7.9 assertthat_0.2.1 httr_1.4.2 rstudioapi_0.11
[49] R6_2.4.1 compiler_4.0.2
Solution
Try this:
library(tidyr)
library(dplyr)
sep_rows <- . %>% separate_rows(Interactors,sep = ",")
df1 %>%
  sep_rows() %>%
  mutate(
    found = !is.na(match(Interactors,sep_rows(df2)$Interactors))
  ) %>%
  group_by(Gene) %>%
  summarise(
    Interactors = toString(Interactors),
    Secondary_interaction_count = sum(found)
  )
Output:
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 3
Gene Interactors Secondary_interaction_count
<chr> <chr> <int>
1 ACE BRCA2,NOS2,SEPT9 2
2 HER2 AGT,TGRF 0
3 YUO SEPT9,TET2 3
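The asker's first attempt can also be adapted directly in base R by replacing `any()` with `sum()`, since summing a logical vector counts its TRUE values. The sketch below assumes the two tables shown in the question as input; the name `pool` is a hypothetical helper and not part of the original answer.

```r
# Input assumed from the tables in the question (comma-separated interactors)
df1 <- data.frame(
  Gene = c("ACE", "HER2", "YUO"),
  Interactors = c("BRCA2,NOS2,SEPT9", "AGT,TGRF", "SEPT9,TET2"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Gene = c("RTY", "TERT", "GHJ"),
  Interactors = c("ADFD,NOS3,SEPT9", "ADAM2,GERP", "TET2,NOS2"),
  stringsAsFactors = FALSE
)

# Every distinct interactor mentioned anywhere in df2
pool <- unique(unlist(strsplit(df2$Interactors, ",", fixed = TRUE)))

# Split each row of df1 and count how many of its interactors are in the pool
df1$Secondary_interaction_count <- vapply(
  strsplit(df1$Interactors, ",", fixed = TRUE),
  function(x) sum(x %in% pool),
  integer(1)
)

df1
```

This keeps df1 in its original one-row-per-gene shape, so no `group_by()` and `summarise()` step is needed to collapse the rows again.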
Another attempt:
> df1 %>% separate_rows(Interactors) %>% rowwise() %>%
+ mutate(secondary_interactions = str_extract_all(Interactors,paste0(df2 %>% separate_rows(Interactors) %>% pull(Interactors),collapse = '|'))) %>%
+ unnest(secondary_interactions,keep_empty = T) %>% group_by(Gene) %>%
+ mutate(Interactors = toString(Interactors),secondary_interactions_cnt = case_when(is.na(secondary_interactions) ~ 0,TRUE ~ 1)) %>%
+ mutate(secondary_interactions = sum(secondary_interactions_cnt)) %>% select(-4)%>% distinct()
# A tibble: 3 x 3
# Groups: Gene [3]
Gene Interactors secondary_interactions
<chr> <chr> <dbl>
1 ACE BRCA2,SEPT9 2
2 HER2 AGT,TGRF 0
3 YUO SEPT9,TET2 3
>
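The regex idea from the asker's second attempt can also be made to count, provided the interactors in df2 are split into individual gene names before the pattern is built. A minimal sketch in base R, assuming the tables from the question and gene names made only of letters and digits (names containing regex metacharacters would need escaping first); `pool`, `pattern`, and `counts` are hypothetical names:

```r
# Input assumed from the tables in the question (comma-separated interactors)
df1 <- data.frame(
  Gene = c("ACE", "HER2", "YUO"),
  Interactors = c("BRCA2,NOS2,SEPT9", "AGT,TGRF", "SEPT9,TET2"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Gene = c("RTY", "TERT", "GHJ"),
  Interactors = c("ADFD,NOS3,SEPT9", "ADAM2,GERP", "TET2,NOS2"),
  stringsAsFactors = FALSE
)

# Split df2 into single gene names first, then build one alternation pattern.
# The \\b anchors stop a short name from matching inside a longer one.
pool <- unique(unlist(strsplit(df2$Interactors, ",", fixed = TRUE)))
pattern <- paste0("\\b(", paste(pool, collapse = "|"), ")\\b")

# Count the pattern matches in each row of df1
hits <- regmatches(df1$Interactors, gregexpr(pattern, df1$Interactors, perl = TRUE))
counts <- lengths(hits)

df1$Secondary_interaction_count <- counts
df1
```

With stringr loaded, `str_count(df1$Interactors, pattern)` should be an equivalent way to express the counting step. Unlike the `%in%` approach, a regex counts a gene twice if it is listed twice in the same row.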