R: How to match or filter variables that contain the same strings in a different order?
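I have a dataset with two variables that each hold a full name (first name and surname). However, the two variables order the words differently:
- variable1 is ordered as name surname
- variable2 is ordered as surname name
How can I filter the rows where variable1 = variable2? Or is it possible to reorder variable2 so that it matches the order of variable1?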
I created a small sample that replicates the dataset (note that some full names contain three or more words):
library(tidyverse)
name_surname <- c("John Smith One", "Jane Smith Two", "John Doe", "Nick Doe", "Chris Froome", "Van den Broeck", "Lance", "Van Dae Le Phillipe")
surname_name <- c("Smith One John", "Smith Two Jane", "Doe John", "Nick Doe", "Froome Chris", "Broeck Van den", "Lance", "Phillipe Van Dae Le")
tibble <- tibble(variable1 = name_surname, variable2 = surname_name)
tibble
#> # A tibble: 8 x 2
#>   variable1           variable2
#>   <chr>               <chr>
#> 1 John Smith One      Smith One John
#> 2 Jane Smith Two      Smith Two Jane
#> 3 John Doe            Doe John
#> 4 Nick Doe            Nick Doe
#> 5 Chris Froome        Froome Chris
#> 6 Van den Broeck      Broeck Van den
#> 7 Lance               Lance
#> 8 Van Dae Le Phillipe Phillipe Van Dae Le
Created on 2020-08-25 by the reprex package (v0.3.0)
Session info: devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os macOS Catalina 10.15.5
#> system x86_64,darwin17.0
#> ui X11
#> language (EN)
#> collate en_AU.UTF-8
#> ctype en_AU.UTF-8
#> tz Australia/Melbourne
#> date 2020-08-25
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> backports 1.1.8 2020-06-17 [1] CRAN (R 4.0.2)
#> blob 1.2.1 2020-01-20 [1] CRAN (R 4.0.2)
#> broom 0.7.0 2020-07-09 [1] CRAN (R 4.0.2)
#> callr 3.4.3 2020-03-28 [1] CRAN (R 4.0.2)
#> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.0.2)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.2)
#> colorspace 1.4-1 2019-03-18 [1] CRAN (R 4.0.2)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
#> DBI 1.1.0 2019-12-15 [1] CRAN (R 4.0.2)
#> dbplyr 1.4.4 2020-05-27 [1] CRAN (R 4.0.2)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.2)
#> devtools 2.3.1 2020-07-21 [1] CRAN (R 4.0.2)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.2)
#> dplyr * 1.0.1 2020-07-31 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.1)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.2)
#> forcats * 0.5.0 2020-03-01 [1] CRAN (R 4.0.2)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.2)
#> ggplot2 * 3.3.2 2020-06-19 [1] CRAN (R 4.0.2)
#> glue 1.4.1 2020-05-13 [1] CRAN (R 4.0.2)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.2)
#> haven 2.3.1 2020-06-01 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> hms 0.5.3 2020-01-08 [1] CRAN (R 4.0.2)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
#> httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
#> jsonlite 1.7.0 2020-06-25 [1] CRAN (R 4.0.2)
#> knitr 1.29 2020-06-23 [1] CRAN (R 4.0.2)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
#> lubridate 1.7.9 2020-06-08 [1] CRAN (R 4.0.2)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.2)
#> modelr 0.1.8 2020-05-19 [1] CRAN (R 4.0.2)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.2)
#> pillar 1.4.6 2020-07-10 [1] CRAN (R 4.0.2)
#> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2)
#> processx 3.4.3 2020-07-05 [1] CRAN (R 4.0.2)
#> ps 1.3.3 2020-05-08 [1] CRAN (R 4.0.2)
#> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.2)
#> Rcpp 1.0.5 2020-07-06 [1] CRAN (R 4.0.2)
#> readr * 1.3.1 2018-12-21 [1] CRAN (R 4.0.2)
#> readxl 1.3.1 2019-03-13 [1] CRAN (R 4.0.2)
#> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
#> reprex 0.3.0 2019-05-16 [1] CRAN (R 4.0.2)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 4.0.2)
#> rmarkdown 2.3 2020-06-18 [1] CRAN (R 4.0.2)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.2)
#> rvest 0.3.6 2020-07-25 [1] CRAN (R 4.0.2)
#> scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> stringi 1.4.6 2020-02-17 [1] CRAN (R 4.0.2)
#> stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.2)
#> tibble * 3.0.3 2020-07-10 [1] CRAN (R 4.0.2)
#> tidyr * 1.1.1 2020-07-31 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> tidyverse * 1.3.0 2019-11-21 [1] CRAN (R 4.0.2)
#> usethis 1.6.1 2020-04-29 [1] CRAN (R 4.0.2)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.2)
#> vctrs 0.3.2 2020-07-15 [1] CRAN (R 4.0.2)
#> withr 2.2.0 2020-04-20 [1] CRAN (R 4.0.2)
#> xfun 0.16 2020-07-24 [1] CRAN (R 4.0.2)
#> xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
Solution
Split both variables on whitespace, then reorder the words of variable2 to follow the order in variable1.
# Split each name into words, then pick the words of variable2 in variable1's order
tibble$variable3 <- mapply(function(x, y) paste(y[match(x, y)], collapse = " "),
                           strsplit(tibble$variable1, "\\s+"), strsplit(tibble$variable2, "\\s+"))
tibble
# A tibble: 8 x 3
#  variable1           variable2           variable3
#  <chr>               <chr>               <chr>
#1 John Smith One      Smith One John      John Smith One
#2 Jane Smith Two      Smith Two Jane      Jane Smith Two
#3 John Doe            Doe John            John Doe
#4 Nick Doe            Nick Doe            Nick Doe
#5 Chris Froome        Froome Chris        Chris Froome
#6 Van den Broeck      Broeck Van den      Van den Broeck
#7 Lance               Lance               Lance
#8 Van Dae Le Phillipe Phillipe Van Dae Le Van Dae Le Phillipe
A new variable (variable3) is created here for comparison purposes; if you prefer, you can overwrite variable2 in tibble instead.
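To see what match() is doing in that one-liner, here is a minimal sketch for a single row. The names x, y, z and idx are illustrative only and are not part of the original answer. The second half assumes a word that is missing from variable2: in that case match() returns NA and paste() writes the literal string "NA" into the result, so it may be worth checking for that before trusting the reordered column.

```r
# Words of one variable1 value and the corresponding variable2 value
x <- strsplit("Van den Broeck", "\\s+")[[1]]
y <- strsplit("Broeck Van den", "\\s+")[[1]]

# For each word of x, its position inside y
idx <- match(x, y)
idx
# 2 3 1

# Indexing y by those positions puts its words in x's order
reordered <- paste(y[idx], collapse = " ")
reordered
# "Van den Broeck"

# Assumed mismatch for illustration: "den" is not among the words of z
z <- strsplit("Broeck Van der", "\\s+")[[1]]
idx_bad <- match(x, z)
idx_bad
# 2 NA 1

# paste() turns the NA into the text "NA"
bad <- paste(z[idx_bad], collapse = " ")
bad
# "Van NA Broeck"
```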
The same logic as @Ronak Shah's answer, but using dplyr and tidyr:
tibble %>%
  rowid_to_column() %>%                     # remember which row each word came from
  separate_rows(variable1, variable2) %>%   # one word per row
  group_by(rowid) %>%
  mutate(variable2 = variable2[match(variable1, variable2)]) %>%
  summarise(across(starts_with("variable"), ~ paste(.x, collapse = " ")))
  rowid variable1           variable2
  <int> <chr>               <chr>
1     1 John Smith One      John Smith One
2     2 Jane Smith Two      Jane Smith Two
3     3 John Doe            John Doe
4     4 Nick Doe            Nick Doe
5     5 Chris Froome        Chris Froome
6     6 Van den Broeck      Van den Broeck
7     7 Lance               Lance
8     8 Van Dae Le Phillipe Van Dae Le Phillipe
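Neither answer shows the filtering step that the question also asks about. One possible sketch, assuming two names count as equal when they contain exactly the same words: same_words() is a hypothetical helper, not taken from either answer, and the sample vectors v1 and v2 are made up for illustration (base R only, so it runs without the tidyverse).

```r
# Hypothetical helper: TRUE when two strings contain exactly the same words,
# regardless of their order.
same_words <- function(a, b) {
  mapply(
    function(x, y) identical(sort(x), sort(y)),
    strsplit(a, "\\s+"),
    strsplit(b, "\\s+"),
    USE.NAMES = FALSE
  )
}

# Illustrative data: the second pair deliberately does not match
v1 <- c("John Smith One", "Nick Doe", "Lance", "Chris Froome")
v2 <- c("Smith One John", "Doe Jane", "Lance", "Froome Chris")

keep <- same_words(v1, v2)
keep
# TRUE FALSE TRUE TRUE

# Keep only the rows whose two names consist of the same words
df <- data.frame(variable1 = v1, variable2 = v2, stringsAsFactors = FALSE)
matched <- df[keep, ]
matched
```

Comparing the sorted word vectors, rather than reordering with match(), also handles pairs with a different number of words or with a word that appears in only one of the two names.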