Fast way to parse a vector of "continent / country / city" strings in R
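I have a character vector in R in which every string has the form "continent / country / city", for example
x=rep("Africa / Kenya / Nairobi",1000000)
However, " / " is occasionally mistyped as "/" without the surrounding spaces, and in some cases the city is missing, so the string is e.g. "Africa / Kenya" with no city.
I would like to parse it into three vectors (continent, country and city), using NA where the city is missing.
For the country I currently do something like this:
country = sapply(x,function(loc) trimws(strsplit(loc,"/",fixed = TRUE)[[1]][2]))
But this is slow when the vector x is long. What is an efficient way to parse this in R?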
Solution
You can try do.call with rbind. The lapply with "[" and 1:3 is used so that 3 results are returned even when the city is missing.
x <- c("Africa / Kenya / Nairobi","Africa/Kenya/Nairobi","Africa / Kenya")
y <- do.call(rbind,lapply(strsplit(x,"/",TRUE),"[",1:3))
y <- trimws(y,whitespace = " ")
y
#     [,1]     [,2]    [,3]
#[1,] "Africa" "Kenya" "Nairobi"
#[2,] "Africa" "Kenya" "Nairobi"
#[3,] "Africa" "Kenya" NA
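The question asks for three separate vectors rather than a matrix. A minimal sketch of that last step, assuming the same strsplit approach as above; the names parts, continent, country and city are illustrative and not from the original answer:

```r
x <- c("Africa / Kenya / Nairobi", "Africa/Kenya/Nairobi", "Africa / Kenya")

# Split on "/" and always take 3 fields, so a missing city becomes NA
parts <- do.call(rbind, lapply(strsplit(x, "/", fixed = TRUE), "[", 1:3))

# Remove the spaces left around each field
parts <- trimws(parts, whitespace = " ")

# One vector per column of the resulting character matrix
continent <- parts[, 1]
country   <- parts[, 2]
city      <- parts[, 3]
```

The whitespace argument of trimws needs R 3.6.0 or later; on older versions, calling trimws(parts) without it should give the same result for plain spaces.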
Or using data.table:
x <- c("Africa / Kenya / Nairobi","Africa / Kenya")
y <- do.call(cbind,data.table::tstrsplit(x,"/",TRUE))
y <- trimws(y,whitespace = " ")
y
#     [,1]     [,2]    [,3]
#[1,] "Africa" "Kenya" "Nairobi"
#[2,] "Africa" "Kenya" NA
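Benchmark (timings come from an R session with a German locale, so the third column, "verstrichen", is elapsed time)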
#x <- rep("Africa / Kenya / Nairobi",1000000) #Timings will depend on the used dataset
n <- 1e6L
f1 <- function(n) replicate(n,paste(sample(letters,sample(5:15,1)),collapse = ""))
f2 <- function(n) sample(c("/"," /","/ "," / "),n,TRUE)
set.seed(42)
x <- paste0(f1(n),f2(n),f1(n),sample(c(paste0(f2(n%/%2L),f1(n%/%2L)),rep("",n - n%/%2L))))
system.time( #Method given in the question
sapply(x,function(loc) trimws(strsplit(loc,"/",fixed = TRUE)[[1]][2])))
# User System verstrichen
# 47.718 0.004 47.798
system.time( #Using strsplit and trimws
trimws(do.call(rbind,lapply(strsplit(x,"/",TRUE),"[",1:3)),whitespace = " "))
# User System verstrichen
# 5.446 0.008 5.454
system.time( #Using data.table::tstrsplit and trimws
trimws(do.call(cbind,data.table::tstrsplit(x,"/",TRUE)),whitespace = " "))
# User System verstrichen
# 2.365 0.012 2.376
system.time( #Using readr::read_delim from @Anoushiravan R
readr::read_delim(x,delim = "/",quote = "",trim_ws = TRUE,col_names = FALSE))
# User System verstrichen
# 1.961 0.024 2.222
system.time( #Using data.table::tstrsplit with " */ *"
do.call(cbind,data.table::tstrsplit(x," */ *",perl=TRUE)))
# User System verstrichen
# 1.394 0.000 1.394
system.time( #Using read.table from @akrun
read.table(text = x,sep = "/",header = FALSE,fill = TRUE,strip.white = TRUE,na.strings = ""))
# User System verstrichen
# 1.298 0.004 1.302
system.time( #Using data.table::fread from @akrun
data.table::fread(text = paste(x,collapse="\n"),sep="/",na.strings = ""))
# User System verstrichen
# 1.146 0.016 0.996
system.time( #Using read.table with additional arguments
read.table(text = x,sep = "/",header = FALSE,fill = TRUE,strip.white = TRUE,na.strings = "",nrows=length(x),comment.char = "",colClasses = c("character")))
# User System verstrichen
# 1.076 0.000 1.076
system.time( #Using data.table::fread with stringr::str_c (or stringi::stri_c)
data.table::fread(text = stringr::str_c(x,collapse="\n"),sep="/",na.strings = ""))
# User System verstrichen
# 0.780 0.000 0.624
Using data.table::fread and building the input string with stringr::str_c looks to be the fastest of the methods given so far.
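Timings only matter if the methods return the same values, so it may be worth checking that on a small input first. The sketch below compares the fixed-string split plus trimws with the " */ *" regular-expression split, using base R strsplit in place of data.table::tstrsplit; the test vector and the names split_fixed and split_regex are illustrative assumptions:

```r
x <- c("Africa / Kenya / Nairobi", "Africa/Kenya/Nairobi",
       "Africa / Kenya", "Asia /Japan/ Tokyo")

# Variant 1: split on the literal "/" and trim the spaces afterwards
split_fixed <- trimws(
  do.call(rbind, lapply(strsplit(x, "/", fixed = TRUE), "[", 1:3)),
  whitespace = " ")

# Variant 2: let the pattern swallow any spaces around the slash
split_regex <- do.call(rbind, lapply(strsplit(x, " */ *", perl = TRUE), "[", 1:3))

# Stops with an error if the two matrices differ
stopifnot(identical(split_fixed, split_regex))
```

The regular-expression variant does not remove spaces at the very start or end of a string, so the two variants would differ on input such as " Africa / Kenya".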
Consider using read.table from base R:
read.table(text = x,sep = "/",header = FALSE,fill = TRUE,strip.white = TRUE,na.strings = "")
V1 V2 V3
1 Africa Kenya Nairobi
2 Africa Kenya Nairobi
3 Africa Kenya <NA>
Or using fread from data.table:
library(data.table)
fread(text = paste(x,collapse = "\n"),sep = "/",na.strings = "")
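   Africa Kenya Nairobi
1: Africa Kenya Nairobi
2: Africa Kenya    <NA>
Note that fread has used the first string as the header row in this output; passing header = FALSE should keep it as data.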
Benchmarks, with the following data:
x <- rep("Africa / Kenya / Nairobi",1000000)
> system.time(fread(text = paste(x,collapse = "\n"),sep = "/",na.strings = ""))
user system elapsed
0.473 0.024 0.496
> system.time(read.table(text = x,sep = "/",header = FALSE,
+ fill = TRUE,strip.white = TRUE,na.strings = ""))
user system elapsed
0.519 0.026 0.543
> system.time({ #Using data.table
+ y <- do.call(cbind,data.table::tstrsplit(x,"/",TRUE))
+ y <- trimws(y,whitespace = " ")
+ })
user system elapsed
2.035 0.051 2.067
I think this can also be used:
library(readr)
xx <- readr::read_delim(x,delim = "/",quote = "",trim_ws = TRUE,col_names = FALSE)
xx
# A tibble: 3 x 3
X1 X2 X3
<chr> <chr> <chr>
1 Africa Kenya Nairobi
2 Africa Kenya Nairobi
3 Africa Kenya NA