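How to clean a very long list of company names: applying a function to every row of a data.table
I have a data.table that contains company names and address information. I want to remove legal-entity suffixes and the most common words from the company names, so I wrote a function and applied it to my data.table.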
library(data.table)
library(stringr)

search_for_default <- c("inc","corp","co","llc","se","\\&","holding","professionals","services","international","consulting","the","for")

clean_strings <- function(string, search_for = search_for_default) {
  clean_step1 <- str_squish(str_replace_all(string, "[:punct:]", " "))  # replace punctuation with spaces
  clean_step2 <- unlist(str_split(tolower(clean_step1), " "))  # split into lowercase tokens
  clean_step2 <- clean_step2[!str_detect(clean_step2, "^american|^canadian")]  # drop geographical tokens
  res <- str_squish(str_c(clean_step2[!clean_step2 %in% search_for], collapse = " "))  # drop legal entities and common words
  res <- paste(unique(unlist(str_split(res, " "))), collapse = " ")  # drop duplicate tokens and paste the string back together
  return(res)
}

datatable[, COMPANY_NAME_clean := clean_strings(COMPANY_NAME), by = COMPANY_NAME]
The script works fine. However, on a large dataset (> 3b rows) it takes a very long time. Is there a more efficient way to do this?
Example:
Input:
Company_Name <- c("Walmart Inc.","Amazon.com,Inc.","Apple Inc.","American Test Company for Consulting")
Expected:
Company_name_clean <- c("walmart","amazon.com","apple","test company")
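One possible direction, offered as a sketch rather than a benchmarked answer: most of the cost above likely comes from calling the function once per group, so the same steps can be written as vectorized regular-expression replacements, run once over the distinct names, and joined back. The names `clean_strings_vec`, `dt` and `lookup` are made up for this illustration, and the entries of `search_for` are assumed to be valid regex fragments. Like the original function, it replaces punctuation with spaces, so "Amazon.com,Inc." becomes "amazon com" rather than the "amazon.com" listed under Expected.

```r
library(data.table)
library(stringr)

search_for_default <- c("inc","corp","co","llc","se","\\&","holding","professionals",
                        "services","international","consulting","the","for")

# Hypothetical vectorized variant of clean_strings(): takes a whole character vector.
clean_strings_vec <- function(x, search_for = search_for_default) {
  # Match a stop word only when it is a whole space-delimited token.
  stop_pattern <- paste0("(?<!\\S)(", paste(search_for, collapse = "|"), ")(?!\\S)")
  out <- tolower(str_replace_all(x, "[:punct:]", " "))                  # punctuation -> space
  out <- str_replace_all(out, "(?<!\\S)(american|canadian)\\S*", " ")   # drop geographical tokens
  out <- str_replace_all(out, stop_pattern, " ")                        # drop legal entities / common words
  out <- str_squish(out)
  # Drop duplicate tokens; this step still loops, but only over distinct names.
  out <- vapply(str_split(out, " "),
                function(tk) paste(unique(tk), collapse = " "), character(1))
  out[is.na(x)] <- NA_character_                                        # keep missing names missing
  out
}

# Toy data taken from the example input, plus one repeated name.
dt <- data.table(COMPANY_NAME = c("Walmart Inc.", "Amazon.com,Inc.", "Apple Inc.",
                                  "American Test Company for Consulting", "Walmart Inc."))

# Clean each distinct name once, then copy the result back with an update join.
lookup <- unique(dt[, .(COMPANY_NAME)])
lookup[, COMPANY_NAME_clean := clean_strings_vec(COMPANY_NAME)]
dt[lookup, COMPANY_NAME_clean := i.COMPANY_NAME_clean, on = "COMPANY_NAME"]

print(dt$COMPANY_NAME_clean)
# "walmart" "amazon com" "apple" "test company" "walmart"
```

Whether this is actually faster on the real data depends on how many distinct names there are, so it is worth timing on a sample before running it on the full table.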