How to vectorize longest common substring across data.table columns in R
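How can I create a function that quickly computes the number of characters in the longest common substring, or returns the longest common substring itself, between two or more columns of a large data table in R?
I adapted an answer to the question "Find length of overlap in strings", but ran into three problems: 1) applying it across vectors, because when I use sapply to build a new result column it fails on whitespace and other string features; 2) applying it across more than 2 columns; and 3) the given answer does not include spaces in potential matches, which I want it to do. The function is also slow, and I want to apply it to large data.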
Create sample data:
sampdata <- data.frame(
  str1 = c("Doug Olivas","GRANT MANAGEMENT LLC","LUNA VAN DERESH","wendy t marzardo","AMIN NYGUEN COMPANY LLC","GERARDO CONTRARAS","miguel martinez","albert marks porter"),
  str2 = c("doug olivas","miguel grant","LUNA VAN DERESH MANAGEMENT LLC","marzardo","amin nyguen llc","gerardo contraras","miggy martinez","albert"),
  str3 = c("Martin Olivas","GRANT PROPERTIES","luna company","wendy marzardo","the company of amin nyguen llc","gerardo c","miguel t martinez",""),
  stringsAsFactors = FALSE  # keep the columns as character in R versions before 4.0
)
Desired behavior 1 of the made-up function "lcsfoo":
#option type="nchar" to return number of characters INCLUDING SPACES,IGNORING CASE in max common substring
sampdata$desired_LCSnchar <- lcsfoo(sampdata$str1,sampdata$str2,sampdata$str3,type="nchar")
#option type="str" to return the string INCLUDING SPACES,IGNORING CASE of the longest common substring between the columns
sampdata$desired_LCSstr <- lcsfoo(sampdata$str1,sampdata$str2,sampdata$str3,type="str")
# Desired result 1: for the sample data, the calls above would return the following
sampdata$desired_LCSnchar <- c(7,5,5,8,12,9,9,0)
sampdata$desired_LCSstr <- c(" olivas","grant","luna ","marzardo","amin nyguen ","gerardo c"," martinez","")
Ideally, lcsfoo would also accept a variable number of input columns (e.g. 2 columns here instead of the 3 above):
sampdata$str1str2_LCSnchar <- lcsfoo(sampdata$str1,sampdata$str2,type="nchar")
sampdata$str1str2_LCSstr <- lcsfoo(sampdata$str1,sampdata$str2,type="str")
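# Desired result 2: for the sample data, the calls above would return the following
sampdata$str1str2_LCSstr <- c("doug olivas","grant","luna van deresh","marzardo","amin nyguen ","gerardo contraras"," martinez","albert")
sampdata$str1str2_LCSnchar <- c(11,5,15,8,12,17,9,6)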
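To make the requested behavior concrete, here is a minimal base-R sketch of what lcsfoo might look like. It is an illustration under assumptions, not a tuned solution: the helper name lcs_one is hypothetical, ties are resolved in favor of the leftmost candidate in the shortest string, and the brute-force search is only reasonable for short strings such as names.

```r
# Hypothetical helper: longest common substring of one row's values.
# Case is ignored and spaces count as ordinary characters.
lcs_one <- function(strs) {
  if (anyNA(strs)) return(NA_character_)
  strs <- tolower(strs)
  lens <- nchar(strs)
  n <- min(lens)
  if (n == 0L) return("")
  shortest <- strs[which.min(lens)]
  # Try candidate lengths from longest to shortest; the first hit wins.
  for (len in n:1) {
    starts <- seq_len(n - len + 1L)
    cands <- unique(substring(shortest, starts, starts + len - 1L))
    for (cand in cands) {
      # fixed = TRUE matches the candidate literally, not as a regex
      if (all(grepl(cand, strs, fixed = TRUE))) return(cand)
    }
  }
  ""
}

# Sketch of lcsfoo: accepts any number of equal-length columns.
lcsfoo <- function(..., type = c("str", "nchar")) {
  type <- match.arg(type)
  cols <- lapply(unname(list(...)), as.character)
  n <- length(cols[[1]])
  # Row by row, collect the i-th value of every column and compare them
  res <- vapply(seq_len(n), function(i) lcs_one(vapply(cols, `[`, "", i)), "")
  if (type == "nchar") nchar(res) else res
}

lcsfoo(c("Doug Olivas", "albert marks porter"),
       c("doug olivas", "albert"), type = "str")
# returns c("doug olivas", "albert")
```

The search only enumerates substrings of the shortest string in each row, since a common substring can never be longer than that.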
I also need the function to work on large data:
library(data.table)
###Create sample big data from previous sampledata and apply on huge DT
samplist <- lapply(c(1:1000),FUN=function(x){sampdata})
bigsampdata <- rbindlist(samplist)
Desired function applied to the large data:
bigsampdata$desired_LCSnchar <- lcsfoo(bigsampdata$str1,bigsampdata$str2,bigsampdata$str3,type="nchar")
bigsampdata$desired_LCSstr <- lcsfoo(bigsampdata$str1,bigsampdata$str2,bigsampdata$str3,type="str")
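Since large tables like this one often repeat the same combinations of values, one possible way to cut the cost is to run the expensive function once per distinct row and map the results back. The sketch below shows that idea in base R. The helper name apply_on_unique_rows and the stand-in function toy are hypothetical, and the key construction assumes the carriage-return character never occurs in the data.

```r
# Hypothetical helper: evaluate `fun` once per distinct combination of
# inputs, then expand the results back to the original row order.
apply_on_unique_rows <- function(fun, ...) {
  cols <- lapply(unname(list(...)), as.character)
  # One key per row; "\r" is assumed never to occur in the data.
  # Note: NA and the literal string "NA" would share a key.
  key <- do.call(paste, c(cols, sep = "\r"))
  first <- !duplicated(key)
  uniq_cols <- lapply(cols, function(x) x[first])
  res <- do.call(fun, uniq_cols)      # computed on distinct rows only
  res[match(key, key[first])]         # map back to every original row
}

# Stand-in for an expensive vectorized function of two columns
toy <- function(a, b) nchar(a) + nchar(b)

x <- rep(c("ab", "cde"), 3)
y <- rep(c("x", "yz"), 3)
apply_on_unique_rows(toy, x, y)
# returns c(3, 5, 3, 5, 3, 5)
```

With a function like lcsfoo, the type option could be fixed through a small wrapper, for example function(...) lcsfoo(..., type = "nchar"), passed as the first argument.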