Python: merging on multiple substrings

How to merge two dataframes on multiple substrings in Python

I have the following dataframes:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'Name':['Jon','Alex','Jenny','Rick','Joe'],'Color':['Red','Blue','Green','Black','Yellow'],'Tel':['3745 569','785 985','635 565a','987',np.nan]})
df2 = pd.DataFrame({'Phone':['987 856','985',np.nan,'569','459 56']})

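I want to:

1. Find the common substring values stored in the df1['Tel'] and df2['Phone'] columns.
2. Left merge onto df2, so that the output has Phone plus the df1['Tel'] value that shares the common substring and the corresponding df1['Color'].

Expected result:

The code I found and edited only works when there are no NaN values, and it cannot check whether the key is a substring, which is what my case needs: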

a = ['Tel','Phone']
b = [1,2]
rhs ={}

for x,y in zip(a,b):
    rhs[y] = (df1[x].apply(lambda x: df2[df2['Phone'].str.find(x).ge(0)]['colour']).bfill(axis=1).iloc[:,0])

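Solution

So, if I understand correctly, you want to merge on a common substring. The code below does that, although not very elegantly. I want to point out the potential pitfall explicitly: the code assumes that the longest common substring is the match. The real match may be shorter, and there may be several candidates with the same common length; the code does not handle either case. The longest-common-substring function is taken from RosettaCode (link in the code comment).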

import pandas as pd
import numpy as np

# https://rosettacode.org/wiki/Longest_common_substring#Python
def longestCommon(s1,s2):
    len1,len2 = len(s1),len(s2)
    ir,jr = 0,-1
    for i1 in range(len1):
        i2 = s2.find(s1[i1])
        while i2 >= 0:
            j1,j2 = i1,i2
            while j1 < len1 and j2 < len2 and s2[j2] == s1[j1]:
                if j1-i1 >= jr-ir:
                    ir,jr = i1,j1
                j1 += 1; j2 += 1
            i2 = s2.find(s1[i1],i2+1)
    return len(s1[ir:jr+1])

df1 = pd.DataFrame({'Name':['Jon','Alex','Jenny','Rick','Joe'],'Color':['Red','Blue','Green','Black','Yellow'],'Tel':['3745 569','785 985','635 565a','987',np.nan]})
df2 = pd.DataFrame({'Phone':['987 856','985',np.nan,'569','459 56']})

# left merge df2 to df1 via longest matching substring Tel to Phone
mrglst = []
for phone in df2['Phone']:
    lgstr = 0
    lgtel = ''
    lgcol = ''
    for tidx,trow in df1.iterrows():
        if str(phone) != 'nan' and str(trow['Tel']) != 'nan':
            thisstrl = longestCommon(phone,trow['Tel'])
            if thisstrl > lgstr:
                lgstr = thisstrl
                lgtel,lgcol = trow['Tel'],trow['Color']
    mrglst.append([phone,lgtel,lgcol])
    
dfmrg = pd.DataFrame(mrglst,columns=['Phone','Tel','Color'])
print(dfmrg)

This produces

     Phone       Tel  Color
0  987 856       987  Black
1      985   785 985   Blue
2      NaN                 
3      569  3745 569    Red
4   459 56  3745 569    Red

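This is almost the desired output, except for the last row: ' 56' (with the leading space) in Phone matches ' 56' in Tel. That is technically correct, but you probably want matches on digits only. In that case it is better to clean the phone numbers before matching (one of the numbers also has a trailing 'a', which is why I went with plain string matching).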
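If digits-only matching is what you want, here is a minimal sketch of the cleaning step, combined with a way to see ties instead of silently taking the first one. The helper names (digits_only, lcs_len, best_matches) are mine and not part of the code above, and I swapped the RosettaCode function for difflib.SequenceMatcher from the standard library. Treat it as one possible approach under those assumptions, not a definitive one.

```python
import re
from difflib import SequenceMatcher

import numpy as np
import pandas as pd


def digits_only(value):
    """Return only the digits of value, or '' for NaN/None (hypothetical helper)."""
    if pd.isna(value):
        return ''
    return re.sub(r'\D', '', str(value))


def lcs_len(a, b):
    """Length of the longest common substring of a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return matcher.find_longest_match(0, len(a), 0, len(b)).size


def best_matches(phone, tels):
    """Return (best length, positions of every tel tied for that length)."""
    p = digits_only(phone)
    lengths = [lcs_len(p, digits_only(t)) for t in tels]
    best = max(lengths, default=0)
    if best == 0:
        return 0, []
    return best, [i for i, n in enumerate(lengths) if n == best]


tels = ['3745 569', '785 985', '635 565a', '987', np.nan]
phones = ['987 856', '985', np.nan, '569', '459 56']

for phone in phones:
    length, hits = best_matches(phone, tels)
    print(phone, length, [tels[i] for i in hits])
```

Note that stripping the spaces can also create matches that did not exist before: with the sample data, '987 856' becomes '987856', which now shares '785' with '785 985' as well as '987' with '987', so that row becomes a tie. Returning all tied positions lets you decide how to break ties (or add a minimum length) rather than relying on row order.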
