Merge two dataframes on a string column whose values contain wildcards, like in SQL - Python


I want to merge two dataframes on a string column whose values contain wildcards, the way we would in SQL.

Example:

import pandas as pd

df1 = pd.DataFrame({'A': ["He eat an apple in his office.","There are many apples on the tree."],'B': [1,2]})
df2 = pd.DataFrame({'A': ["apple*tree","apple*pie"],'C': [4,9]})

df1
                                    A  B
0      He eat an apple in his office.  1
1  There are many apples on the tree.  2

df2
            A  C
0  apple*tree  4
1   apple*pie  9


pd.merge(df1,df2,on = ['A']) 

# What it gives me :

Empty DataFrame
Columns: [A, B, C]
Index: []


# What I want:
                                    A  B  C
0  There are many apples on the tree.  2  4

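I want to join the two dataframes so that "apple*tree" from df2 matches the sentence "There are many apples on the tree." from df1.

Can you help me do this?

I found the function fnmatch.fnmatch(string, pattern), but can I use it to do a merge in this case?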

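Solution

This can be done by using apply to search every row of df1 for each pattern in df2. The running time is O(n*m), where n is the number of rows in df1 and m is the number of rows in df2. That is not very efficient, but it is fine for small dataframes.

Once we have identified the matches between df1 and df2, we can merge the two dataframes. After that, we just need to clean up the result and drop the columns we don't need.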
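
One detail worth checking first: fnmatch.fnmatch tests the pattern against the whole string, so a pattern such as "apple*tree" has to be wrapped in "*" before it behaves like a "contains" search. A minimal sketch using the sentence from the question (the variable names here are only for illustration):

```python
import fnmatch

sentence = "There are many apples on the tree."

# The bare pattern must match the entire string, so this fails:
# the sentence does not start with "apple" or end with "tree".
bare_match = fnmatch.fnmatch(sentence, "apple*tree")

# Wrapping the pattern in "*" lets it match anywhere in the sentence.
wrapped_match = fnmatch.fnmatch(sentence, "*apple*tree*")

# The second pattern from df2 does not occur in this sentence at all.
pie_match = fnmatch.fnmatch(sentence, "*apple*pie*")

print(bare_match, wrapped_match, pie_match)
```

Note that fnmatch.fnmatch normalizes case in a platform-dependent way; fnmatch.fnmatchcase is the variant to use if the comparison should always be case-sensitive.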

Code:

import pandas as pd
import fnmatch

df1 = pd.DataFrame({'A': ["He eat an apple in his office.","There are many apples on the tree."],'B': [1,2]})
df2 = pd.DataFrame({'A': ["apple*tree","apple*pie"],'C': [4,9]})

def wildcard_search(pattern):
    # Wrap the pattern in "*" so it can match anywhere in the sentence.
    # Comment this line out to require the pattern to match the whole string.
    pattern = "*" + pattern + "*"
    # Test the pattern against every value of A in df1 (boolean Series)
    matching = df1['A'].apply(lambda x: fnmatch.fnmatch(x, pattern))
    # idxmax() returns the index of the first True value
    first_match = matching.idxmax()
    # If nothing matched, idxmax() points at the first False instead.
    # Check for this.
    if not matching.loc[first_match]:
        return None
    # Only the first matching row of df1 is returned
    return df1.loc[first_match, 'A']

# Using df2's patterns, search through df1. Record the values found.
df2['merge_key'] = df2['A'].apply(wildcard_search)

# Merge the two dataframes on merge_key (df2) and A (df1)
res = df2.merge(
    df1, left_on='merge_key', right_on='A',
    suffixes=("_x", "")  # Don't add a suffix to df1's columns
)
# Reorder columns, drop the ones we don't need
res = res[['A', 'B', 'C']]
print(res)

This answer is adapted from this post.
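
The code above keeps only the first df1 row that matches each pattern. If every matching pair is wanted, which is closer to what a SQL join on LIKE returns, one possible variant is a cross join followed by a filter. This is a sketch and not part of the original answer: it assumes pandas 1.2 or later (for how='cross'), and the names pairs, mask and res_all are made up for illustration. It is still O(n*m) and builds all n*m pairs in memory.

```python
import fnmatch
import pandas as pd

df1 = pd.DataFrame({
    'A': ["He eat an apple in his office.", "There are many apples on the tree."],
    'B': [1, 2],
})
df2 = pd.DataFrame({'A': ["apple*tree", "apple*pie"], 'C': [4, 9]})

# Pair every df1 row with every df2 pattern (n*m rows).
# df2's A column is renamed so it does not clash with df1's A.
pairs = df1.merge(df2.rename(columns={'A': 'pattern'}), how='cross')

# Keep only the pairs where the sentence contains the wildcard pattern.
# fnmatchcase is used so the comparison is case-sensitive on every platform.
mask = [
    fnmatch.fnmatchcase(text, "*" + pattern + "*")
    for text, pattern in zip(pairs['A'], pairs['pattern'])
]
res_all = pairs.loc[mask, ['A', 'B', 'C']].reset_index(drop=True)
print(res_all)
```

With the sample data this leaves the single row the question asks for; with larger data, a sentence that matches several patterns would appear once per matching pattern.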
