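How to merge two DataFrames on a string column whose values contain wildcards, as in SQL - Python
I want to merge 2 DataFrames on a string column whose values contain wildcards, the way we would in SQL.
Example: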
import pandas as pd
df1 = pd.DataFrame({'A': ["He eat an apple in his office.","There are many apples on the tree."],'B': [1,2]})
df2 = pd.DataFrame({'A': ["apple*tree","apple*pie"],'C': [4,9]})
df1
A B
0 He eat an apple in his office. 1
1 There are many apples on the tree. 2
df2
A C
0 apple*tree 4
1 apple*pie 9
pd.merge(df1,df2,on = ['A'])
# What it gives me :
Empty DataFrame
Columns: [A,B,C]
Index: []
# What I want:
A B C
0 There are many apples on the tree. 2 4
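I want to join the two DataFrames so that "apple*tree" in df2 matches the sentence "There are many apples on the tree." in df1.
Can you help me do this?
I found the function fnmatch.fnmatch(string, pattern), but can I use it for a merge in this situation?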
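Solution
This can be done by using apply to search every row of df1 for each pattern in df2. The running time is proportional to O(n*m), where n is the number of rows in df1 and m is the number of rows in df2. That is not very efficient, but it is fine for small DataFrames.
Once the matches between df1 and df2 have been identified, the two DataFrames can be merged. After that, we only need to clean up the result and drop the unneeded columns.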
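One detail worth seeing in isolation is why the pattern gets wrapped in "*" before matching. fnmatch tests the pattern against the whole string, so a bare "apple*tree" does not match a sentence that merely contains it. A minimal sketch (the variable names are illustrative, not from the original answer):

```python
import fnmatch

sentence = "There are many apples on the tree."

# fnmatch matches the whole string, so the bare pattern fails:
# the sentence does not start with "apple" or end with "tree"
bare = fnmatch.fnmatch(sentence, "apple*tree")

# Wrapping the pattern in "*" turns it into a substring-style search
wrapped = fnmatch.fnmatch(sentence, "*apple*tree*")

print(bare, wrapped)
```

Note that fnmatch.fnmatch normalizes case according to the operating system, so it is case-insensitive on Windows; fnmatch.fnmatchcase is the variant that is always case-sensitive.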
Code:
import fnmatch
import pandas as pd

df1 = pd.DataFrame({'A': ["He eat an apple in his office.", "There are many apples on the tree."], 'B': [1, 2]})
df2 = pd.DataFrame({'A': ["apple*tree", "apple*pie"], 'C': [4, 9]})

def wildcard_search(pattern):
    # Wrap the pattern in "*" so it can match anywhere in the sentence.
    # Comment out this line to require the pattern to match the whole string.
    pattern = "*" + pattern + "*"
    # Test the pattern against every value of column A in df1
    matching = df1['A'].apply(lambda x: fnmatch.fnmatch(x, pattern))
    # No row matched: return None so the inner merge drops this pattern
    if not matching.any():
        return None
    # idxmax returns the index label of the first True, i.e. the first match
    first_match = matching.idxmax()
    return df1.loc[first_match, 'A']

# Using df2's patterns, search through df1 and record the value found
df2['merge_key'] = df2['A'].apply(wildcard_search)

# Merge the two DataFrames on merge_key (df2) and A (df1)
res = df2.merge(
    df1, left_on='merge_key', right_on='A',
    suffixes=("_x", "")  # Don't add a suffix to df1's columns
)

# Reorder columns and drop the unneeded ones
res = res[['A', 'B', 'C']]
print(res)
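The code above keeps only the first df1 row that matches each pattern. If every matching pair is needed, one possible alternative is to build all row pairs with a cross join and filter them with fnmatch. This is a sketch rather than part of the original answer: it assumes pandas 1.2 or later (for how="cross"), the names pairs, mask and all_matches are made up for illustration, and it still does O(n*m) work.

```python
import fnmatch
import pandas as pd

df1 = pd.DataFrame({
    "A": ["He eat an apple in his office.", "There are many apples on the tree."],
    "B": [1, 2],
})
df2 = pd.DataFrame({"A": ["apple*tree", "apple*pie"], "C": [4, 9]})

# Pair every df1 row with every df2 row (how="cross" needs pandas >= 1.2).
# df2's column A is renamed so it does not clash with df1's column A.
pairs = df1.merge(df2.rename(columns={"A": "pattern"}), how="cross")

# Keep the pairs where the sentence contains the wildcard pattern.
# fnmatchcase is used so matching is case-sensitive on every platform.
mask = [
    fnmatch.fnmatchcase(text, "*" + pattern + "*")
    for text, pattern in zip(pairs["A"], pairs["pattern"])
]
all_matches = pairs.loc[mask].drop(columns="pattern").reset_index(drop=True)
print(all_matches)
```

Because the cross join materializes n*m rows before filtering, this trades memory for simplicity and is only reasonable for small DataFrames.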
This answer is adapted from this post.