How to do fuzzy matching against long sentences
Suppose I have the following data frame:
ID  CompanyName      JobDescription
1   Green Grass LLC  "In the centre of Green Grass area..."
2   Johnny Inc.      "Johnny is currently looking for data analist that..."
3   Liamloy          "LiamLoy Corp. is established in New York..."
4   KaasKan          "In the forest we are walking..."
My main goal is to remove each CompanyName from its JobDescription. The desired output is:
ID  CompanyName      JobDescription
1   Green Grass LLC  "In the centre of area..."
2   Johnny Inc.      "is currently looking for data analist that..."
3   Liamloy          "is established in New York..."
4   KaasKan          "In the forest we are walking"
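I have tried word-tokenizing JobDescription (splitting each sentence into words) and applying fuzzy matching to detect and remove the matches. However, this was not very successful. For example, when tokenizing the third JobDescription, "Liamloy" is compared with "LiamLoy" and with "Corp." separately. Maybe this approach is not ideal; I don't know at this point. I am wondering whether any of you would be willing to share your opinion and enlighten me on how to successfully remove each CompanyName from its JobDescription.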
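To make the token-level problem concrete, here is a minimal sketch. The fuzzy-matching library used in the question is not named, so the standard library's difflib.SequenceMatcher stands in for it here as an assumption, and the helper name similarity is hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio between two strings, from 0.0 to 1.0 (case-sensitive)."""
    return SequenceMatcher(None, a, b).ratio()

company = "Liamloy"
tokens = "LiamLoy Corp. is established in New York...".split()

# Score the company name against every token of the description.
for token in tokens:
    print(f"{token!r}: {similarity(company, token):.2f}")

# Lowercasing both sides first turns the near match into an exact one.
print(similarity(company.lower(), tokens[0].lower()))  # 1.0
```

Each token is scored in isolation, so "LiamLoy" scores high but "Corp." scores low and survives the removal, which matches the behaviour described above.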
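Solutions
If you do not expect the words in the company name to be swapped around, I would suggest using the built-in Python library difflib to find the common parts of the two strings and replace them with a mask.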
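import difflib

def find_matching_spans(a, b, min_match=3, max_mismatch=1):
    """Find the spans in the string b that are similar to the string a."""
    prev_match = 0   # length of the last completed run of matching characters
    match = 0        # length of the current run of matching characters
    mismatch = 0     # consecutive non-matching characters seen so far
    i = 0            # current position in b
    span_start = 0
    prev_start = 0
    span_end = 0
    spans = []

    def add_span():
        # Record the pending span, merging it into the previous one if they nearly touch.
        if prev_match > min_match:
            if spans and spans[-1][-1] >= prev_start - 2:
                spans[-1][-1] = span_end
            else:
                spans.append([prev_start, span_end])

    # Compare the two strings character by character, ignoring case.
    for item in difflib.ndiff(a.lower(), b.lower()):
        if item[0] == ' ' and item[2] != ' ':
            # A non-space character common to both strings.
            match += 1
            mismatch = 0
            if match == 1:
                span_start = i
        elif item[0] == '+' or item[2] == ' ':
            # A character found only in b, or a space: the current run ends.
            if match > min_match:
                add_span()
                prev_start = span_start
                prev_match = match
                span_end = i
            match = 0
            mismatch += 1
            if mismatch > max_mismatch:
                add_span()
                prev_match = 0
        # Characters found only in a (prefix '-') need no handling.
        if item[0] in {' ', '+'}:
            i += 1  # only characters present in b advance the position
    return spans

def replace_spans(text, spans, replacement):
    # Sentinel spans at both ends let one loop copy the text between the spans.
    spans = [[0, 0]] + spans + [[len(text), len(text)]]
    parts = []
    for i in range(1, len(spans)):
        parts.append(text[spans[i-1][1]:spans[i][0]])
        if i < len(spans) - 1:
            parts.append(replacement)
    return ''.join(parts)

def replace_name(a, b, replacement='XXX'):
    # Repeat until the text stops changing, so that every occurrence is masked.
    b_prev = None
    while b_prev != b:
        spans = find_matching_spans(a, b)
        b_prev = b
        b = replace_spans(b, spans, replacement)
    return b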
It works like this:
print(replace_name("Green Grass LLC","In the centre of Green Grass area..."))
print(replace_name("Johnny Inc.","Johnny is currently looking for data analist that..."))
print(replace_name("Liamloy","LiamLoy Corp. is established in New York..."))
print(replace_name("KaasKan","In the forest we are walking..."))
and produces the output:
In the centre of XXX area...
XXX is currently looking for data analist that...
XXX Corp. is established in New York...
In the forest we are walking...
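For readers unfamiliar with difflib.ndiff, the short sketch below shows what it yields when given two plain strings, which is what the span finder iterates over. The variable names are illustrative only and are not part of the answer's code.

```python
import difflib

name = "liamloy"
text = "liamloy corp."

# Given two strings, ndiff compares them character by character.
# Each item is a two-character prefix followed by one character:
#   "  x" -> x is in both strings
#   "+ x" -> x is only in the second string
#   "- x" -> x is only in the first string
delta = list(difflib.ndiff(name, text))
for item in delta:
    print(repr(item))

# Items tagged ' ' or '+' are exactly the characters of `text`, in order.
rebuilt = ''.join(item[2] for item in delta if item[0] in ' +')
print(rebuilt == text)  # True
```

This is why a position counter that advances only on ' ' and '+' items can be used as an index into the second string.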
Why not use a regular expression?
import re

def replace_company_name(company_name, text):
    # re.escape keeps characters such as "." in "Inc." from being read as regex syntax
    sanitized_text = re.sub(re.escape(company_name), '', text)
    return sanitized_text
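As a quick check, the sketch below applies a literal, case-sensitive substitution of this kind to the four sample rows from the question. The names rows and remove_literal_name are hypothetical and exist only for this illustration.

```python
import re

# (CompanyName, JobDescription) pairs copied from the question.
rows = [
    ("Green Grass LLC", "In the centre of Green Grass area..."),
    ("Johnny Inc.", "Johnny is currently looking for data analist that..."),
    ("Liamloy", "LiamLoy Corp. is established in New York..."),
    ("KaasKan", "In the forest we are walking..."),
]

def remove_literal_name(company_name, text):
    """Remove exact, case-sensitive occurrences of the full company name."""
    return re.sub(re.escape(company_name), '', text)

# Collect the company names whose description comes back untouched.
unchanged = [name for name, text in rows if remove_literal_name(name, text) == text]
print(unchanged)
```

None of the descriptions contains its company name verbatim: the first two lack the suffix, and the third differs in case. Every row therefore comes back unchanged, which is what motivates the refinements that follow.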
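Because of the Liamloy example, it sounds like you also need to account for company-name suffixes such as Corp.
One way to handle this is to keep a constant holding a set of common company-name suffixes. Note also that I use the ignore-case flag: in the Liamloy row, the company name is Liamloy, while the job description has LiamLoy. The capitalization of the suffix may vary as well (INC, Inc, inc, etc.).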
COMPANY_NAME_POSTFIXES = '|'.join(['INC', 'CORP', 'LLC', 'LTD'])

def replace_company_name(company_name, text):
    # 1. Strip any postfix from the company name, e.g. "Green Grass LLC." -> "Green Grass"
    company_name_postfix_regex = rf'\b(?:{COMPANY_NAME_POSTFIXES})\b\.?'
    sanitized_company_name = re.sub(company_name_postfix_regex, '', company_name, flags=re.IGNORECASE).strip()
    # 2. Remove every instance of the sanitized company name, optionally followed by a space and a postfix
    search_string = rf'{re.escape(sanitized_company_name)}(?:\s?{company_name_postfix_regex})?'
    sanitized_text = re.sub(search_string, '', text, flags=re.IGNORECASE)
    return sanitized_text
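The approach above has the side effect of also removing instances of those words that are not being used as the company name. For example, for Green Grass LLC: "In the centre of the Green Grass area, there is a lot of green grass that is cared for" -> "In the centre of the area, there is a lot of that is cared for".
If this side effect is undesirable, you will need to either restrict the cleanup to occurrences where the company name is capitalized, or compute and pass in an array of company names.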
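The first option could be sketched as follows. This is one possible reading of "capitalized", assumed here to mean that every word of the name must start with an uppercase letter while the remaining letters may differ in case. The function name remove_capitalized_name is hypothetical, and the scoped inline flag (?i:...) requires Python 3.6 or later.

```python
import re

def remove_capitalized_name(name, text):
    """Remove occurrences of `name` in which every word starts with an uppercase letter."""
    word_patterns = []
    for word in name.split():
        first, rest = word[0].upper(), word[1:]
        # The first letter must be uppercase; the rest is matched case-insensitively.
        tail = f'(?i:{re.escape(rest)})' if rest else ''
        word_patterns.append(re.escape(first) + tail)
    pattern = r'\b' + r'\s+'.join(word_patterns) + r'\b'
    cleaned = re.sub(pattern, '', text)
    # Collapse the double space a removal leaves behind.
    return re.sub(r' {2,}', ' ', cleaned).strip()

print(remove_capitalized_name(
    "Green Grass",
    "In the centre of Green Grass area there is green grass"))
# In the centre of area there is green grass
```

The lowercase "green grass" is left alone because the pattern as a whole is case-sensitive. The sketch assumes the suffix has already been stripped from the name and that the name starts and ends with a word character.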