Fuzzy matching on long sentences

How to approach fuzzy matching on long sentences

Suppose I have the following data frame:

ID       CompanyName         JobDescription
1        Green Grass LLC     "In the centre of Green Grass area..."
2        Johnny Inc.         "Johnny is currently looking for data analist that..."
3        Liamloy             "LiamLoy Corp. is established in New York..."
4        KaasKan             "In the forest we are walking..."

My main goal is to remove the CompanyName from each JobDescription. The desired output is:

ID       CompanyName         JobDescription
1        Green Grass LLC     "In the centre of area..."
2        Johnny Inc.         "is currently looking for data analist that..."
3        Liamloy             "is established in New York..."
4        KaasKan             "In the forest we are walking"

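I have tried word-tokenizing the JobDescription (splitting the sentence into words) and applying fuzzy matching to detect and remove the matches. However, this has not been very successful. For example, when the third JobDescription is tokenized, "Liamloy" is compared against "LiamLoy" and against "Corp." separately. Perhaps this approach is not ideal; I don't know at this point. I was wondering whether any of you would be willing to share your thoughts and enlighten me on how to successfully remove the CompanyName from each JobDescription.

Solution

If you don't expect the words in the company name to be reordered, I would suggest using Python's built-in difflib library to find the parts the two strings have in common and replace them with a mask: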

import difflib


def find_matching_spans(a,b,min_match=3,max_mismatch=1):
    """ Find the spans in the string b that are similar to the string a"""
    prev_match = 0
    match = 0
    mismatch = 0
    i = 0
    span_start = 0
    prev_start = 0
    span_end = 0
    spans = []
    common = []
    
    def add_span():
        if prev_match > min_match:
            if spans and spans[-1][-1] >= prev_start - 2:
                spans[-1][-1] = span_end
            else:
                spans.append([prev_start,span_end])
    
    for item in difflib.ndiff(a.lower(),b.lower()):
        if item[0] == ' ' and item[2] != ' ':
            match += 1
            mismatch = 0
            if match == 1:
                span_start = i
                common = []
            common.append(item[2])
        elif item[0] == '+' or item[2] == ' ':
            if match > min_match:
                add_span()
                prev_start = span_start
                prev_match = match
                span_end = i
            match = 0
            mismatch += 1
            if mismatch > max_mismatch:
                add_span()
                prev_match = 0
        elif item[0] == '-':
            pass
        if item[0] in {' ','+'}:
            i += 1
    return spans


def replace_spans(text,spans,replacement):
    # pad with sentinel spans so the text before the first and after the last span is kept
    spans = [[0,0]] + spans + [[len(text),len(text)]]
    parts = []
    for i in range(1,len(spans)):
        parts.append(text[spans[i-1][1]:spans[i][0]])
        if i < len(spans) - 1:
            parts.append(replacement)
    return ''.join(parts)


def replace_name(a,b,replacement='XXX'):
    # repeat until a pass leaves the text unchanged, so repeated mentions are masked too
    b_prev = None
    while b_prev != b:
        spans = find_matching_spans(a,b)
        b_prev = b
        b = replace_spans(b,spans,replacement)
    return b

It works like this:

print(replace_name("Green Grass LLC","In the centre of Green Grass area..."))
print(replace_name("Johnny Inc.","Johnny is currently looking for data analist that..."))
print(replace_name("Liamloy","LiamLoy Corp. is established in New York..."))
print(replace_name("KaasKan","In the forest we are walking..."))

and produces the output:

In the centre of XXX area...
XXX is currently looking for data analist that...
XXX Corp. is established in New York...
In the forest we are walking...
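
For comparison, the word-level idea described in the question can also be sketched with difflib, lower-casing both sides before scoring so that "Liamloy" and "LiamLoy" count as the same word. This is only an illustration under stated assumptions: the helper name `drop_fuzzy_name_tokens` and the 0.8 threshold are hypothetical choices, not part of the answer above, and similarity is measured with `difflib.SequenceMatcher.ratio()`.

```python
import difflib


def drop_fuzzy_name_tokens(company_name, description, threshold=0.8):
    """Drop description words that closely match any word of the company name.

    Hypothetical helper; the 0.8 threshold is an assumption to tune on real data.
    """
    name_tokens = [token.lower() for token in company_name.split()]
    kept = []
    for token in description.split():
        # Best similarity between this word and any company-name word.
        best = max(
            (difflib.SequenceMatcher(None, token.lower(), name).ratio()
             for name in name_tokens),
            default=0.0,
        )
        if best < threshold:
            kept.append(token)
    return " ".join(kept)


print(drop_fuzzy_name_tokens("Green Grass LLC", "In the centre of Green Grass area..."))
print(drop_fuzzy_name_tokens("Liamloy", "LiamLoy Corp. is established in New York..."))
```

A sketch like this leaves postfixes such as "Corp." in place unless they are part of the CompanyName, and splitting on whitespace normalizes the spacing of the description.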

Why not use a regular expression?

import re


def replace_company_name(company_name,text):
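    # re.escape keeps characters such as '.' in "Johnny Inc." from acting as regex wildcards
    sanitized_text = re.sub(re.escape(company_name),'',text)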
    return sanitized_text

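Because of the Liamloy example, it sounds like you also need to account for company-name postfixes such as Corp.

One way to handle this is to keep a constant holding a set of common company-name postfixes. Note also that I use the ignore-case flag, because in the Liamloy row the company name is Liamloy while the job description says LiamLoy. The postfix may also be capitalized differently (INC, Inc, inc, etc.).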

COMPANY_NAME_POSTFIXES = '|'.join(['INC','CORP','LLC','LTD'])


def replace_company_name(company_name,text):

    # 1. strip any postfix from the company name, e.g. Green Grass LLC. -> Green Grass
    #    (\b keeps "inc" inside a longer word from being treated as a postfix)
    company_name_postfix_regex = rf'\b(?:{COMPANY_NAME_POSTFIXES})\b\.?'
    sanitized_company_name = re.sub(company_name_postfix_regex,'',company_name,flags=re.IGNORECASE).strip()
    # 2. remove every instance of the sanitized company name, optionally followed by a space and a postfix
    search_string = rf'{re.escape(sanitized_company_name)}(?:\s?{company_name_postfix_regex})?'
    sanitized_text = re.sub(search_string,'',text,flags=re.IGNORECASE)
    return sanitized_text

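The approach above has a side effect: it also removes occurrences of the same words where they are not used as the company name. For example, for Green Grass LLC: "In the centre of Green Grass area, there is a lot of green grass being tended" -> "In the centre of area, there is a lot of being tended".

If this side effect is undesirable, you will need to either restrict the cleanup to places in the job description where the company name is capitalized, or compute and pass in an array of company names.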
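
One way the second option might look is sketched below. It assumes the caller has already computed the name variants exactly as they appear in the text; `remove_company_names` and its `name_variants` argument are hypothetical names used only for illustration. Matching is case-sensitive, so a lower-case "green grass" is left untouched.

```python
import re


def remove_company_names(name_variants, text):
    """Remove exact-case occurrences of any name variant from text.

    Hypothetical helper; the caller is assumed to supply the variants.
    """
    if not name_variants:
        return text
    # Try longer variants first so "LiamLoy Corp." wins over "LiamLoy".
    ordered = sorted(name_variants, key=len, reverse=True)
    alternation = "|".join(re.escape(variant) for variant in ordered)
    # Lookarounds keep matches on word boundaries; \s? swallows one trailing space.
    pattern = rf"(?<!\w)(?:{alternation})(?!\w)\s?"
    # No re.IGNORECASE here, so lower-case "green grass" is not treated as the name.
    return re.sub(pattern, "", text)


text = "In the centre of Green Grass area, there is a lot of green grass being tended"
print(remove_company_names(["Green Grass LLC", "Green Grass"], text))
print(remove_company_names(["LiamLoy Corp.", "LiamLoy"],
                           "LiamLoy Corp. is established in New York..."))
```

The trade-off is that a name written with different capitalization in the description is missed unless that spelling is included among the variants.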
