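How to solve fuzzy matching of n-grams in a dictionary
Given a variable-length string S and a dictionary D of n-grams N, I want to:
- Extract every N in S that matches under fuzzy-matching logic (to catch spelling mistakes)
- Extract all numbers in S
- Show the results in the same order as they appear in S
I have points 1 and 2 working, but my approach, which builds n-grams from S and fuzzy-matches them against the dictionary (plus number matching), does not preserve the order in which the items appear in S: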
from nltk import everygrams
import re

# NOTE: FuzzyDict is neither defined nor imported in this snippet; it is
# assumed to be a dict-like class that performs fuzzy key lookup.

string = "Hello everybody,today we have 2.000 cell phones here"
ngrams = list(everygrams(string.split(), 1, 4))
my_dict = {
    "brand": "ITEM_01", "model": "ITEM_02", "cell phone": "ITEM_04", "today": "ITEM_05"
}
results = []  # list with final results
d = FuzzyDict(my_dict)  # create the dictionary for fuzzy matching
for k in ngrams:
    candidate = ' '.join(k)
    print(f"Searching for {candidate}")
    try:
        # match the n-gram against the dictionary using fuzzy matching
        result = d[candidate]
        print(f"Found {result}")
        results.append(result)
    except KeyError:  # assumes FuzzyDict raises KeyError when nothing matches
        print("An exception occurred")
    # match numbers, including signs, currency symbols and thousands separators
    numbers = re.findall(r'(?:[+-]|\()?\$?\d+(?:,\d+)*(?:\.\d+)?\)?', candidate)
    # append the numbers to the list
    results.extend(numbers)
# NOTE: the original order is not kept!
# keep unique values, since this approach extracts several instances of the same item
myset = set(results)
results_unique = list(myset)
This should give me "ITEM_05 2.000 ITEM_04" (at the moment the order is arbitrary).
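One possible way to keep the order is to stop generating every n-gram up front and instead scan S from left to right, trying the longest n-gram first at each position and consuming the tokens that matched. The sketch below is only an illustration under several assumptions: it swaps FuzzyDict for the standard library's difflib.get_close_matches, the function name extract_in_order and the 0.8 cutoff are made up for this example, and tokens are split with a regex rather than string.split() so that "everybody,today" becomes two tokens. The number pattern is the one from the question.

```python
import re
from difflib import get_close_matches

# Number pattern taken from the question; \w+ catches every other token.
TOKEN_RE = re.compile(
    r"(?P<num>(?:[+-]|\()?\$?\d+(?:,\d+)*(?:\.\d+)?\)?)|(?P<word>\w+)"
)


def extract_in_order(text, ngram_labels, cutoff=0.8):
    """Return dictionary labels and numbers in the order they occur in text.

    Hypothetical helper: fuzzy matching is done with difflib, not FuzzyDict.
    """
    # Each token keeps its kind ("num" or "word") so numbers can be recognised later.
    tokens = [(m.group().lower(), m.lastgroup) for m in TOKEN_RE.finditer(text)]
    keys = list(ngram_labels)
    max_n = max((len(key.split()) for key in keys), default=1)

    results = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest n-gram starting at position i first.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tok for tok, _ in tokens[i:i + n])
            close = get_close_matches(candidate, keys, n=1, cutoff=cutoff)
            if close:
                results.append(ngram_labels[close[0]])
                i += n  # consume the matched tokens so they are not reported twice
                matched = True
                break
        if not matched:
            tok, kind = tokens[i]
            if kind == "num":
                results.append(tok)
            i += 1
    return results


my_dict = {
    "brand": "ITEM_01", "model": "ITEM_02", "cell phone": "ITEM_04", "today": "ITEM_05"
}
text = "Hello everybody,today we have 2.000 cell phones here"
print(extract_in_order(text, my_dict))
# ['ITEM_05', '2.000', 'ITEM_04']
```

Because the scan moves left to right and skips the tokens it has already matched, the results come out in the order of S and overlapping n-grams do not produce duplicates. If the same item can occur several times in S and only the first occurrence is wanted, list(dict.fromkeys(results)) removes repeats while keeping order, which set() does not. The cutoff probably needs tuning on real data.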