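How to remove leading determiners in spaCy noun chunks/entities
I am trying to bootstrap a first set of training data with a new entity type for use with spaCy's NER model. Most of my existing examples are single-word entities, but I am trying to merge them to get more specific concepts.
Take the following sample of acceptable entities and a test string (see the full code at the bottom):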
ent_list_sample = ['algorithm','data','engineering','software']
test_string = "We introduce a software-engineering inspired classification algorithm for dealing with bioinformatics data."
In this particular case, using the words in ent_list_sample with spaCy's EntityRuler and then merging the result with the doc.noun_chunks spans gives more acceptable entities.
print(doc.ents)
# (a software-engineering inspired classification algorithm,bioinformatics data)
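Question: how can I remove the determiner "a" from the first entity so that it becomes "software-engineering inspired classification algorithm"? How does spaCy handle leading determiners in noun chunks? And is EntityRuler a good fit for this bootstrapping task if most of my existing entities are single words?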
MWE code
import spacy
from spacy.pipeline import EntityRuler
from spacy.util import filter_spans
ent_list_sample = ['algorithm','data','engineering','software']
test_string = "We introduce a software-engineering inspired classification algorithm for dealing with bioinformatics data."
print("test_string:\n\t",test_string,"\n")
print("Default:\n-----------")
nlp = spacy.load("en")
doc = nlp(test_string)
print("Noun chunks:")
print(list(doc.noun_chunks),"\n")
print("Entities:")
print(doc.ents,"\n-------------------------------------------------------\n\n")
print("Adding patterns to EntityRuler:\n-----------")
patterns = []
for concept in ent_list_sample:
    doc = nlp.make_doc(concept)
    if len(doc) > 1:
        # Multi-token concept: match token by token, case-insensitively
        patterns.append({"label": "SCI","pattern":[{"LOWER":term.text.lower()} for term in doc]})
    else:
        # Single-token concept: exact phrase pattern
        patterns.append({"label": "SCI","pattern":doc.text.lower()})
ruler = EntityRuler(nlp)
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)
doc = nlp(test_string)
print("Entities:")
print(doc.ents)
print(list(ent.label_ for ent in doc.ents),"\n-------------------------------------------------------\n\n")
print("Merge entities with retokenizer:\n-----------")
spans = list(doc.ents) + list(doc.noun_chunks)
spans = filter_spans(spans)
list(doc.noun_chunks)
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)
print("Entities:")
print(doc.ents)
print(list(ent.label_ for ent in doc.ents))
MWE output
test_string:
We introduce a software-engineering inspired classification algorithm for dealing with bioinformatics data.
Default:
-----------
Noun chunks:
[We,a software-engineering inspired classification algorithm,bioinformatics data]
Entities:
()
-------------------------------------------------------
Adding patterns to EntityRuler:
-----------
Entities:
(software,engineering,algorithm,data)
['SCI','SCI','SCI']
-------------------------------------------------------
Merge entities with retokenizer:
-----------
Entities:
(a software-engineering inspired classification algorithm,bioinformatics data)
['SCI','SCI']
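One possible way to drop the leading "a" is to trim each noun chunk before merging. The following is only a minimal sketch, not a confirmed spaCy recipe: the helper names count_leading_determiners and strip_leading_determiners are hypothetical, and it assumes that leading determiners carry the coarse part-of-speech tag "DET" in Token.pos_. To keep it runnable without a model, it uses stand-in tokens with hand-assigned tags instead of a parsed Doc.

```python
from collections import namedtuple


def count_leading_determiners(tokens):
    """Return how many tokens at the start of `tokens` are determiners.

    Works on any sequence whose items expose a `pos_` attribute,
    which is the case for the tokens of a spaCy Span.
    """
    count = 0
    for token in tokens:
        if token.pos_ != "DET":
            break
        count += 1
    return count


def strip_leading_determiners(span):
    """Slice off leading determiners; the result may be empty."""
    return span[count_leading_determiners(span):]


# Stand-in tokens (hypothetical, tags assigned by hand) that mimic the
# first noun chunk from the MWE output.
FakeToken = namedtuple("FakeToken", ["text", "pos_"])
chunk = [
    FakeToken("a", "DET"),
    FakeToken("software", "NOUN"),
    FakeToken("-", "PUNCT"),
    FakeToken("engineering", "NOUN"),
    FakeToken("inspired", "VERB"),
    FakeToken("classification", "NOUN"),
    FakeToken("algorithm", "NOUN"),
]

trimmed = strip_leading_determiners(chunk)
print([token.text for token in trimmed])
# ['software', '-', 'engineering', 'inspired', 'classification', 'algorithm']
```

In the MWE, this would presumably be applied to each element of doc.noun_chunks before the call to filter_spans, since slicing a spaCy Span returns another Span. Chunks that become empty after trimming (for example, a chunk that consists only of a determiner) would need to be skipped before retokenizer.merge.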