How to dynamically get the elements between two or more indices in Python without hard-coding index variables
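I am trying to POS-tag an input text and extract all the words between two or more "IN" tags. The idea is that if there is one "IN" tag, I extract from that tag's index to the end of the sentence. If there are two or more "IN" tags, I extract from one tag's index up to the next "IN" tag, splitting the sentence into groups of phrases. I wrote code for this: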
import nltk

def extractor(text):
    text = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(text)
    # print(pos_tagged)
    # Get tuple index of preposition
    indices = [i for i, tupl in enumerate(pos_tagged) if tupl[1] == 'IN']
    # print(indices)
    if len(indices) == 1:
        idx = indices[0]
        phrase = pos_tagged[idx:]
        words = [i[0] for i in phrase]
        comb_words = ' '.join(i for i in words)
        return comb_words
    else:
        idx1 = indices[0]
        idx2 = indices[1]
        phrase1 = pos_tagged[idx1:idx2]
        words1 = [i[0] for i in phrase1]
        comb_words1 = ' '.join(i for i in words1)
        phrase2 = pos_tagged[idx2:]
        words2 = [i[0] for i in phrase2]
        comb_words2 = ' '.join(i for i in words2)
        return comb_words1, comb_words2

extractor("hunger increases in the morning during workout")
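The output is as expected.
My only concern is that when the text contains two "IN" tags, I had to hard-code that case:
    idx1 = indices[0]
    idx2 = indices[1]
Done this way, a text with 10 "IN" tags would need 10 index variables. Is there a better way to handle this, so that the indices are handled dynamically according to the number of tags present in the input?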
Solution
I would use a generator.
import nltk

def extractor(text, tag='IN', max_level=None):
    text = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(text)
    indices = [i for i, tupl in enumerate(pos_tagged) if tupl[1] == tag]
    # remove the first index if it is 0 -- we don't want an empty phrase
    # (the `indices and` guard avoids an IndexError when no tag is found;
    # note that with prev_index = indices[0] below, this also skips a
    # phrase that starts at position 0)
    if indices and indices[0] == 0:
        indices.pop(0)
    # maybe we don't care about tags past the 2nd, or 5th, or 10th;
    # slicing with None just yields the whole list
    indices = indices[:max_level] + [len(pos_tagged)]
    # start of the first phrase; each later index ends the previous phrase
    prev_index = indices[0]
    for index in indices[1:]:
        words = pos_tagged[prev_index:index]
        prev_index = index
        yield ' '.join(word for word, _ in words)

list(extractor("hunger increases in the morning during workout"))
# ['in the morning', 'during workout']
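max_level limits the number of tags you care about. For example, if you want everything after the 5th tag to stay together in a single phrase, regardless of the tags it contains, call extractor(text, max_level=5).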
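To see the effect of max_level without installing NLTK, here is a minimal sketch that works on an already tagged sentence. The helper name split_at_tag and the hand-written tags are assumptions made for illustration; they stand in for the output of nltk.pos_tag and are not part of the answer above. Unlike the answer, this sketch does not drop a tag found at position 0.

```python
def split_at_tag(tagged, tag="IN", max_level=None):
    """Yield phrases that each start at a `tag` token (hypothetical helper)."""
    starts = [i for i, (_, t) in enumerate(tagged) if t == tag]
    # Keep at most max_level start positions, then close with the sentence end.
    bounds = starts[:max_level] + [len(tagged)]
    # Pair each boundary with the next one: (b0, b1), (b1, b2), ...
    for lo, hi in zip(bounds, bounds[1:]):
        yield " ".join(word for word, _ in tagged[lo:hi])


# Hand-written tags, standing in for the output of nltk.pos_tag.
tagged = [
    ("hunger", "NN"), ("increases", "VBZ"),
    ("in", "IN"), ("the", "DT"), ("morning", "NN"),
    ("during", "IN"), ("workout", "NN"),
    ("after", "IN"), ("sleep", "NN"),
]

print(list(split_at_tag(tagged)))
# ['in the morning', 'during workout', 'after sleep']
print(list(split_at_tag(tagged, max_level=2)))
# ['in the morning', 'during workout after sleep']
```

Pairing the boundary list with itself shifted by one (zip(bounds, bounds[1:])) is what removes the need for idx1, idx2, and so on: the number of phrases simply follows the number of tags found.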
Edit: if you also need the part before the first occurrence of the tag, initialize prev_index to 0 instead of indices[0].
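A sketch of that variant, again on a pre-tagged sentence so it runs without NLTK. The name split_with_head and the sample tags are hypothetical. Skipping empty slices here plays the same role as the index-0 check in the answer: a tag at the very start never produces an empty leading phrase.

```python
def split_with_head(tagged, tag="IN"):
    """Yield the words before the first `tag`, then one phrase per `tag`."""
    starts = [i for i, (_, t) in enumerate(tagged) if t == tag]
    prev_index = 0  # start from the beginning instead of the first tag
    for index in starts + [len(tagged)]:
        # Skip empty slices, e.g. when the sentence itself starts with `tag`.
        if index > prev_index:
            yield " ".join(word for word, _ in tagged[prev_index:index])
        prev_index = index


# Hand-written tags, standing in for the output of nltk.pos_tag.
tagged = [
    ("hunger", "NN"), ("increases", "VBZ"),
    ("in", "IN"), ("the", "DT"), ("morning", "NN"),
    ("during", "IN"), ("workout", "NN"),
]

print(list(split_with_head(tagged)))
# ['hunger increases', 'in the morning', 'during workout']
```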