Dynamically getting the elements between two or more indices in Python without hard-coding the index variables

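I am trying to extract POS tags from an input text and pull out all the words between 2 or more "IN" tags. The idea is that if there is 1 "IN" tag, I extract from that tag's index to the end of the sentence. If there are 2 or more "IN" tags, I should extract from one tag's index up to the next "IN" tag, splitting the phrase into groups. I wrote code that does this: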

import nltk  # needs the NLTK tokenizer and POS-tagger data to be installed

def extractor(text):
    text = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(text)
    # Get the tuple indices of the prepositions ('IN' tags)
    indices = [i for i, tupl in enumerate(pos_tagged) if tupl[1] == 'IN']
    if len(indices) == 1:
        # One tag: take everything from the tag to the end of the sentence
        idx = indices[0]
        phrase = pos_tagged[idx:]
        words = [i[0] for i in phrase]
        comb_words = ' '.join(words)
        return comb_words

    else:
        # Hard-coded for the first two tags; raises IndexError when there are none
        idx1 = indices[0]
        idx2 = indices[1]
        phrase1 = pos_tagged[idx1:idx2]
        words1 = [i[0] for i in phrase1]
        comb_words1 = ' '.join(words1)

        phrase2 = pos_tagged[idx2:]
        words2 = [i[0] for i in phrase2]
        comb_words2 = ' '.join(words2)

        return comb_words1, comb_words2


extractor("hunger increases in the morning during workout")

The output is as expected. My only concern is that I had to hard-code the case where the text contains 2 "IN" tags: idx1 = indices[0] and idx2 = indices[1].

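Done this way, 10 "IN" tags would need 10 index variables. Is there a better way to solve this, so that the indices are handled dynamically based on the number of tags present in the input?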

Solution

I would use a generator.

import nltk

def extractor(text, tag='IN', max_level=None):
    text = nltk.word_tokenize(text)
    pos_tagged = nltk.pos_tag(text)

    indices = [i for i, tupl in enumerate(pos_tagged) if tupl[1] == tag]

    # remove the first index if it is 0 -- we don't want an empty phrase
    # (the `indices and` guard avoids an IndexError when no tag is found)
    if indices and indices[0] == 0:
        indices.pop(0)

    # maybe we don't care about tags past the 2nd, or 5th, or 10th;
    # slicing with None just yields the whole list
    indices = indices[:max_level] + [len(pos_tagged)]

    # the end of the previous phrase
    prev_index = indices[0]

    for index in indices[1:]:
        words = pos_tagged[prev_index:index]
        prev_index = index

        # `_` instead of `tag`, so the parameter name is not reused
        yield ' '.join(word for (word, _) in words)

list(extractor("hunger increases in the morning during workout"))
# ['in the morning', 'during workout']
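
The slicing logic does not depend on NLTK itself, so it can be tried on hand-tagged input. The following is a minimal sketch under that assumption: split_at_tag is a hypothetical name, the (word, tag) pairs are written by hand rather than produced by nltk.pos_tag, and zip pairs each tag position with the next one in place of the prev_index bookkeeping.

```python
def split_at_tag(tagged, tag="IN"):
    """Yield one phrase per occurrence of `tag` in a list of (word, tag) pairs."""
    starts = [i for i, (_, t) in enumerate(tagged) if t == tag]
    # Each phrase ends where the next one starts; the last one runs to the end.
    ends = starts[1:] + [len(tagged)]
    for start, end in zip(starts, ends):
        yield " ".join(word for word, _ in tagged[start:end])


# Hand-written tags (an assumption), so this runs without any NLTK data.
tagged = [
    ("hunger", "NN"), ("increases", "VBZ"), ("in", "IN"), ("the", "DT"),
    ("morning", "NN"), ("during", "IN"), ("workout", "NN"),
]
phrases = list(split_at_tag(tagged))
print(phrases)  # ['in the morning', 'during workout']
```

When no tag is present, starts is empty, zip produces nothing, and the generator yields no phrases.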

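max_level limits how many tags you care about. For example, if you want everything from the 5th tag onward merged into a single phrase, regardless of any later tags, call extractor(text, max_level=5).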
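
To see what max_level does to the slice boundaries, the truncation step can be looked at in isolation. This is only an illustration: boundaries is a hypothetical helper, and the tag positions and token count are made-up values.

```python
def boundaries(indices, length, max_level=None):
    """Return the slice boundaries that the generator walks over."""
    # indices[:None] keeps every index; indices[:k] keeps only the first k.
    return indices[:max_level] + [length]


tag_positions = [2, 5, 8, 11]  # hypothetical positions of the tag
n_tokens = 14                  # hypothetical number of tokens

print(boundaries(tag_positions, n_tokens))     # [2, 5, 8, 11, 14]
print(boundaries(tag_positions, n_tokens, 2))  # [2, 5, 14]
```

With max_level=2 the boundaries are 2, 5 and 14, so the second phrase covers tokens 5 through 13 and absorbs the later tags at positions 8 and 11.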

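Edit: if you also need the part before the first occurrence of the tag, initialize prev_index to 0 instead of indices[0] and loop over indices rather than indices[1:]; otherwise that leading part is merged into the first phrase.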
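
A variant that also returns the words before the first tag might look like the sketch below. It is not the answer's exact code: split_with_leading is a hypothetical name, the input is a hand-tagged list rather than nltk.pos_tag output, and empty slices are skipped explicitly instead of popping a leading index.

```python
def split_with_leading(tagged, tag="IN"):
    """Yield the words before the first `tag`, then one phrase per tag."""
    indices = [i for i, (_, t) in enumerate(tagged) if t == tag]
    prev_index = 0
    # Walk over every boundary, including the end of the sentence.
    for index in indices + [len(tagged)]:
        if index > prev_index:  # skip the empty slice when a tag sits at position 0
            yield " ".join(word for word, _ in tagged[prev_index:index])
        prev_index = index


# Hand-written tags (an assumption), so this runs without any NLTK data.
tagged = [
    ("hunger", "NN"), ("increases", "VBZ"), ("in", "IN"), ("the", "DT"),
    ("morning", "NN"), ("during", "IN"), ("workout", "NN"),
]
result = list(split_with_leading(tagged))
print(result)  # ['hunger increases', 'in the morning', 'during workout']
```

If the sentence contains no matching tag at all, this variant yields the whole sentence as a single phrase.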
