How to remove stop words from a pandas Series using spaCy
I have been trying to remove stop words using the spaCy library.
Code:
import spacy
import pandas as pd
import numpy as np
nlp= spacy.load('en_core_web_sm')
My Series:
my_series
0 this laptop sits at just over 4 stars while so...
1 i ordered this monitor because i wanted to mak...
2 this monitor is a great deal for the price and...
3 bought this for the height adjustment. the swi...
4 worked for a month and then it died. after 5 c...
...
30618 great deal
30619 pour le travail
30620 business use
30621 good size
30622 pour mon ordinateur.plus grande image.vraiment...
Name: text_body,Length: 30623,dtype: object
Tokenization:
s_tokenized=my_series.apply(lambda x: nlp(x))
Removing stop words:
all_stopwords = nlp.Defaults.stop_words
filtered_text=s_tokenized.apply(lambda x: [w for w in x if not w in all_stopwords])
filtered_text
0 [this,laptop,sits,at,just,over,4,stars,...
1 [i,ordered,this,monitor,because,i,wanted...
2 [this,is,a,great,deal,for,the,...
3 [bought,height,adjustment,....
4 [worked,month,and,then,it,died,....
...
30618 [great,deal]
30619 [pour,le,travail]
30620 [business,use]
30621 [good,size]
30622 [pour,mon,ordinateur.plus,grande,image.vra...
Name: text_body,dtype: object
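Tokenization seems to work fine, but the stop-word removal does not seem to remove any words at all, and it does not raise any error either. Is there something I am missing or doing wrong?
Solution
The problem is in this line. Each w is a spaCy Token object, while all_stopwords is a set of strings, so the membership test never matches and nothing is filtered out: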
filtered_text=s_tokenized.apply(lambda x: [w for w in x if not w in all_stopwords])
Compare the token's text instead:
filtered_text=s_tokenized.apply(lambda x: [w for w in x if not w.text in all_stopwords])
You can also write the whole thing as:
import spacy
nlp=spacy.load("en_core_web_sm")
s_tokenized = my_series.apply(nlp)
all_stopwords = nlp.Defaults.stop_words
filtered_text=s_tokenized.apply(lambda x: [w for w in x if not w.text in all_stopwords])
filtered_text
0 [laptop,sits,4,stars]
1 [ordered,monitor,wanted]
dtype: object
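To see why the original comparison never matched, here is a minimal pure-Python sketch. FakeToken is a hypothetical stand-in invented for this illustration, not spaCy's real Token class, and the stop list is a tiny made-up subset. It only shows the general behaviour: an object that wraps a string is not equal to that string, so testing the object against a set of strings finds nothing.

```python
# Hypothetical stand-in for a spaCy Token; this is NOT spaCy's real class.
class FakeToken:
    def __init__(self, text):
        self.text = text

    def __repr__(self):
        return self.text


# Tiny illustrative stop list (an assumption, not spaCy's full list).
stop_words = {"this", "at", "just", "over"}
tokens = [FakeToken(w) for w in "this laptop sits at just over 4 stars".split()]

# Comparing the objects themselves: no object equals a string, so nothing is dropped.
unfiltered = [t for t in tokens if t not in stop_words]

# Comparing the underlying text: the stop words are removed.
filtered = [t for t in tokens if t.text not in stop_words]

print(unfiltered)  # all eight tokens are still present
print(filtered)    # [laptop, sits, 4, stars]
```

The same reasoning applies to the fix above: w.text is a plain string, so it can be looked up in the stop-word set.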
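Note that you do not need a pandas Series to hold the data. A single string or a list of strings is enough. The spaCy way of doing this, which scales even to data that does not fit in memory, is: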
import spacy
nlp=spacy.load("en_core_web_sm")
texts = ["this laptop sits at just over 4 stars while","i ordered this monitor because i wanted"]
docs = nlp.pipe(texts)
filtered_text= []
for doc in docs:
    # yield [tok for tok in doc if not tok.is_stop]
    filtered_text.append([tok for tok in doc if not tok.is_stop])
print(filtered_text)
[[laptop, sits, 4, stars], [ordered, monitor, wanted]]
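The commented-out yield in the loop above hints at a generator variant, which avoids building the whole result list in memory. The following is a rough sketch of that idea under a few assumptions: iter_filtered and its batch_size default are names and values chosen for this illustration, and spacy.blank("en") is swapped in for en_core_web_sm on the assumption that a tokenizer-only pipeline is enough, since is_stop is a lexical attribute. It yields plain strings rather than Token objects.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# because is_stop is a lexical attribute and needs no trained model.
nlp = spacy.blank("en")


def iter_filtered(texts, batch_size=1000):
    """Yield one list of non-stop-word strings per input text.

    Hypothetical helper for illustration; batch_size is an arbitrary default.
    """
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield [tok.text for tok in doc if not tok.is_stop]


texts = [
    "this laptop sits at just over 4 stars while",
    "i ordered this monitor because i wanted",
]

# Results are produced one document at a time as the loop consumes them.
for words in iter_filtered(texts):
    print(words)
```

Because texts can be any iterable, it could also be a generator that reads lines from a file, so neither the input nor the output has to be held in memory at once.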