How to remove punctuation and irrelevant words with stop words in text mining
The libraries I am using are:
import pandas as pd
import string
from nltk.corpus import stopwords
import nltk
I have the following dataframe:
df = pd.DataFrame({
    'Send': [
        'Golgi body,membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).',
        'The Golgi apparatus is responsible for transporting,modifying,and packaging proteins',
        'Non-foliated metamorphic rocks do not have a platy or sheet-like structure.',
        'The process of metamorphism does not melt the rocks.',
    ],
    'Class': ['biology', 'biology', 'geography', 'geography'],
})
print(df)
Send Class
Golgi body,membrane-bound organelle of eukary... biology
The Golgi apparatus is responsible for transpo... biology
Non-foliated metamorphic rocks do not have a p... geography
The process of metamorphism does not melt the ... geography
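I want to write a function that cleans the data in the "Send" column. I want to:
- remove punctuation;
- remove stop words ("stopwords");
- return a new dataframe whose "Send" column contains the "clean words".
I tried to develop the following function: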
def Text_Process(mess):
    nopunc = [char for char in mess if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
However, the result is not quite what I expected. When I run:
Text_Process(df['Send'])
the output is:
['Golgi','body,','membrane-bound','organelle','eukaryotic','cells','(cells','clearly','defined','nuclei).The','Golgi','apparatus','responsible','transporting,','modifying,','packaging','proteinsNon-foliated','metamorphic','rocks','platy','sheet-like','structure.The','process','metamorphism','melt','rocks.']
I would like the output to be a dataframe with the modified "Send" column:
df = pd.DataFrame({
    'Send': [
        'Golgi membrane bound organelle eukaryotic cells cells clearly defined nuclei',
        'Golgi apparatus responsible transporting modifying packaging proteins',
        'Non foliated metamorphic rocks platy sheet like structure',
        'process metamorphism melt rocks',
    ],
    'Class': ['biology', 'biology', 'geography', 'geography'],
})
I want the output to be a dataframe whose "Send" column has no punctuation and no irrelevant words.
Thanks.
Solution
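The function in the question is called on the whole column, so its loop walks over the cells rather than the characters of one sentence. Each "char" is then an entire sentence, the punctuation test never matches, and the sentences are glued together unchanged. The sketch below illustrates this with a plain list standing in for df['Send'] (an assumption made so the example runs without pandas; iterating a Series yields its cells in the same way).

```python
import string

# Stand-in for df['Send']: iterating a pandas Series also yields whole cells.
cells = ['Golgi body, membrane-bound.', 'The rocks.']

# Looping over the column: every item is a full sentence, which is never
# a substring of string.punctuation, so nothing is filtered out.
per_column = [c for c in cells if c not in string.punctuation]

# Looping over a single cell: every item really is one character.
per_cell = ''.join(ch for ch in cells[0] if ch not in string.punctuation)

print(per_column)  # the sentences come back untouched
print(per_cell)    # punctuation is gone, but "membrane-bound" is fused
```

Processing one cell at a time fixes the first problem. The second (hyphenated words fusing together) is one reason to split on non-word characters instead of only deleting punctuation.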
Here is a script that cleans the column. Note that you may want to add more words to the stop-word set to suit your requirements.
import pandas as pd
import string
import re
from nltk.corpus import stopwords
df = pd.DataFrame(
{'Send': ['Golgi body,membrane-bound organelle of eukaryotic cells (cells with clearly defined nuclei).','The Golgi apparatus is responsible for transporting,modifying,and packaging proteins','Non-foliated metamorphic rocks do not have a platy or sheet-like structure.','The process of metamorphism does not melt the rocks.'],'Class': ['biology','biology','geography','geography']})
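# Needs the NLTK stop-word corpus: nltk.download('stopwords')
table = str.maketrans('', '', string.punctuation)
stop_words = set(stopwords.words('english'))  # build once instead of on every word

def text_process(mess):
    # Split on runs of non-word characters, so "membrane-bound" becomes two words
    words = re.split(r'\W+', mess)
    # Strip any punctuation the split leaves behind (e.g. underscores)
    nopunc = [w.translate(table) for w in words]
    # Keep only non-empty words that are not English stop words
    return ' '.join(w for w in nopunc if w and w.lower() not in stop_words)

df['Send'] = df['Send'].apply(text_process)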
print(df)
Output:
Send Class
0 Golgi body membrane bound organelle eukaryotic cells cells clearly defined nuclei biology
1 Golgi apparatus responsible transporting modifying packaging proteins biology
2 Non foliated metamorphic rocks platy sheet like structure geography
3 process metamorphism melt rocks geography
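As for adding more words to the stop-word set, one possible approach is to take the union of the base set with your own words. The sketch below is only illustrative: BASE_STOPWORDS is a small hand-written stand-in for set(stopwords.words('english')) so that it runs without the NLTK corpus, and EXTRA_STOPWORDS and clean_text are hypothetical names with made-up example words.

```python
import re

# Hypothetical stand-in for set(stopwords.words('english'));
# in practice the NLTK list would be used here instead.
BASE_STOPWORDS = {'the', 'of', 'is', 'for', 'and', 'do', 'does',
                  'not', 'have', 'a', 'or', 'with'}

# Hypothetical domain-specific additions; pick words that suit your data.
EXTRA_STOPWORDS = {'process', 'body'}

STOP_WORDS = BASE_STOPWORDS | EXTRA_STOPWORDS


def clean_text(text, stop_words=STOP_WORDS):
    """Split on non-word characters, then drop empty tokens and stop words."""
    tokens = re.split(r'\W+', text)
    return ' '.join(t for t in tokens if t and t.lower() not in stop_words)


print(clean_text('The process of metamorphism does not melt the rocks.'))
```

With a dataframe, the same function could be applied per cell with df['Send'].apply(clean_text).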