How to fix ValueError: Iterable over raw text documents expected, string object received (TfidfVectorizer)
I have tried many things to track down the problem, but the error still occurs. I don't know why!
# Libraries imported
import pandas as pd
import numpy as np
import nltk
import sklearn
from sklearn import svm
from sklearn.model_selection import train_test_split
from nltk.probability import FreqDist
from collections import Counter
from nltk import word_tokenize,sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as pt
import string
from wordcloud import WordCloud
from PIL import Image
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.corpus import wordnet
from nltk.tag import pos_tag
TR_DS = pd.read_csv(r"D:\DataSet\Training_Dataset.csv")  # raw string so the backslashes are not treated as escapes
TR_DS.drop(["title", "date", "subject"], axis=1, inplace=True)  # drop unnecessary columns
TR_DS.isnull().any()
TR_DS.head()
# Lowercase the text column
news_text = TR_DS['text'].str.lower()
# Remove punctuation
news_text = news_text.str.replace(r'[^\w\d\s]', ' ', regex=True)
# Remove digits
news_text = news_text.str.replace(r'\d+', '', regex=True)
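A note on the two `str.replace` calls: the patterns are regular expressions, and in pandas 2.0+ `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed, so the cleaning can silently do nothing. A minimal self-contained sketch (the sample strings are made up for illustration):

```python
import pandas as pd

# Made-up mini-series standing in for the 'text' column
s = pd.Series(["Hello, World! 123", "Fake news?? 2021"])

# Lowercase, replace punctuation with spaces, strip digits.
# regex=True is required in pandas 2.0+ for the patterns to be
# interpreted as regular expressions.
cleaned = (
    s.str.lower()
     .str.replace(r"[^\w\d\s]", " ", regex=True)
     .str.replace(r"\d+", "", regex=True)
)
print(cleaned.tolist())
```

Raw strings (`r"..."`) are used so the backslash sequences like `\d` reach the regex engine intact.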
# Tokenization (split text into words)
def tokenize(text):
    tokens = re.split(r'\W+', text)
    return tokens

news_text = news_text.apply(lambda x: word_tokenize(x.lower()))
# Remove stopwords
stop_words = stopwords.words("english")
def remove_stopwords(text):
    text = [word for word in text if word not in stop_words]
    return text

news_text = news_text.apply(lambda x: remove_stopwords(x))
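Stopword removal simply drops every token that appears in the stopword list. A self-contained sketch with a tiny hardcoded list (the real code uses NLTK's English list, which requires the `stopwords` corpus to be downloaded):

```python
# Tiny hardcoded stopword set standing in for stopwords.words("english"),
# so this sketch runs without downloading any NLTK corpora.
stop_words = {"the", "a", "of", "to", "in"}

def remove_stopwords(tokens):
    # Keep only tokens that are not in the stopword set
    return [word for word in tokens if word not in stop_words]

print(remove_stopwords(["the", "senator", "of", "texas"]))  # ['senator', 'texas']
```

Using a `set` rather than a list makes the `in` lookup O(1), which matters when filtering hundreds of documents.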
# Lemmatization (reduce words to their stem form)
WordNetLemmatizer = nltk.WordNetLemmatizer()
def lemmer(text):
    text = [WordNetLemmatizer.lemmatize(word, pos='v') for word in text]
    return text

news_text = news_text.apply(lambda x: lemmer(x))
print(news_text)
Output
0      [reuters, man, accuse, tackle, u, senator, ran...
1      [washington, reuters, republican, senators, ...
2      [reuters, proposals, republicans, repeal, r...
3      [reuters, tax, plan, senate, r...
4      [washington, americans, likely, belie...
                             ...
295    [donald, trump, kick, hispanic, heritage, mont...
296    [us, even, conservatives, know, donald, ...
297    [take, office, repeatedly, take...
298    [sunday, august, th, salt, lake, city, police, ...
299    [u, rep, tim, murphy, staunch, pro, life, repu...
Name: text, Length: 300, dtype: object
tfidf_vect = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
def TFIDF(text):
    text = [tfidf_vect.fit(word) for word in text]
    return text

news_text = news_text.apply(lambda x: TFIDF(x))
**Error**
ValueError                                Traceback (most recent call last)
<ipython-input-232-886e227564fb> in <module>
      4     text = [tfidf_vect.fit(word) for word in text]
      5     return text
----> 6 news_text = news_text.apply(lambda x: TFIDF(x))

c:\users\mahad maqsood\appdata\local\programs\python\python39\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4210         else:
   4211             values = self.astype(object)._values
-> 4212             mapped = lib.map_infer(values, f, convert=convert_dtype)
   4213
   4214         if len(mapped) and isinstance(mapped[0], Series):

pandas\_libs\lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-232-886e227564fb> in <lambda>(x)
      4     text = [tfidf_vect.fit(word) for word in text]
      5     return text
----> 6 news_text = news_text.apply(lambda x: TFIDF(x))

<ipython-input-232-886e227564fb> in TFIDF(text)
      2 tfidf_vect = TfidfVectorizer()
      3 def TFIDF(text):
----> 4     text = [tfidf_vect.fit(word) for word in text]
      5     return text
      6 news_text = news_text.apply(lambda x: TFIDF(x))

<ipython-input-232-886e227564fb> in <listcomp>(.0)
      2 tfidf_vect = TfidfVectorizer()
      3 def TFIDF(text):
----> 4     text = [tfidf_vect.fit(word) for word in text]
      5     return text
      6 news_text = news_text.apply(lambda x: TFIDF(x))

c:\users\mahad maqsood\appdata\local\programs\python\python39\lib\site-packages\sklearn\feature_extraction\text.py in fit(self, raw_documents, y)
   1816         self._check_params()
   1817         self._warn_for_unused_params()
-> 1818         X = super().fit_transform(raw_documents)
   1819         self._tfidf.fit(X)
   1820         return self

c:\users\mahad maqsood\appdata\local\programs\python\python39\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1186         # TfidfVectorizer.
   1187         if isinstance(raw_documents, str):
-> 1188             raise ValueError(
   1189                 "Iterable over raw text documents expected, "
   1190                 "string object received.")

ValueError: Iterable over raw text documents expected, string object received.