Can I use a Python pandas DataFrame as an NLP corpus or set of documents?

I want to try the doc2vec model for my experiment, following this tutorial:

http://tutorialspoint.com/gensim/gensim_doc2vec_model.htm

I want to convert my dataset into a corpus to use as training data and apply the Gensim model to it.

Here is the link to my dataset:

https://drive.google.com/file/d/1S80I_5zkjJfeTzby7OjIqrs1vMJI6jVo/view?usp=sharing

I have already referred to this StackOverflow question, but it did not work for me:

How to create corpus from pandas data frame to operate with NLTK

You can also view my code on Google Colab:

https://colab.research.google.com/drive/1BmBNrfsxQ0AIJH_1hfMaMAceQLh2Xk7Q?usp=sharing

import pandas as pd
dataset = pd.read_csv('ADL_Two_column_MoCo.csv',encoding = 'unicode_escape')
dataset = dataset.dropna()

import gensim
def tagged_document(list_of_list_of_words):
    for i,list_of_words in enumerate(list_of_list_of_words):
        yield gensim.models.doc2vec.TaggedDocument(list_of_words,[i])


data = [dataset]   # wraps the entire DataFrame in a single list element, so it becomes one "document"
data
data_for_training = list(tagged_document(data))

model = gensim.models.doc2vec.Doc2Vec(vector_size=40,min_count=2,epochs=30)

model.build_vocab(data_for_training)

model.train(data_for_training,total_examples=model.corpus_count,epochs=model.epochs)

len(data_for_training)
1

data_for_training

[TaggedDocument(words=                                       Smile Canonical              Column  \
 0                         C1=CC=C(C=C1)C2OC(C(O2)CO)CO        CHIRALPAK AD
 1                     C1=CC=C(C=C1)C(C(C2=CC=CC=C2)O)O        CHIRALPAK AD
 2                        CC(C1=CC=C(C=C1)C2=CC=CC=C2)O        CHIRALPAK AD
 5    CC(C1=CC=CC=C1)OC(=O)C2=CC(=CC(=C2)[N+](=O)[O-...        CHIRALPAK AD
 6       C1=CC=C2C(=C1)C=CC(=C2C3=C(C=CC4=CC=CC=C43)O)O        CHIRALPAK AD
 ..                                                 ...                 ...
 839             C1CC(=O)NC(=O)C1N2C(=O)C3=CC=CC=C3C2=O  CHROMEGACHIRAL CCJ
 840              CC(C1=CC=C(S1)C(=O)C2=CC=CC=C2)C(=O)O  CHROMEGACHIRAL CCJ
 841  CCC(COC(=O)C1=CC(=C(C(=C1)OC)OC)OC)(C2=CC=CC=C...  CHROMEGACHIRAL CCJ
 842  CCC(COC(=O)C1=CC(=C(C(=C1)OC)OC)OC)(C2=CC=CC=C...  CHROMEGACHIRAL CCJ
 843  CCC(COC(=O)C1=CC(=C(C(=C1)OC)OC)OC)(C2=CC=CC=C...  CHROMEGACHIRAL CCJ

                                      Mobile phase
 0                                        methanol
 1                              n-hexane / ethanol
 2                            water / acetonitrile
 5                                        methanol
 6                           n-hexane / 2-propanol
 ..                                            ...
 839                                      methanol
 840  n-hexane / 2-propanol / trifluoroacetic acid
 841         n-heptane / 2-propanol / diethylamine
 842                         n-hexane / 2-propanol
 843                       methanol / diethylamine

 [828 rows x 3 columns], tags=[0])]

This is the output I got.

Training data output

I get this error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-45-72344a512bb5> in <module>
----> 1 model.train(data_for_training,epochs=model.epochs)

C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\doc2vec.py in train(self,documents,corpus_file,total_examples,total_words,epochs,start_alpha,end_alpha,word_count,queue_factor,report_delay,callbacks)
    555             sentences=documents,corpus_file=corpus_file,total_examples=total_examples,total_words=total_words,
    556             epochs=epochs,start_alpha=start_alpha,end_alpha=end_alpha,word_count=word_count,
--> 557             queue_factor=queue_factor,report_delay=report_delay,callbacks=callbacks,**kwargs)
    558 
    559     @classmethod

C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py in train(self,sentences,compute_loss,callbacks,**kwargs)
   1065             total_words=total_words,epochs=epochs,
   1066             queue_factor=queue_factor,compute_loss=compute_loss,
-> 1067             **kwargs)
   1068 
   1069     def _get_job_params(self,cur_epoch):

C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py in train(self,data_iterable,**kwargs)
    533             epochs=epochs,
    534             total_examples=total_examples,
--> 535             total_words=total_words,**kwargs)
    536 
    537         for callback in self.callbacks:

C:\ProgramData\Anaconda3\lib\site-packages\gensim\models\base_any2vec.py in _check_training_sanity(self,**kwargs)
   1171 
   1172         if not self.wv.vocab:  # should be set by `build_vocab`
-> 1173             raise RuntimeError("you must first build vocabulary before training the model")
   1174         if not len(self.wv.vectors):
   1175             raise RuntimeError("you must initialize vectors before training the model")

RuntimeError: you must first build vocabulary before training the model

Although I have already called build_vocab, the problem seems to come from the DataFrame itself.

Solution

Here is a very good doc2vec resource: https://towardsdatascience.com/how-to-vectorize-text-in-dataframes-for-nlp-tasks-3-simple-techniques-82925a5600db

# 1. The text must have spaces before and after the '='.
# 2. Word-tokenize each doc, generating a list of tokenized docs.
# 3. Use doc2bow to vectorize each doc into a list of (token_id, count) pairs.
# 4. Store the corpus in a DataFrame column of type object.

import pandas as pd
import numpy as np
from gensim.corpora.dictionary import Dictionary
from nltk.tokenize import word_tokenize
# import nltk; nltk.download('punkt')  # uncomment if the punkt tokenizer data is not installed

df = pd.read_csv('smile.csv')

# create an object-dtype column to hold the per-row corpus
df['corpus'] = np.empty
df['corpus'] = df['corpus'].astype(object)

for key, row in df.iterrows():
    doc = str(row['Smile Canonical'])
    # 1. add spaces around '=' so the tokenizer splits on it
    doc = doc.replace('=', ' = ')

    # 2. word-tokenize the doc, producing a list with one tokenized doc
    tokenized_docs = [word_tokenize(doc.lower())]

    # 3. build a dictionary for this doc and vectorize it with doc2bow
    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(d) for d in tokenized_docs]

    # 4. store the bag-of-words corpus in the DataFrame cell
    df.loc[key, 'corpus'] = corpus

print("Each document is converted into a bag of words that gives the frequency of each token")
print(df.head())
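If the goal is still to train the Doc2Vec model from the question, the key point is that each row must become its own TaggedDocument of word tokens. Wrapping the whole DataFrame in a single list element is why len(data_for_training) was 1 and why build_vocab ended up with no usable words. Below is a minimal sketch, assuming the same 'ADL_Two_column_MoCo.csv' file and its 'Smile Canonical' column as in the question; it is not the only way to do it, just one that matches the tokenization above.

import pandas as pd
import gensim
from nltk.tokenize import word_tokenize

dataset = pd.read_csv('ADL_Two_column_MoCo.csv', encoding='unicode_escape').dropna()

# one TaggedDocument per row: the words are the tokens of that row's SMILES string
data_for_training = [
    gensim.models.doc2vec.TaggedDocument(
        word_tokenize(str(smiles).replace('=', ' = ').lower()), [i])
    for i, smiles in enumerate(dataset['Smile Canonical'])
]

model = gensim.models.doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)
model.build_vocab(data_for_training)          # the vocabulary now contains real tokens
model.train(data_for_training,
            total_examples=model.corpus_count,
            epochs=model.epochs)

print(len(data_for_training))                 # 828 TaggedDocuments instead of 1
print(model.infer_vector(data_for_training[0].words)[:5])

Note that word_tokenize is a crude tokenizer for SMILES strings (it mostly splits on the spaces added around '='); a character-level or regex-based tokenizer may suit chemical notation better, but the training pipeline stays the same.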
