How to convert the topic list from gensim LDA get_document_topics() into DataFrame format
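I have done some topic modeling with gensim.models.ldamodel.LdaModel(), and I want to label my data so that I can visualize my findings.
This is what I have so far:
My current DataFrame has the following columns:
['text']['date']['gender']['tokens']['topics']['main_topic']
text is just the raw text data, date has the form (yyyy-mm-dd), gender is binary with 1 for female, tokens is the preprocessed text, and topics comes from: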
df['topics'] = LDA_model.get_document_topics(corpus)
and main_topic is filled in as follows, a slight variation on the second answer to this post:
df['main_topic'] = [int(str(sorted(LDA_model[i],reverse=True,key=lambda x: x[1])[0][0]).zfill(3)) for i in corpus]
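The expression above sorts each document's full topic list and passes the topic id through str and zfill, which has no effect once the result is cast back to int. A minimal sketch of an equivalent, assuming each document's topics are a list of (topic_id, probability) pairs as shown in the question; the helper name main_topic and the second sample row are hypothetical:

```python
def main_topic(doc_topics):
    """Return the id of the most probable topic in a list of (topic_id, probability) pairs."""
    # max() scans the list once; on ties it returns the first maximum it encounters
    topic_id, _probability = max(doc_topics, key=lambda pair: pair[1])
    return int(topic_id)


# Sample data: the first row is copied from the question, the second is made up
sample_topics = [
    [(0, 0.051341455), (1, 0.21204428), (2, 0.1145254),
     (4, 0.055585753), (11, 0.20260869), (29, 0.25616828)],
    [(3, 0.6), (7, 0.4)],
]

main_topics = [main_topic(doc) for doc in sample_topics]
```

With the objects from the question, this would presumably be used as df['main_topic'] = [main_topic(LDA_model[bow]) for bow in corpus], assuming LDA_model[bow] returns a plain list of pairs.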
In the end, the first 10 rows of topics and main_topic look like this (note that num_topics=30):
topics main_topic
[(0,0.051341455),(1,0.21204428),(2,0.1145254),(4,0.055585753),(11,0.20260869),(29,0.25616828)] 29
[(0,0.052005265),0.21128647),0.08015486),(3,0.11465485),0.4478401)] 29
[(0,0.05355798),0.1394092),0.10734849),0.32699445),0.273105)] 4
[(0,0.053568278),0.22299954),0.22616898),0.0959242),0.2897638)] 29
[(0,0.05404401),0.4482777),0.141311),0.24849494)] 1
[(0,0.054245334),0.18933308),0.14567153),0.11169399),(23,0.05768766),0.35825193)] 29
[(0,0.05449035),0.114870586),0.13284092),0.075592585),0.13247918),(24,0.06598773),0.32016253)] 29
[(0,0.055871632),0.23100668),0.06832383),0.4730603)] 29
[(0,0.057746172),0.057121024),0.07247137),0.26388222),(13,0.07291462),0.34331965)] 29
[(0,0.057841185),0.19891246),0.09586754),0.5344914)] 29
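Now what I want is this:
I want 30 new columns: "Topic 0, Topic 1, Topic 2, ..., Topic 29". For the first row, I want to take df['topics'] and store its values in the new columns so that:
Topic 0 of row 1 = 0.0513414, Topic 1 of row 1 = 0.21204, Topic 2 of row 1 = 0.11452, Topic 3 of row 1 = 0, and so on.
But I don't know how to do this. Can anyone help?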
Solution
I figured it out. In case anyone wants to achieve the same thing:
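import gensim
import pandas as pd

# Placeholder: in practice LDA_model is the already trained model from the question
LDA_model = gensim.models.ldamodel.LdaModel()
# Exploratory only: lists the attributes and methods available on LdaModel
dir(gensim.models.ldamodel.LdaModel)
df['topics'] = LDA_model.get_document_topics(corpus)
sf = pd.DataFrame(data=df['topics'])
# Create one empty column per topic, named '0' to '29'
af = pd.DataFrame()
for i in range(30):
    af[str(i)] = []
frames = [sf, af]
af = pd.concat(frames).fillna(0)
# Copy each (topic_id, probability) pair into the matching column of its row
for i in range(6301):
    for topic_id, prob in df['topics'][i]:
        af.loc[i, str(topic_id)] = prob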
(Note that 30 is my num_topics and 6301 is the number of rows in df['topics'].)
The DataFrame af now looks like this [limited to 5 rows and 5 columns]:
topics 0 1 2 3
0 [(1,0.055395175),(5,0.0647138),(7,0.13507782),(9,0.055264555),(13,0.19258575),(21,0.05181323),(27,0.07139948)] 0.0 0.05539517477154732 0.0 0.0
1 [(0,0.052290276),(6,0.064590134),0.24019116),(16,0.07827738),0.0994899)] 0.05229027569293976 0.0 0.0 0.0
2 [(6,0.054943837),0.07324204),(10,0.052613333),(12,0.12482096),0.19818054),(29,0.06280263)] 0.0 0.0 0.0 0.0
3 [(4,0.12759669),(8,0.06937062),0.2261674),0.066699274),(24,0.06150386),0.096883684)] 0.0 0.0 0.0 0.0
4 [(2,0.09043305),0.15643781),0.13145259),0.064689845),(17,0.05019963),0.09253424),(28,0.10176642)] 0.0 0.0 0.09043305367231369 0.0
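The nested loops above write one cell at a time, which can get slow with thousands of rows. As an alternative sketch (not the author's method), each document's list of pairs can be turned into a dict and pandas left to align the topic ids into columns. The function name topics_to_frame, the topic_N column names, and the sample data are assumptions made for illustration:

```python
import pandas as pd


def topics_to_frame(doc_topics, num_topics, index=None):
    """Build a wide DataFrame with one column per topic and 0.0 for absent topics."""
    # Each document becomes a {topic_id: probability} dict
    rows = [dict(topics) for topics in doc_topics]
    wide = pd.DataFrame(rows, index=index)
    # Make sure every topic id has a column, in order, then fill the gaps with 0.0
    wide = wide.reindex(columns=range(num_topics)).fillna(0.0)
    wide.columns = [f"topic_{i}" for i in range(num_topics)]
    return wide


# Sample data: the first row is taken from the question, the second is shortened
doc_topics = [
    [(0, 0.051341455), (1, 0.21204428), (2, 0.1145254),
     (4, 0.055585753), (11, 0.20260869), (29, 0.25616828)],
    [(1, 0.055395175), (5, 0.0647138)],
]

wide = topics_to_frame(doc_topics, num_topics=30)
```

The result could then be attached to the original data with something like df.join(topics_to_frame(df['topics'], 30, index=df.index)). Topics are missing from a row because get_document_topics drops those below its minimum_probability threshold; passing minimum_probability=0 should return nearly all topics, if the small values are wanted instead of zeros.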