How to resolve a soft cosine similarity of 1 between a query and a document
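I am computing the similarity between the query: query2 = 'Audit and control,Board structure,Remuneration,Shareholder rights,Transparency and Performance'
and a set of documents (in my case, a company's annual reports).
I am using GloVe vectors and computing the soft cosine between the vectors, but somehow I get a similarity score of 1 for two of the documents. How is that possible? I know for certain that these documents contain more than just the query words. Each document is a .txt file of cleaned text. A similarity of 1 would make sense if a document matched the query words exactly, but I know that is not the case.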
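For reference, this is the soft cosine measure as I understand it: the inner product x^T S y, normalized by sqrt(x^T S x) * sqrt(y^T S y), where S holds term-to-term similarities. Below is a minimal pure-Python sketch of that formula, not gensim's implementation. The two-term vocabulary, the matrix `S`, and both vectors are made up for illustration. It shows that a score close to 1 does not require any shared terms when the term similarities themselves are high.

```python
import math

def soft_cosine(x, y, s):
    """Soft cosine of dense vectors x and y under term-similarity matrix s."""
    def inner(a, b):
        # a^T S b, written out with plain loops
        return sum(a[i] * s[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))
    denom = math.sqrt(inner(x, x)) * math.sqrt(inner(y, y))
    return inner(x, y) / denom if denom else 0.0

# Hypothetical two-term vocabulary whose embeddings are almost identical.
S = [[1.0, 0.99],
     [0.99, 1.0]]
query_vec = [1.0, 0.0]  # contains only term 0
doc_vec = [0.0, 1.0]    # contains only term 1

score = soft_cosine(query_vec, doc_vec, S)
print(f"{score:0.3f}")  # prints 0.990
```

With the identity matrix in place of `S`, the same two vectors score 0, which is the ordinary cosine of two texts with no words in common.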
Code:
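# Imports assume gensim 4.x; in gensim 3.x WordEmbeddingSimilarityIndex lives in gensim.models.
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import (SoftCosineSimilarity, SparseTermSimilarityMatrix,
                                 WordEmbeddingSimilarityIndex)

# corpus (list of tokenized documents), query (list of tokens) and titles are defined earlier.

# Load the GloVe vectors only once per session.
if 'glove' not in locals():
    glove = api.load("glove-wiki-gigaword-50")

similarity_index = WordEmbeddingSimilarityIndex(glove)

def build_term(corpus, query):
    # Build the term similarity matrix over the vocabulary of corpus + query.
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)
    return similarity_matrix

# Despite the name, this holds the term similarity matrix, not a TF-IDF model.
tfidf_model = build_term(corpus, query)

def doc_similarity_scores(query, similarity_matrix):
    # Soft cosine similarity between the TF-IDF query and every TF-IDF document.
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)
    query_tf = tfidf[dictionary.doc2bow(query)]
    index = SoftCosineSimilarity(tfidf[[dictionary.doc2bow(document) for document in corpus]], similarity_matrix)
    doc_similarity_scores = index[query_tf]
    return doc_similarity_scores

document_sim_scores = doc_similarity_scores(query, tfidf_model)
# sort_similarity_scores_by_document is a helper that is not shown in this post.
sorted_sim_scores = sort_similarity_scores_by_document(document_sim_scores)

# For each query term and each document, collect the document words similar to that term.
doc_similar_terms = []
max_results_per_doc = 50
for term in query:
    dictionary = Dictionary(corpus + [query])
    idx1 = dictionary.token2id[term]
    for document in corpus:
        results_this_doc = []
        for word in set(document):
            idx2 = dictionary.token2id[word]
            score = tfidf_model.matrix[idx1, idx2]
            if score > 0.0:
                results_this_doc.append((word, score))
        results_this_doc = sorted(results_this_doc, reverse=True, key=lambda x: x[1])
        results_this_doc = results_this_doc[:min(len(results_this_doc), max_results_per_doc)]
        doc_similar_terms.append(results_this_doc)

# Print document index, similarity score and title for the top-ranked documents.
for idx in sorted_sim_scores[:90]:
    similar_terms_string = ','.join([result[0] for result in doc_similar_terms[idx]])
    print(f'{idx} \t {document_sim_scores[idx]:0.3f} \t {titles[idx]}')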
Results:
25 1.000 2019_q4_en_eur_con_00.txt
14 1.000 2017_q3_en_eur_con_00.txt
16 0.994 2018_ar_en_eur_con_00.txt
21 0.989 2019_ar_en_eur_con_00.txt
28 0.986 2020_q2_en_eur_con_00.txt
1 0.963 2014_ar_en_eur_con_00.txt
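One thing I have not ruled out is display rounding. The print statement formats each score with `:0.3f`, so any value of at least 0.9995 is shown as 1.000. A small sketch with made-up scores (these are not values from my run) that prints the rounded and the unrounded form side by side:

```python
# Hypothetical scores chosen to sit just above and below the rounding threshold.
scores = [0.99951, 0.99949, 1.0]

shown = [f"{s:0.3f}" for s in scores]
for s, text in zip(scores, shown):
    # repr() shows the value without the three-decimal rounding
    print(f"{text}  {s!r}")
```

Printing `document_sim_scores[idx]` with `!r` in the same way should show whether the two top documents score exactly 1 or only a value that rounds to it.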