我使用Kmeans算法处理标签数据时遇到问题。我的测试句子获得了True Cluster,但是我没有得到True标签。我已经使用numpy来将集群与true_label_test进行匹配,但是此kmeans可以移动集群,真实标签与集群的数量不匹配。我需要这个问题的帮助。这是我的代码
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import re
import numpy as np
from collections import Counter
stop = set(stopwords.words('indonesian'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
# Cleaning the text sentences so that punctuation marks,stop words & digits are removed
def clean(doc):
stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
processed = re.sub(r"\d+","",normalized)
y = processed.split()
#print (y)
return y
path = "coba.txt"
train_clean_sentences = []
fp = open(path,'r')
for line in fp:
line = line.strip()
cleaned = clean(line)
cleaned = ' '.join(cleaned)
train_clean_sentences.append(cleaned)
#print(train_clean_sentences)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_clean_sentences)
# Clustering the training 30 sentences with K-means technique
modelkmeans = KMeans(n_clusters=3,init='k-means++',max_iter=200,n_init=100)
modelkmeans.fit(X)
teks_satu = "Aplikasi Machine Learning untuk mengenali daun mangga dengan metode CNN"
test_clean_sentence = []
cleaned_test = clean(teks_satu)
cleaned = ' '.join(cleaned_test)
cleaned = re.sub(r"\d+",cleaned)
test_clean_sentence.append(cleaned)
Test = vectorizer.transform(test_clean_sentence)
true_test_labels = ['AI','VR','Sistem Informasi']
predicted_labels_kmeans = modelkmeans.predict(Test)
print(predicted_labels_kmeans)
print ("\n-------------------------------PREDICTIONS BY K-Means--------------------------------------")
print ("\nIndex of Virtual Reality : ",Counter(modelkmeans.labels_[5:10]).most_common(1)[0][0])
print ("Index of Machine Learning : ",Counter(modelkmeans.labels_[0:5]).most_common(1)[0][0])
print ("Index of Sistem Informasi : ",Counter(modelkmeans.labels_[10:15]).most_common(1)[0][0])
print ("\n",teks_satu,":",true_test_labels[np.int(predicted_labels_kmeans)],predicted_labels_kmeans)
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。