2016-12-06

sklearn clustering: computing the silhouette coefficient for TF-IDF weighted data

To compute the silhouette score I follow scikit-learn's silhouette_analysis example:

from sklearn.feature_extraction.text import TfidfVectorizer 

tfidf_vectorizer = TfidfVectorizer(use_idf=True) 
sampleText = [] 
sampleText.append("Some text for document clustering") 
tfidf_matrix = tfidf_vectorizer.fit_transform(sampleText) 

How do I need to transform my tfidf_matrix to do something like this?

import matplotlib.cm as cm
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score


for num_clusters in range(2,6): 
    # Create a subplot with 1 row and 2 columns 
    fig, (ax1, ax2) = plt.subplots(1, 2) 
    fig.set_size_inches(18, 7) 

    # The 1st subplot is the silhouette plot 
    # The silhouette coefficient can range from -1, 1 but in this example all 
    # lie within [-0.1, 1] 
    ax1.set_xlim([-0.1, 1]) 
    # The (n_clusters+1)*10 is for inserting blank space between silhouette 
    # plots of individual clusters, to demarcate them clearly. 
    ax1.set_ylim([0, tfidf_matrix.shape[0] + (num_clusters + 1) * 10])  # len() fails on a sparse matrix

    km = KMeans(n_clusters=num_clusters,
                n_init=10,      # number of runs with different centroid seeds
                random_state=1  # fixes the seed
                )

    cluster_labels = km.fit_predict(tfidf_matrix) 

    # The silhouette_score gives the average value for all the samples. 
    # This gives a perspective into the density and separation of the formed 
    # clusters 
    silhouette_avg = silhouette_score(tfidf_matrix, cluster_labels) 

Answer

TF-IDF vectors are high-dimensional, so they must be reduced to two dimensions first. This can be done by projecting the tf-idf matrix onto the two directions with the highest variance; I use PCA for that reduction. Full example:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(use_idf=True)
sampleText = []
sampleText.append("Some text for document clustering")
# In practice you need several documents here: two-component PCA and
# KMeans with k clusters both require more than one sample.
tfidf_matrix = tfidf_vectorizer.fit_transform(sampleText)
X = tfidf_matrix.todense()  # PCA requires a dense array

from sklearn.decomposition import PCA 
pca = PCA(n_components=2).fit(X) 
data2D = pca.transform(X) 

import matplotlib.cm as cm
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score


for num_clusters in range(2, 6):
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(data2D) + (num_clusters + 1) * 10])

    km = KMeans(n_clusters=num_clusters,
                n_init=10,      # number of runs with different centroid seeds
                random_state=1  # fixes the seed
                )

    cluster_labels = km.fit_predict(data2D)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(data2D, cluster_labels)
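Putting the answer together, here is a minimal self-contained sketch of the same pipeline. The toy corpus below is made up for illustration (the single sample document above is not enough for PCA or clustering); it vectorizes the texts, reduces them to 2D with PCA, and prints the average silhouette score for each cluster count:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Made-up example documents: two rough topics (programming, cooking).
docs = [
    "machine learning with python",
    "deep learning neural networks",
    "python for data science",
    "cooking pasta with tomato sauce",
    "baking bread at home",
    "easy tomato soup recipe",
]

# Vectorize, densify, and reduce to the two highest-variance directions.
X = TfidfVectorizer(use_idf=True).fit_transform(docs).toarray()
data2D = PCA(n_components=2).fit_transform(X)

# Average silhouette score per cluster count (need n_samples > n_clusters).
scores = {}
for num_clusters in range(2, 5):
    km = KMeans(n_clusters=num_clusters, n_init=10, random_state=1)
    cluster_labels = km.fit_predict(data2D)
    scores[num_clusters] = silhouette_score(data2D, cluster_labels)

for k, s in sorted(scores.items()):
    print(k, round(s, 3))
```

The k with the highest average score is the best candidate for the number of clusters; silhouette values always lie in [-1, 1].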