Text clustering

This example shows how to cluster documents using scikit-learn.

We first import the two parts we need from scikit-learn: TfidfVectorizer for preprocessing and KMeans for clustering.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.cluster import KMeans

Reading documents and creating the term/document matrix

We read the documents as in Assignment 2.

Even though this will be an unsupervised setup, we read the labels as well. We will use them for evaluation. The labels correspond to the type of product being reviewed: books, cameras, DVDs, health products, music, software.

In [2]:
def read_documents(doc_file):
    docs = []
    labels = []
    with open(doc_file, encoding='utf-8') as f:
        for line in f:
            # the first field is the label; the next two are metadata fields that we skip
            label, _, _, doc = line.strip().split(maxsplit=3)
            docs.append(doc)
            labels.append(label)
    return docs, labels

To exemplify, here is one document and its label.

In [3]:
docs, labels = read_documents('all_sentiment_shuffled.txt')
labels[1], docs[1]
Out[3]:
('music',
 'i was misled and thought i was buying the entire cd and it contains one song')

We create a vectorizer: the scikit-learn component that converts raw data (in our case, documents) into a matrix. Here, the conversion is based on TF-IDF. (We will see more about vectorizers in the course Applied Machine Learning.)

In [4]:
vectorizer = TfidfVectorizer(stop_words='english')
doc_matrix = vectorizer.fit_transform(docs)
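
For reference: with its default settings, TfidfVectorizer weighs a term $t$ in a document $d$ as $\mathrm{tf}(t,d) \cdot \mathrm{idf}(t)$, where $\mathrm{tf}(t,d)$ is the raw count of $t$ in $d$ and $\mathrm{idf}(t) = \ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, with $n$ the number of documents and $\mathrm{df}(t)$ the number of documents containing $t$. Each document vector is then normalized to unit Euclidean length.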

Taking a look at the term/document matrix

The 11914 rows of the matrix correspond to the documents, while the 46619 columns correspond to the unique words ("terms") in the observed vocabulary.

In [5]:
doc_matrix.shape
Out[5]:
(11914, 46619)

To exemplify, here is the representation of the document above. This is a sparse matrix representation, so we only see the columns where the stored value is non-zero.

Please note that the number of entries here is smaller than the number of words in the document. This is because we removed the stopwords (the, it, etc.) when creating the vectorizer.

In [6]:
print(doc_matrix[1])
  (0, 38646)	0.2909487247074812
  (0, 6662)	0.309574108186583
  (0, 26964)	0.6467026578598158
  (0, 41736)	0.26731640860851746
  (0, 14541)	0.3291569625199514
  (0, 7423)	0.2480761887729681
  (0, 9540)	0.3999302685791417

To look at the matrix a bit more easily, we'll need to map column positions back to the corresponding words. This requires a little bit of work.

The vectorizer has a dictionary called vocabulary_ that stores the word-to-column mapping. We compute the inverse of this mapping, that is the column-to-word mapping, and store this as a dictionary.

In [7]:
vocab_inverted = {}
for word, column_index in vectorizer.vocabulary_.items():
    vocab_inverted[column_index] = word
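
As a side note, the same inverted mapping can be written as a one-line dict comprehension, and newer scikit-learn versions (1.0 and later) can also produce the column-to-word array directly. Both lines below are alternatives to the loop above, not something we rely on later:

vocab_inverted = {i: word for word, i in vectorizer.vocabulary_.items()}
# words = vectorizer.get_feature_names_out()  # array where position i holds the word for column i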

To exemplify, the position 38646 (as in the document vector above) represents the word song.

In [8]:
vocab_inverted[38646]
Out[8]:
'song'

Running the clustering algorithm

Now, let's apply the $k$-means algorithm. We first create the clusterer object (KMeans).

Because we know a priori that this set of reviews contains six different groups (corresponding to the type of product being reviewed), let's tell the algorithm to find six clusters. In most real-world situations we won't have this information, and we would instead need some rule of thumb to select the number of clusters. (For instance, for a GMM we could use the AIC or BIC criteria.)
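
For $k$-means specifically, one common rule of thumb is the "elbow method": fit the model for a range of values of $k$ and look for the point where the inertia (the within-cluster sum of squared distances) stops decreasing sharply. Here is a minimal sketch of that idea; the helper function and its parameter settings are just an illustration, and running it means waiting for several full clustering runs:

def inertia_curve(matrix, k_values):
    # fit one k-means model per candidate k and record its inertia
    return [KMeans(n_clusters=k, n_init=10, random_state=0).fit(matrix).inertia_
            for k in k_values]

# for instance: inertia_curve(doc_matrix, range(2, 11))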

We also set verbose=True so that we can see what the algorithm is doing.

In [9]:
clusterer = KMeans(n_clusters=6, verbose=True)

To carry out the actual clustering, we call the method fit_predict. This will do two things: (1) find the centroids, and (2) assign each document to a cluster. It returns an array with the cluster assignment of each document. Running this will take a couple of minutes, depending on how fast your machine is.

$k$-means is sensitive to initialization, so the scikit-learn implementation runs the algorithm several times with different initializations (ten by default) and keeps the solution with the lowest inertia. In addition, each run picks its initial centroids with the $k$-means++ scheme, which spreads them out instead of choosing them uniformly at random.
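
Incidentally, if you want the clustering to be reproducible, you can fix the random seed and the number of restarts explicitly when creating the clusterer; for example (illustrative parameter values):

clusterer = KMeans(n_clusters=6, n_init=10, random_state=0, verbose=True)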

In [10]:
clustered_docs = clusterer.fit_predict(doc_matrix)
Initialization complete
Iteration  0, inertia 22895.553
Iteration  1, inertia 11690.972
Iteration  2, inertia 11627.764
Iteration  3, inertia 11583.360
Iteration  4, inertia 11567.809
Iteration  5, inertia 11563.545
Iteration  6, inertia 11561.870
Iteration  7, inertia 11560.012
Iteration  8, inertia 11558.351
Iteration  9, inertia 11558.066
Iteration 10, inertia 11557.956
Iteration 11, inertia 11557.893
Iteration 12, inertia 11557.806
Iteration 13, inertia 11557.754
Iteration 14, inertia 11557.727
Iteration 15, inertia 11557.710
Iteration 16, inertia 11557.700
Iteration 17, inertia 11557.696
Iteration 18, inertia 11557.692
Iteration 19, inertia 11557.691
Iteration 20, inertia 11557.690
Converged at iteration 20: center shift 0.000000e+00 within tolerance 2.118078e-09
[... output truncated: nine further initializations omitted; each run converges in a similar way, and the solution with the lowest inertia is kept ...]

Interpreting the clusters

We will now try to interpret the meaning of the different clusters.

The approach here works for any clustering algorithm based on centroids, such as $k$-means or a GMM. Each cluster centroid is a vector that represents the coordinates of the "center" of that cluster.

Since each dimension in the centroid vector corresponds to a word, we will find the most "important" or representative words simply by sorting the weights in the centroid from highest to lowest. We'll then output the $n$ words with the highest score and use this as a description of the cluster.

In [11]:
def summarize_cluster(center, inv_voc, n):
    # the centroid is a vector; we'll first make a list of weights and their corresponding dimensions
    # e.g. [ (0.345, 0), (1.48, 1), (0.95, 2), ...]
    center_index = [ (x, i) for i, x in enumerate(center) ]

    # we sort this by the weights and select the top n
    topn = sorted(center_index, reverse=True)[:n]

    # we finally map this back to words, using the inverted vocabulary dict that we created above
    return [ inv_voc[i] for _, i in topn ]
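
The sort above goes through Python tuples; an equivalent and typically faster version uses numpy's argsort, which returns the column indices ordered by weight. This is just an alternative formulation of the same function:

import numpy as np

def summarize_cluster_np(center, inv_voc, n):
    # indices of the n largest weights, from highest to lowest
    top_idx = np.argsort(center)[::-1][:n]
    return [inv_voc[i] for i in top_idx]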

Let's print the top 10 words for each of the six clusters. In most cases, you will see clusters that seem interpretable in terms of the six product types we know exist in this dataset. (However, the exact result you get can depend on the initialization of $k$-means.)

In [12]:
for i, c in enumerate(clusterer.cluster_centers_):
    print('{}:'.format(i), ' '.join(summarize_cluster(c, vocab_inverted, 10)))
0: product great good like just use does hair time did
1: movie film movies like watch story just good great acting
2: album cd music songs quot song like just great band
3: software program product version use computer support easy windows microsoft
4: camera lens pictures canon digital use flash battery quality great
5: book read books author reading story like quot just written
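
Before evaluating quantitatively, a quick sanity check is to print the most frequent gold labels within each cluster. This small sketch anticipates the purity computation below:

from collections import Counter

for cl in range(6):
    label_counts = Counter(l for l, c in zip(labels, clustered_docs) if c == cl)
    print(cl, label_counts.most_common(2))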

Quantitative evaluation of the cluster quality

We will compute a score that measures to what extent the clustering algorithm finds groups that are similar to the groups defined by humans (in this case, the type of product).

There are several ways to compute such a score. In this case, we will use the purity score. Intuitively, the purity computation goes through each cluster and checks how "pure" it is: how much of the cluster is covered by its most common human-labeled category? For instance, if cluster 1 computed by $k$-means consists entirely of camera reviews and no other type of review, it is considered completely "pure". See the Wikipedia page for more details.
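
Formally, if $c_1, \dots, c_K$ are the clusters and $t_1, \dots, t_J$ the human-labeled categories, then $\mathrm{purity} = \frac{1}{N} \sum_{k=1}^{K} \max_{j} |c_k \cap t_j|$, where $N$ is the total number of documents. This is exactly what the code below computes.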

In [13]:
# as in Assignment 2, we use a Counter to store frequencies
from collections import Counter

def purity(labels, clustered):
    
    # find the set of cluster ids
    cluster_ids = set(clustered)

    N = len(clustered)
    majority_sum = 0  
    for cl in cluster_ids:
        
        # for this cluster, we compute the frequencies of the different human labels we encounter
        # the result will be something like { 'camera':1, 'books':5, 'software':3 } etc.
        labels_cl = Counter(l for l, c in zip(labels, clustered) if c == cl)

        # we select the *highest* score and add it to the total sum
        majority_sum += max(labels_cl.values())

    # the purity score is the sum of majority counts divided by the total number of items
    return majority_sum / N

We compute the purity score for the clusters we created above. In my case I got about 0.69; the result will fluctuate a bit, depending on the $k$-means initialization. This seems fairly decent, given that purity scores are at most 1.

In [14]:
purity(labels, clustered_docs)
Out[14]:
0.6870908175256001

In addition to the purity, it's common to report the inverse purity as well. This is the same computation, just with the roles of the clusters and the human-labeled categories swapped.

The reason why we'd like to compute it in both directions is that the purity score can be "gamed": by creating one cluster for every data point, each cluster would be completely "pure". Conversely, the inverse purity can be gamed by creating one big cluster containing every data point. By reporting both scores, we achieve a balance.

In my case, I got an inverse purity score that was nearly identical to the original purity score. This is not the case in general.

In [15]:
purity(clustered_docs, labels)
Out[15]:
0.6875944267248615
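
If a single summary number is needed, one common convention is to combine purity and inverse purity using a harmonic mean, in the style of an F-score. This combination is not built into the code above, but it is easy to compute:

p = purity(labels, clustered_docs)
ip = purity(clustered_docs, labels)
print(2 * p * ip / (p + ip))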

A perfect clustering has a purity of 1, and the score is in fact bounded below by the proportion of the most common category, so rather than comparing to zero, it's more interesting to compare to a random clustering. This will fluctuate a bit, but I got scores around 0.18 in this case, so our clustering performance seems much better.

In [16]:
import numpy as np
random_clustering = np.random.randint(6, size=len(labels))

purity(labels, random_clustering), purity(random_clustering, labels)
Out[16]:
(0.17743830787309048, 0.17903307033741817)

Scikit-learn includes several clustering evaluation metrics (but not the purity/inverse purity). For instance, the adjusted Rand score:

In [17]:
from sklearn.metrics.cluster import adjusted_rand_score
adjusted_rand_score(labels, clustered_docs)
Out[17]:
0.32673751561533537
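
Two more of the built-in alternatives are the normalized mutual information and the V-measure, which likewise compare the clustering against the gold labels:

from sklearn.metrics.cluster import normalized_mutual_info_score, v_measure_score
normalized_mutual_info_score(labels, clustered_docs), v_measure_score(labels, clustered_docs)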