This example shows how to cluster documents using scikit-learn.
We first import the two parts we need from scikit-learn: TfidfVectorizer for preprocessing and KMeans for clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
We read the documents as in Assignment 2.
Even though this will be an unsupervised setup, we read the labels as well. We will use them for evaluation. The labels correspond to the type of product being reviewed: books, cameras, DVDs, health products, music, software.
def read_documents(doc_file):
    docs = []
    labels = []
    with open(doc_file, encoding='utf-8') as f:
        for line in f:
            label, _, _, doc = line.strip().split(maxsplit=3)
            docs.append(doc)
            labels.append(label)
    return docs, labels
To exemplify, here is one document and its label.
docs, labels = read_documents('all_sentiment_shuffled.txt')
labels[1], docs[1]
We create a vectorizer: this is the scikit-learn component that converts raw data (in our case, documents) into a matrix. Here, the conversion is based on TF-IDF. (We will see more about vectorizers in the course Applied Machine Learning.)
vectorizer = TfidfVectorizer(stop_words='english')
doc_matrix = vectorizer.fit_transform(docs)
The 11914 rows of the matrix correspond to the documents, while the 46619 columns correspond to the unique words ("terms") in the observed vocabulary.
doc_matrix.shape
To exemplify, here is the representation of the document above. This is a sparse matrix representation, so we only see the columns where the stored value is non-zero.
Please note that the number of entries here is smaller than the number of words in the document. This is partly because we removed some stop words (the, it, etc.), and partly because repeated words share a single column.
print(doc_matrix[1])
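If you prefer to inspect the row programmatically rather than reading the printout, the non-zero columns and their weights are also available as attributes of the sparse row. This is a minimal sketch, assuming the matrix is in SciPy's CSR format (which is what TfidfVectorizer returns):
# a minimal sketch: inspect one sparse row directly via its CSR attributes
row = doc_matrix[1]                        # a 1 x vocabulary-size sparse row
for column, weight in zip(row.indices, row.data):
    print(column, weight)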
To be able to look a bit more easily at the matrix, we'll need to go back from column positions to the corresponding word. This requires a little bit of work.
The vectorizer has a dictionary called vocabulary_ that stores the word-to-column mapping. We compute the inverse of this mapping, that is, the column-to-word mapping, and store this as a dictionary.
vocab_inverted = {}
for word, column_index in vectorizer.vocabulary_.items():
    vocab_inverted[column_index] = word
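Depending on your scikit-learn version, the vectorizer can also give you this mapping directly. In recent versions, get_feature_names_out returns an array in which position i holds the word for column i; a minimal sketch, assuming such a version is installed:
# alternative (recent scikit-learn versions): an array mapping column index -> word
feature_names = vectorizer.get_feature_names_out()
print(feature_names[38646])    # should print the same word as vocab_inverted[38646]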
To exemplify, the position 38646 (as in the document vector above) represents the word song.
vocab_inverted[38646]
Now, let's apply the $k$-means algorithm. We first create the clusterer object (KMeans).
Because we know a priori that this set of reviews has six different groups (corresponding to the type of product being reviewed), let's tell the algorithm to find six clusters. In most real-world situations, we won't have this information and would instead need some rule of thumb to select the number of clusters; for instance, for a GMM we could use the AIC or BIC criteria. (A sketch of one such rule of thumb is shown below, after we create the clusterer.)
We also set verbose=True so that we can see what the algorithm is doing.
clusterer = KMeans(n_clusters=6, verbose=True)
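One such rule of thumb is the "elbow" heuristic: run $k$-means for a range of candidate cluster counts and look at the inertia (the within-cluster sum of squared distances) as a function of the number of clusters, picking the point where it stops decreasing sharply. Here is a minimal sketch; we don't run it in this notebook, since it refits the model for every candidate count, which is slow on this dataset.
# a minimal sketch of the "elbow" heuristic for choosing the number of clusters
# (not run here: refitting k-means for every candidate k is slow on this dataset)
inertias = []
for k in range(2, 11):
    km = KMeans(n_clusters=k)
    km.fit(doc_matrix)
    inertias.append((k, km.inertia_))
# look for the k after which the inertia stops decreasing sharply
print(inertias)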
To carry out the actual clustering, we call the method fit_predict. This will do two things: (1) find the centroids, and (2) assign each document to a cluster. It will return an array containing the cluster assignment for each document. Running this will take a couple of minutes, depending on how fast your machine is.
$k$-means is sensitive to initialization, so the implementation in scikit-learn will run the algorithm several times with different initializations and keep the best result. (By default, the initial centroids are chosen using a scheme called $k$-means++.)
clustered_docs = clusterer.fit_predict(doc_matrix)
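Before interpreting the clusters, a quick sanity check is to count how many documents were assigned to each cluster:
# a quick sanity check: how many documents ended up in each cluster?
from collections import Counter
print(Counter(clustered_docs))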
We will now try to interpret the meaning of the different clusters.
The approach here can be used with any clustering algorithm that is based on centroids, such as $k$-means or a GMM. Each cluster centroid is a vector that represents the coordinates of the "center" of that cluster.
Since each dimension in the centroid vector corresponds to a word, we will find the most "important" or representative words simply by sorting the weights in the centroid from highest to lowest. We'll then output the $n$ words with the highest score and use this as a description of the cluster.
def summarize_cluster(center, inv_voc, n):
    # the centroid is a vector; we'll first make a list of weights and their corresponding dimensions
    # e.g. [ (0.345, 0), (1.48, 1), (0.95, 2), ...]
    center_index = [ (x, i) for i, x in enumerate(center) ]
    # we sort this by the weights and select the top n
    topn = sorted(center_index, reverse=True)[:n]
    # we finally map this back to words, using the inverted vocabulary dict that we created above
    return [ inv_voc[i] for _, i in topn ]
Let's print the top 10 words for each of the six clusters. In most cases, you will see clusters that seem interpretable in terms of the six product types we know exist in this dataset. (However, the exact result you get can depend on the initialization of $k$-means.)
for i, c in enumerate(clusterer.cluster_centers_):
    print('{}:'.format(i), ' '.join(summarize_cluster(c, vocab_inverted, 10)))
We will compute a score that measures to what extent the clustering algorithm finds groups that are similar to the groups defined by humans (in this case, the type of product).
There are some different ways to compute such a score. In this case, we will use the purity score. Intuitively, the algorithm to compute the purity score will go through each cluster, and see how "pure" it is: how well does the most common human-labeled category cover that cluster? For instance, if cluster 1 computed by $k$-means completely consists of camera reviews and no other type of review, then it is considered completely "pure". See the Wikipedia page for more details.
# as in Assignment 2, we use a Counter to store frequencies
from collections import Counter

def purity(labels, clustered):
    # find the set of cluster ids
    cluster_ids = set(clustered)
    N = len(clustered)
    majority_sum = 0
    for cl in cluster_ids:
        # for this cluster, we compute the frequencies of the different human labels we encounter
        # the result will be something like { 'camera':1, 'books':5, 'software':3 } etc.
        labels_cl = Counter(l for l, c in zip(labels, clustered) if c == cl)
        # we select the *highest* count and add it to the total sum
        majority_sum += max(labels_cl.values())
    # the purity score is the sum of majority counts divided by the total number of items
    return majority_sum / N
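To check that the function behaves as expected, here is a tiny hand-made example: the first cluster contains three books reviews and one camera review, the second contains two music reviews, so the purity should be (3 + 2) / 6.
# a tiny hand-made sanity check for the purity function
toy_labels    = ['books', 'books', 'books', 'camera', 'music', 'music']
toy_clustered = [0, 0, 0, 0, 1, 1]
purity(toy_labels, toy_clustered)    # (3 + 2) / 6 = 0.8333...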
We compute the purity score for the clusters we created above. In my case I got about 0.69; the result will fluctuate a bit, depending on $k$-means initialization. This seems fairly decent. (The purity score ranges between 0 and 1.)
purity(labels, clustered_docs)
In addition to the purity, it's common to report the inverse purity as well. This is the same computation, just carried out in the other direction: we swap the roles of the human labels and the cluster assignments.
The reason why we'd like to compute it in both directions is that the purity score can be "gamed": by creating one cluster for every data point, each cluster would be completely "pure". Conversely, the inverse purity can be gamed by creating one big cluster containing every data point. By reporting both scores, we achieve a balance.
In my case, I got an inverse purity score that was identical to the original purity score. This is not the case in general.
purity(clustered_docs, labels)
The purity score has a lower bound of zero, but it's more interesting to compare to a random clustering. This will fluctuate a bit, but I got scores around 0.18 in this case, so our clustering performance seems much better.
import numpy as np
random_clustering = np.random.randint(6, size=len(labels))
purity(labels, random_clustering), purity(random_clustering, labels)
Scikit-learn includes several clustering evaluation metrics (but not the purity/inverse purity). For instance, the adjusted Rand score:
from sklearn.metrics.cluster import adjusted_rand_score
adjusted_rand_score(labels, clustered_docs)
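Other metrics from the same module work in the same way, taking the human labels and the cluster assignments. For instance, the normalized mutual information; a minimal sketch, shown only as a further example:
# another built-in metric: normalized mutual information between labels and clusters
from sklearn.metrics.cluster import normalized_mutual_info_score
normalized_mutual_info_score(labels, clustered_docs)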