Data Science Portfolio

Document Clustering with K-means

07 Mar 2023


# data manipulation
import pandas as pd
import numpy as np
import time
import re
from tqdm import tqdm

# text preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('wordnet')

# visualizing
import matplotlib.pyplot as plt
import altair as alt
import seaborn as sns
sns.set_style('darkgrid')

# clusters
from sklearn import metrics
from sklearn.cluster import MiniBatchKMeans, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.datasets import fetch_20newsgroups
# PCA Decomposition
from sklearn.decomposition import PCA
# silhouette method
from sklearn.metrics import silhouette_samples, silhouette_score

# annoying error messages =P
import warnings
warnings.filterwarnings("ignore")

Data Gathering

  • In this project I’ll be working with the 20 Newsgroups dataset.
  • The 20 Newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics.
# https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
newsgroups_train = fetch_20newsgroups(subset='train')

# show topics
print(list(newsgroups_train.target_names))
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
categories = [
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'rec.sport.baseball',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'rec.sport.hockey',
 'alt.atheism',
 'soc.religion.christian',
]
dataset = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, remove=('headers', 'footers', 'quotes'))
df1 = pd.DataFrame(dataset['data'], columns = ['content'])

Data Preprocessing

Lemmatizing and removing stopwords

news = df1.copy()
import string

def text_process(text):
    lemmatizer = WordNetLemmatizer()
    # remove punctuation
    nopunc = [char for char in text if char not in string.punctuation] # '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
    # remove digits
    nopunc = ''.join([i for i in nopunc if not i.isdigit()])
    # split the text into words, remove stopwords and lowercase what is left
    # (the stopword check runs before lowercasing, so capitalized stopwords such as "I" slip through)
    nopunc = [word.lower() for word in nopunc.split() if word not in stopwords.words('english')]
    # lemmatize the resulting list of words
    return [lemmatizer.lemmatize(word) for word in nopunc]
t0 = time.time()
news['filtered_content'] = news['content'].apply(text_process).apply(lambda x : " ".join(x))
print(f'Done! Elapsed time: {round(time.time() - t0, 1)} seconds')
Done! Elapsed time: 277.9 seconds
news.shape
(5026, 2)
news.head()
content filtered_content
0 \n\nI have discussed this with my girlfriend o... i discussed girlfriend often i consider marrie...
1 \nYou might want to re-think your attitude abo... you might want rethink attitude holocaust read...
2 \nIt's bad jokes like that which draws crohns,... it bad joke like draw crohn i mean groan crowd...
3 \n\nIf anyone gets the New York Times, the Edi... if anyone get new york time edit page transcri...
4 I apologize for the long delay in getting a re... i apologize long delay getting response posted...

Text vectorizing

  • Vectorization is the process of converting text into a numerical representation.
  • Each value indicates the importance of a given word within the text corpus.

TFIDF

  • TF-IDF is the product of how frequent a word is in a document and how unique that word is across the corpus.
  • TF stands for Term Frequency.
    • It counts how many times a word appears in a given document.
  • IDF stands for Inverse Document Frequency.
    • It measures the relevance of a word relative to the rest of the documents.
    • In a nutshell, IDF measures how rare or how common a word is across the entire dataset.
    • Rare words get higher values than frequent words.
  • In a classification model, for example, rare words tend to be more informative than frequent ones (a toy example is sketched below).
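To make the TF × IDF product concrete, here is a minimal sketch on a made-up three-document corpus (the documents below are hypothetical, purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# hypothetical mini-corpus, just for illustration
toy_corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

toy_vec = TfidfVectorizer()
toy_tfidf = toy_vec.fit_transform(toy_corpus)

# IDF per term: terms that appear in fewer documents get higher values
print(pd.Series(toy_vec.idf_, index = toy_vec.get_feature_names_out()).round(2).sort_values())

# final TF-IDF matrix (term counts weighted by IDF, L2-normalized per document)
print(pd.DataFrame(toy_tfidf.toarray(), columns = toy_vec.get_feature_names_out()).round(2))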
stop_words = stopwords.words('english')
vectorizer = TfidfVectorizer(
                             min_df = 5, 
                             max_df = 0.95,
                             max_features = None,
                             analyzer = 'word',
                             stop_words = stop_words)


tfidf_text = vectorizer.fit_transform(news.filtered_content)

print(f'n_samples:{tfidf_text.shape[0]}, num_features: {tfidf_text.shape[1]}')
n_samples:5026, num_features: 8678
  • max_df is used for removing terms that appear too frequently, also known as “corpus-specific stop words”.
    • max_df = 0.50 means “ignore terms that appear in more than 50% of the documents”.
    • max_df = 25 means “ignore terms that appear in more than 25 documents”.
  • min_df is used for removing terms that appear too infrequently.
    • min_df = 0.01 means “ignore terms that appear in less than 1% of the documents”.
    • min_df = 5 means “ignore terms that appear in less than 5 documents”.
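As a quick sanity check, one could compare how the vocabulary size changes with different thresholds. A minimal sketch, reusing the filtered_content column and stop word list defined above (the exact counts will depend on the corpus):

# vocabulary size for a few min_df thresholds, with max_df fixed at 0.95
for threshold in [1, 5, 10, 0.01]:
    v = TfidfVectorizer(min_df = threshold, max_df = 0.95, stop_words = stop_words)
    v.fit(news.filtered_content)
    print(f'min_df = {threshold}: {len(v.vocabulary_)} terms kept')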
vec = tfidf_text

K-means optimization

  • K-means is an unsupervised learning algorithm.
  • It partitions unlabeled data into groups of similar points.
  • We initially pass an arbitrary value of K and evaluate how well it performs.
  • K corresponds to the number of clusters.
  • Sometimes it is difficult to choose an optimal value of K.
  • To help with that, we can use metrics such as the elbow method and the silhouette score.

Elbow Method

  • The elbow method measures the Euclidean distance from each data point to its cluster center.
  • These distances are squared and summed up, giving the sum of squared errors (also called inertia).
  • As the number of clusters increases, each data point ends up closer to a cluster center, so the sum of squared errors decreases.
  • The main point is to find the "elbow" of the curve: the point past which adding clusters gives smaller and smaller reductions in the sum of squared errors. That elbow is a good candidate for the optimal number of clusters.
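The sum of squared errors plotted below is exactly what scikit-learn exposes as inertia_. A minimal check on toy data (the random 2-D points here are made up, not the news corpus):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# toy data, just to illustrate what the elbow curve is built from
rng = np.random.RandomState(42)
X_toy = rng.rand(200, 2)

km = KMeans(n_clusters = 3, n_init = 10, random_state = 42).fit(X_toy)

# squared Euclidean distance from each point to its closest centroid, summed up
_, dist = pairwise_distances_argmin_min(X_toy, km.cluster_centers_)
print(km.inertia_, (dist ** 2).sum())  # the two numbers should agree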

Silhouette Method

  • Sometimes it's hard to spot a clear elbow in the previous graph, so we can turn to another strategy to find the optimal number: the silhouette method.
  • For each point, the silhouette compares the mean distance to the points of the nearest other cluster (b) with the mean distance to the points of its own cluster (a): s = (b - a) / max(a, b).
  • The higher the average silhouette score, the better the dataset is divided into separate clusters.
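A minimal check of that definition against scikit-learn, on made-up toy blobs (not the news corpus):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.metrics.pairwise import euclidean_distances

# two well-separated toy blobs
rng = np.random.RandomState(42)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels = KMeans(n_clusters = 2, n_init = 10, random_state = 42).fit_predict(X_toy)

# silhouette of the first point, computed by hand
D = euclidean_distances(X_toy)
i = 0
same = (labels == labels[i]) & (np.arange(len(X_toy)) != i)
a = D[i, same].mean()                                                      # mean distance within its own cluster
b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])   # mean distance to the nearest other cluster
print((b - a) / max(a, b), silhouette_samples(X_toy, labels)[i])           # the two values should match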
def optimise_kmeans(data, max_k):
    means = []
    inertias = []
    silhouette_avg = []

    for k in tqdm(range(2, max_k + 1)):
        kmeans = KMeans(n_clusters = k)
        kmeans.fit(data)
        means.append(k)
        inertias.append(kmeans.inertia_)

        # average silhouette score for the current clustering
        cluster_labels = kmeans.labels_
        silhouette_avg.append(silhouette_score(data, cluster_labels))

    fig, ax = plt.subplots(1, 2, figsize = (10, 3))
    fig.tight_layout(w_pad = 4)

    # Elbow method
    ax[0].plot(means, inertias)
    ax[0].set_title('Elbow Method')
    ax[0].set_xlabel('Values of K')
    ax[0].set_ylabel('Sum of Squared Errors')
    ax[0].grid(True)

    # Silhouette
    ax[1].plot(means, silhouette_avg, 'v-', color = 'green')
    ax[1].set_title('Silhouette Score')
    ax[1].set_xlabel('Values of K')
    ax[1].set_ylabel('Score')
    ax[1].grid(True)

    plt.show()
    
    
optimise_kmeans(vec, 10)
100%|████████████████████████████████████████████████████████████████████████████████████| 9/9 [01:05<00:00,  7.26s/it]

[Figure: elbow method (sum of squared errors) and silhouette score for K = 2 to 10]

  • It appears that the “elbow” point isn’t clear enough to define a good value of K from the first plot alone.
  • With the help of the silhouette score, we can see the score drop sharply from 2 to 3 clusters, peak at 4 clusters and become erratic after that; it looks like K-means struggles to find a better partition beyond 4 clusters.
  • That said, I will be taking k = 4.
kmeans = KMeans(init = 'random', 
                n_clusters = 4,
                n_init = 10,
                random_state = 42)
kmeans.fit(vec)
KMeans(init='random', n_clusters=4, random_state=42)

Some insights about the data

news['clusters'] = kmeans.labels_
words = vectorizer.get_feature_names_out()[0:10]
words
array(['aa', 'aaa', 'aaron', 'ab', 'abandon', 'abandoned', 'abbey', 'abc',
       'abiding', 'ability'], dtype=object)
# number of news posts assigned to each cluster
review_groups = news.clusters.value_counts()
review_groups
3    3007
2     781
0     700
1     538
Name: clusters, dtype: int64

Checking clusters

words = vectorizer.get_feature_names_out()

# 14 highest-weighted terms (by cluster-center TF-IDF) in each group
common_words = kmeans.cluster_centers_.argsort()[:,-1:-15:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + " : " + ', '.join(words[word] for word in centroid))
0 : game, team, player, year, season, play, hockey, win, last, league, fan, good, would, think
1 : god, jesus, christian, one, would, people, bible, say, believe, church, faith, belief, know, think
2 : window, file, thanks, program, driver, anyone, card, know, please, email, use, graphic, image, using
3 : would, people, one, dont, think, like, get, know, right, gun, time, say, government, state

Renaming the clusters

  • Cluster 0 seems to be related to sports.
  • Cluster 1 seems to contain posts about religion.
  • Cluster 2 seems related to technology in general.
  • Cluster 3 is related to politics.
# Renaming clusters
def cluster_names(cluster):
    if cluster == 0:
        return 'sports'
    elif cluster == 1:
        return "religion"
    elif cluster == 2:
        return 'technology'
    elif cluster == 3:
        return 'politics'
    
news.clusters = news.clusters.apply(cluster_names)
# show random samples of each group
def sample_reviews(data, amount = 1):
    
    print('')
    for i in data.clusters.unique():
        print(f'cluster: {i}\n')
        for j in range(0, amount):
            print(data['content'][data['clusters'] == i].sample(1))
            print('')
        print(f"{'-' * 90}")
        
sample_reviews(news, 1)
cluster: religion

721    (Dean and I write lots and lots about absolute...
Name: content, dtype: object

------------------------------------------------------------------------------------------
cluster: politics

1576    %>I dunno, Lemieux?  Hmmm...sounds like he\n%>...
Name: content, dtype: object

------------------------------------------------------------------------------------------
cluster: sports

4426    \nPeople are seeming to be less concerned abou...
Name: content, dtype: object

------------------------------------------------------------------------------------------
cluster: technology

185    We have been using Iterated Systems compressio...
Name: content, dtype: object

------------------------------------------------------------------------------------------
sns.countplot(x = news.clusters)
plt.title('Topics Distributions')
plt.show()

[Figure: countplot of the topic distribution across the four clusters]

Number of words

Counting the number of words in each news post.

# num words
pd.set_option('max_colwidth', 100)
news['num_words'] = news["filtered_content"].apply(lambda x: x.split()).apply(len)
news.sample(2)
content filtered_content clusters num_words
718 The most ridiculous example of VR-exploitation I've seen so far is the\n"Virtual Reality Clothin... the ridiculous example vrexploitation ive seen far virtual reality clothing company recently ope... politics 36
2724 Please reply via EMail...\n\nWhen I use the terminal software for Windows such as TERMINAL.EXE o... please reply via email when i use terminal software window terminalexe crossttalk doesnt use who... technology 46

Number of unique words

# number of unique words
news['num_vocab'] = news['filtered_content'].apply(lambda x: x.split()).apply(set).apply(len)
news.sample(2)
content filtered_content clusters num_words num_vocab
3911 \nIt is more appropriate to address netters with their names as they appear in\ntheir signatures... it appropriate address netters name appear signature i failed since bother sign posting not poli... politics 81 74
3685 Hi, \n\nI have a simple question. Is it possible to create a OVERLAPPED THICKFRAME\nwindow witho... hi i simple question is possible create overlapped thickframe window without title bar ie wsover... technology 89 60

Lexical Diversity

  • Lexical diversity can help us understand how complex a text is.
  • Texts that are lexically diverse use a wide range of vocabulary, avoid repetition, use precise language and tend to use synonyms to express ideas.
  • The ratio computed below (total words / unique words) is close to 1 for texts with little repetition and grows as words get repeated (a tiny example follows).
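A tiny, made-up example of the ratio used below:

# total tokens / unique tokens: 1.0 means no word is repeated,
# larger values mean more repetition
text = "the cat sat on the mat"
tokens = text.split()
print(len(tokens) / len(set(tokens)))  # 6 tokens, 5 unique -> 1.2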
news['lexical_div'] = news['num_words'] / news['num_vocab']
news.sample(2)
content filtered_content clusters num_words num_vocab lexical_div
2082 Hello, hello politics 1 1 1.000000
1328 \n\nAs to what that headpiece is....\n\n(by chort@crl.nmsu.edu)\n\nSOURCE: AP NEWSWIRE\n\nThe Va... a headpiece chortcrlnmsuedu source ap newswire the vatican home of genetic misfit michael a gill... politics 161 127 1.267717

Average word length

news["num_char"]=news["filtered_content"].str.len()
news['avg_word_length'] = news['num_char'] / news['num_words']
news.sample(2)
content filtered_content clusters num_words num_vocab lexical_div num_char avg_word_length
4753 HHHHEEEELLLLPPPP Meeeeeee!\n\n\tI installed a 256 color svga driver for my windows last week. ... hhhheeeellllpppp meeeeeee i installed color svga driver window last week this driver downloaded ... technology 77 53 1.452830 534 6.935065
2879 {Dan Johnson asked for evidence that the most effective abuse \nrecovery programs involve meetin... dan johnson asked evidence effective abuse recovery program involve meeting people spiritual nee... politics 38 29 1.310345 276 7.263158

Applying PCA (Principal Component Analysis)

  • PCA helps us visualize the clustered data.
  • As the name suggests, it reduces the data points to their principal components, so we can draw a scatter plot of the data in two dimensions.
  • It's useful for checking whether we chose a reasonable value of K at the clustering step.
centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
# PCA Decomposition
pca = PCA(n_components = 2, random_state = 42)
pca_vecs = pca.fit_transform(vec.toarray())
x0 = pca_vecs[:, 0]
x1 = pca_vecs[:, 1]
news['x0'] = x0
news['x1'] = x1
import seaborn as sns

plt.figure(figsize = (12, 7))
plt.title("TFIDF + KMeans clustering")
plt.xlabel("X0", fontdict = {'fontsize': 16})
plt.ylabel("X1", fontdict = {'fontsize': 16})
sns.scatterplot(data = news, x = 'x0', y = 'x1', hue = 'clusters', palette = 'tab10')
plt.show()

[Figure: scatter plot of the first two principal components, colored by K-means cluster]
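Since only two principal components are kept, it is worth checking how much of the total variance they actually capture. A minimal sketch, reusing the pca object fitted above:

# fraction of the variance captured by each of the two components kept above;
# a low total means the 2-D scatter is only a rough picture of the clusters
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())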

An interactive view

In this plot you can hover over the points in the chart and check whether each news post is related to its assigned cluster.

# the vectorizer vocabulary maps each term to its column index in the TF-IDF matrix
vocab = vectorizer.vocabulary_

data = {'Word': list(vocab.keys()),
        'Index': list(vocab.values())}
df2 = pd.DataFrame(data)
# Visualizing groups

alt.Chart(news.sample(4000)).mark_circle(
    size = 100
).encode(
    x = 'x0',
    y = 'x1',
    color = 'clusters:N',
    tooltip = 'content'
).properties(
    width=750,
    height=500
).interactive()

The End!

Please feel free to contact me if I got something wrong.

Full code HERE