K-means Clustering in Machine Learning

K-means Clustering in Machine Learning

5 mins read2.7K Views Comment
Updated on Mar 17, 2023 14:18 IST

When you are dealing with Machine Learning problems that work with unlabeled training datasets, the most common learning algorithms you will come across are clustering algorithms. Amongst them, the simplest and the most popular is K-means clustering.

2022_04_K-means-Clustering-in-Machine-Learning.jpg

Typically, a good clustering algorithm groups the data into clusters such that the data points within each cluster are more similar to each other than the data points belonging to other clusters. Clustering algorithms find their usage in many applications including pattern detection, medical diagnosis, market segmentation, customer segmentation, etc. Let’s understand K-meanS Clustering in Machine Learning.

Explore Online Machine Learning Courses

In this blog, we will go through the following sections:

What is K-means Clustering?

K-means is a clustering algorithm designed to divide data points into distinct clusters depending on the similarities between the observations. Data points that share certain relevant characteristics are grouped together, implying that data points in different clusters would be dissimilar from each other.

This is an unsupervised learning algorithm, which means you can train a model to create clusters on some given data without needing to label it first.

Diagram

Description automatically generated
Recommended online courses

Best-suited Machine Learning courses for you

Learn Machine Learning with these high-rated online courses

2.5 L
2 years
2.5 L
2 years
1.53 L
11 months
34.65 K
11 months
5.6 L
18 months
– / –
8 hours
– / –
6 months

How to choose the value of K?

The number of clusters ‘K’ you want to group your data points into, has to be pre-defined. Determining the optimal number of clusters for a given data is a fundamental issue in K-means clustering. The answer to this question is somewhat subjective and mostly depends on the methods used for partitioning. The most commonly used method in the industry to identify the value of K is the elbow method.

The Elbow Method

In this method, the idea behind partitioning is to define clusters such that total intra-cluster variation or the total within the sum of squares (WSS) is minimized for each cluster. WSS measures the compactness of the clusters – so you’d want the value to be as small as possible.

The method consists of plotting the total WSS as a function of the various number of clusters. And then picking the ‘elbow’ of the curve as the appropriate number of clusters to use. See the example below.

Chart, line chart

Description automatically generated

The motive behind this method is the number of clusters should be chosen such that adding another cluster would not increase the total WSS.

The code to generate a graph and the modeling component of this algorithm is provided below in the Demo: Implementing K-means Clustering section of this blog.

How does the K-means algorithm work?

  1. Choose the value of K.

For example, let’s assume the value of K is 3. 

  1. Select ‘K’ data points each of which represents a cluster centroid.

Since K=3, we will choose 3 random data points as centroids.

Chart, scatter chart

Description automatically generated
  1. Assign the rest of the data points to their nearest cluster centroids.

We will calculate the Euclidean distance between the data points and the centroids. The data point at a minimum distance with a given centroid is assigned to its cluster.

If we consider two points Pp1,p2 and Q(q1,q2) in a 2D space, the shortest distance between these two points is given as the length of the hypotenuse of a right triangle. This is the Euclidean distance between the two points.

2022_04_image-49.jpg

The value of h is the Euclidean distance.

Chart, scatter chart

Description automatically generated
  1. Reposition the cluster centroid after each data point assignment till it’s the average of all data points in that cluster.

Once a data point is assigned to the nearest cluster, the cluster centroid is re-calculated, including the new point.

Chart, scatter chart

Description automatically generated
  1. Repeat steps 3 and 4 till there are no more changes in each cluster.
Chart, scatter chart

Description automatically generated

After the final iteration, this is how the clusters are formed:

A picture containing sky

Description automatically generated

K Means Clustering Demo in Python – Try it yourself

Click the below colab icon to run and practice the demo.

google-collab

Demo: Implementing K-means Clustering – Explanation

Problem Statement:

We will demonstrate how to implement K-means clustering on a given dataset. We will determine the value of K using the Elbow Method we have studied above.

Please keep in mind that K-means clustering is not performed on categorical data.

Dataset Description:

We will be creating a dataset using the make_blobs API from the Scikit-learn library. This API is used to create multi-class datasets by allocating each class to one or more normally distributed clusters.

Tasks to be performed:

  1. Generate the data 
  2. Visualize the data clusters
  3. Plot the Elbow Curve
  4. Build and Visualize the K-means Clustering Model

Step 1 – Generate the data

Let us generate a dataset with, say, 12 random clusters in a 2D space. For that, we will initialize this AdaBoost ensemble model with the following parameters:

  • n_samples = 1000 – Create 1000 samples
  • n_features = 2 – The dataset will have two features
  • centers = 12 – Create 12 random clusters with 12 centroids
 
import pandas as pd
from sklearn.datasets import make_blobs
#Generate the data
X, y = make_blobs(n_samples=1000, n_features=2, centers=12, random_state=42)
#Create a DataFrame for the dataset features
df = pd.DataFrame(X, columns=['feature1','feature2'])
Copy code

Step 2 – Visualize the data clusters 

We will use a Seaborn scatter plot to visualize our data with its 2 features:

 
import seaborn as sns
import matplotlib.pyplot as plt
#Create a scatter plot using seaborn
fig = plt.figure (figsize=(10,6))
sns.scatterplot(x='feature1', y='feature2', data=df)
Copy code
Chart, scatter chart

Description automatically generated

Though we specified 12 centers, the plot above shows an overlap between some clusters. So, have only 7-8 separable clusters right now.

Step 3 – Plot the Elbow Curve 

We will use the Yellowbrick library, which is a wrapper around Scikit-learn. This library will help us plot the Elbow curve with only one line of code:

 
from yellowbrick.cluster import KElbowVisualizer
model = KMeans()
#K is the range of number of clusters
visualizer = KElbowVisualizer(model, k=(2,18), timings=False)
visualizer.fit(X) #Fit the data to the visualizer
visualizer.show() #Display the figure
Copy code
Chart, line chart

Description automatically generated

We found that the optimal value of the number of clusters using the Elbow Method is 6. Let’s see how our model clusters the data when K=6.

Step 4 – Build and visualize the K-means Clustering Model 

Let’s see how our model clusters the data when K=6:

 
from sklearn.cluster import KMeans
#Create the model
model = KMeans(n_clusters=6)
#Fit values to the model
model.fit(X)
#Predict the output labels 
df['y_pred']= model.predict(X)
#Convert the output column to category
df['y_pred']= df['y_pred'].astype('category')
#Plot the clusterd data using seaborn
fig = plt.figure (figsize=(10,6))
sns.scatterplot(x='feature1', y='feature2', hue='y_pred', data=df)
Copy code
Chart, scatter chart

Description automatically generated

Thus, we have a perfectly grouped data divided into 6 defined clusters.

What is Polynomial Regression in Machine Learning?
Transfer Learning in Machine Learning: Techniques for Reusing Pre-Trained model
Machine Learning for Fraud Detection

Assumptions of K-means Clustering

Like most machine learning models, K-means clustering also holds certain assumptions about the data they fit on. It is important to check on these assumptions before we rely on our trained model:

  1. K-means cannot handle non-globular structure

A given data can have any type or number of patterns that can be interpreted visually. Since the K-means algorithm uses centroids as a reference, it does not capture the information well if the data has a non-globular pattern.

Chart, scatter chart

Description automatically generated
  1. K-means is susceptible to outliers. 

We already know that in K-means, centroids are re-calculated repeatedly by averaging the data points in the cluster. Averages are sensitive to outliers. So, it becomes necessary to spot and remove outliers before training our model.

Endnotes

K-means clustering is a powerful and extensively used technique for data cluster analysis in Machine Learning. The only tough part of the algorithm is identifying an ideal value of K. However, due to its limitations, its performance might not be the best among other clustering techniques because the slightest variations in the data could lead to a high variance error.

About the Author

This is a collection of insightful articles from domain experts in the fields of Cloud Computing, DevOps, AWS, Data Science, Machine Learning, AI, and Natural Language Processing. The range of topics caters to upski... Read Full Bio