Wednesday, January 8, 2025

Clustering | Introduction, Different Methods, and Applications (Updated 2023)


Introduction

Your first reaction when you come across an unsupervised learning problem for the first time may simply be confusion since you are not looking for specific insights. What you are looking for are structures within the data without them being tied down to a specific outcome. The method of identifying similar groups of data in a large dataset is called clustering or cluster analysis.

Clustering is one of the most popular unsupervised learning techniques used by data scientists. Entities in each group are more similar to other entities in that group than to entities in other groups. In this article, I will take you through the types of clustering, different clustering algorithms, and a comparison between two of the most commonly used clustering methods.


Learning Objectives

  • Learn about clustering in machine learning, one of the most popular unsupervised learning techniques.
  • Get to know K Means and hierarchical clustering and the difference between the two.


What Is Clustering?

Clustering is the task of dividing unlabeled data points into clusters such that points in the same cluster are more similar to each other than to points in other clusters. In simple words, the aim of the clustering process is to segregate groups with similar traits and assign them to clusters.

Let’s understand this with an example. Suppose you are the head of a retail store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. And this is what we call clustering.

Now that we understand what clustering is, let’s take a look at its different types.

Types of Clustering

Broadly speaking, clustering can be divided into two subgroups:

  • Hard Clustering: In this, each input data point belongs to exactly one cluster. For example, in the above example, each customer is put into exactly one of the 10 groups.
  • Soft Clustering: In this, instead of putting each input data point into a single cluster, a probability or likelihood of that data point belonging to each cluster is assigned. For example, in the above scenario, each customer is assigned a probability of being in each of the 10 clusters of the retail store.
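To make the distinction concrete, here is a minimal Python sketch. The membership probabilities are made-up illustrative values, not the output of any real model, and `harden` is a hypothetical helper name. It shows one common way to turn a soft assignment into a hard one: pick the most probable cluster.

```python
# Hypothetical soft-clustering output: for each customer, the probability
# of belonging to each of three clusters (each row sums to 1).
soft_memberships = {
    "customer_a": [0.80, 0.15, 0.05],
    "customer_b": [0.10, 0.45, 0.45],
    "customer_c": [0.30, 0.30, 0.40],
}


def harden(probabilities):
    """Return the index of the most probable cluster.

    On a tie, the lowest cluster index wins (behaviour of max with a key).
    """
    return max(range(len(probabilities)), key=lambda k: probabilities[k])


# Hard clustering keeps only one cluster index per customer.
hard_assignments = {name: harden(p) for name, p in soft_memberships.items()}
print(hard_assignments)
```

Note that customer_b is almost equally likely to belong to two clusters; the hard assignment hides that uncertainty, which is exactly the information soft clustering preserves.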

Different Types of Clustering Algorithms

Since the task of clustering is subjective, there are many ways to achieve this goal. Every methodology follows a different set of rules for defining the ‘similarity’ among data points. In fact, more than 100 clustering algorithms are known, but only a few of them are widely used. Let’s look at them in detail:

  • Connectivity models: As the name suggests, these models are based on the notion that data points closer together in data space are more similar to each other than data points lying farther apart. These models can follow two approaches. In the first approach, they start with every data point in its own cluster and then merge the clusters that are closest to each other. In the second approach, all data points start in a single cluster, which is then split step by step. The choice of distance function is subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are the hierarchical clustering algorithms and their variants.
  • Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid (cluster center) of each cluster. The K Means clustering algorithm is a popular algorithm that falls into this category. In these models, the number of clusters has to be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum.
  • Distribution models: These clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, the normal, i.e. Gaussian, distribution). These models often suffer from overfitting. A popular example is the expectation-maximization algorithm, which uses multivariate normal distributions.
  • Density models: These models search the data space for regions with different densities of data points. They isolate the dense regions and assign the data points within each such region to the same cluster. Popular examples of density models are DBSCAN and OPTICS. These models are particularly useful for identifying clusters of arbitrary shape and for detecting outliers, since points that lie in sparse regions can be separated from points that belong to dense regions.
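As an illustration of the density idea, here is a minimal pure-Python sketch in the spirit of DBSCAN. It is a simplified teaching version, not a reference implementation; the parameter names `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a dense point), the helper `region_query`, and the sample points are assumptions chosen for this example.

```python
import math


def region_query(points, index, eps):
    """Indices of all points within distance eps of points[index] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[index], q) <= eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    noise = -1
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = noise  # sparse region; may later become a border point
            continue
        cluster_id += 1  # points[i] is a dense (core) point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster_id  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is also dense: keep growing
    return labels


# Two dense squares of points and one isolated point in between.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point ends up labelled -1, which is how a density model flags an outlier without being told the number of clusters in advance.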

Now I will be taking you through two of the most popular clustering algorithms in detail – K Means and Hierarchical. Let’s begin.

K Means Clustering

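K Means is an iterative clustering algorithm that moves toward a local optimum with each iteration by reducing the distance between the data points and the centroids of the clusters they are assigned to. This algorithm works in these 5 steps: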

1. Specify the desired number of clusters K: Let us choose k=2 for these 5 data points in 2-D space.

[Figure: the five data points in 2-D space]

2. Randomly assign each data point to a cluster: Let’s assign three points in cluster 1, shown using red color, and two points in cluster 2, shown using grey color.

[Figure: three points coloured red (cluster 1) and two points coloured grey (cluster 2)]

3. Compute cluster centroids: The centroid of data points in the red cluster is shown using the red cross, and those in the grey cluster using a grey cross.

[Figure: cluster centroids shown as a red cross and a grey cross]

4. Re-assign each point to the closest cluster centroid: Note that the data point at the bottom is currently assigned to the red cluster even though it is closer to the centroid of the grey cluster. Thus, we re-assign that data point to the grey cluster.

[Figure: the bottom data point re-assigned to the grey cluster]

5. Re-compute cluster centroids: Now, re-computing the centroids for both clusters.

[Figure: re-computed centroids for both clusters]

Repeat steps 4 and 5 until no improvements are possible: We repeat the 4th and 5th steps until the algorithm converges, i.e., until no data point switches between clusters for two successive repeats. Note that this is a local optimum and not necessarily the global one. Unless a maximum number of iterations is specified explicitly, this marks the termination of the algorithm.
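The steps above can be sketched in plain Python as follows. This is a minimal illustration under some assumptions, not a production implementation: the five sample points, the `seed` and `max_iter` parameters, and the helper names are all chosen for this example, and the random initial assignment is dealt out round-robin after shuffling so that no cluster starts empty.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: random initial assignment (shuffle, then deal out round-robin
    # so that every cluster receives at least one point; needs len(points) >= k).
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for position, index in enumerate(order):
        labels[index] = position % k
    centroids = [None] * k
    for _ in range(max_iter):
        # Steps 3 and 5: (re-)compute the centroid of every cluster.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = centroid(members)
        # Step 4: re-assign each point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # no point switched cluster: converged
            break
        labels = new_labels
    return labels, centroids


# Step 1: choose k = 2 for five points in 2-D space (illustrative values).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Because the starting assignment is random, the numeric cluster ids can differ between seeds even when the grouping itself is the same.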




Hierarchical Clustering

Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. This algorithm starts with each data point assigned to a cluster of its own. Then, at each step, the two nearest clusters are merged into one. The algorithm terminates when only a single cluster is left.

The results of hierarchical clustering can be shown using a dendrogram. The dendrogram can be interpreted as:

[Figure: dendrogram of 25 data points]

At the bottom, we start with 25 data points, each assigned to separate clusters. The two closest clusters are then merged till we have just one cluster at the top. The height in the dendrogram at which two clusters are merged represents the distance between two clusters in the data space.
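The bottom-up merging that a dendrogram records can be sketched in plain Python. This is a naive illustration, assuming Euclidean distance between points and single linkage (the distance between two clusters is the distance between their closest members); the article does not fix a linkage rule, so that choice, the function names, and the sample points are assumptions.

```python
import math


def single_linkage(cluster_a, cluster_b, points):
    """Distance between the two closest members of the two clusters."""
    return min(math.dist(points[i], points[j]) for i in cluster_a for j in cluster_b)


def agglomerative(points):
    """Return the list of merges as (cluster_a, cluster_b, height) tuples."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters that are closest to each other.
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = single_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
merges = agglomerative(points)
for left, right, height in merges:
    print(left, "+", right, "merged at height", round(height, 2))
```

Each printed height corresponds to the level at which two branches join in the dendrogram. The nested loops make this sketch far too slow for large datasets, which mirrors the scalability concern raised for connectivity models.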

The number of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum vertical distance without intersecting a merge point (a horizontal link that joins two clusters).

In the above example, the best choice of no. of clusters will be 4 as the red horizontal line in the dendrogram below covers the maximum vertical distance AB.

[Figure: dendrogram with a red horizontal line cutting four vertical lines across the distance AB]
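The "longest vertical distance" rule amounts to finding the largest gap between consecutive merge heights. Here is a small Python sketch of that rule; the function name and the list of merge heights are hypothetical values for illustration, not read off the dendrogram in the article.

```python
def clusters_from_largest_gap(merge_heights):
    """Pick the number of clusters by cutting the dendrogram in its largest gap.

    merge_heights holds the height of every merge (n - 1 values for n points).
    Cutting above the i-th lowest merge leaves n - i clusters.
    """
    heights = sorted(merge_heights)
    n_points = len(heights) + 1
    best_gap, merges_below_cut = max(
        (heights[i + 1] - heights[i], i + 1) for i in range(len(heights) - 1)
    )
    return n_points - merges_below_cut, best_gap


# Hypothetical merge heights for 5 data points (4 merges).
n_clusters, gap = clusters_from_largest_gap([1.0, 1.0, 6.4, 7.07])
print(n_clusters, round(gap, 2))  # 3 5.4
```

Here the widest gap lies between the merges at heights 1.0 and 6.4, so a horizontal line drawn in that gap crosses three vertical lines and suggests three clusters.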

Two important things that you should know about hierarchical clustering are:

  1. The algorithm has been described above using a bottom-up approach. It is also possible to follow a top-down approach, starting with all data points assigned to the same cluster and recursively performing splits until each data point is assigned to a separate cluster.
  2. The decision to merge two clusters is taken on the basis of the closeness of these clusters. There are multiple distance metrics for measuring this closeness:
  • Euclidean distance: $\|a-b\|_2 = \sqrt{\sum_i (a_i-b_i)^2}$
  • Squared Euclidean distance: $\|a-b\|_2^2 = \sum_i (a_i-b_i)^2$
  • Manhattan distance: $\|a-b\|_1 = \sum_i |a_i-b_i|$
  • Maximum distance: $\|a-b\|_\infty = \max_i |a_i-b_i|$
  • Mahalanobis distance: $\sqrt{(a-b)^T S^{-1} (a-b)}$, where $S$ is the covariance matrix
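For reference, these metrics can be written as short Python functions. This is a sketch using only the standard library; the function names and sample vectors are chosen for this example, and the Mahalanobis version assumes the caller already has the inverse covariance matrix (here an illustrative diagonal one) rather than computing it.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def maximum(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


def mahalanobis(a, b, inv_cov):
    """sqrt((a-b)^T S^-1 (a-b)), with inv_cov = S^-1 given as a list of rows."""
    diff = [x - y for x, y in zip(a, b)]
    weighted = [
        sum(inv_cov[i][j] * diff[j] for j in range(len(diff)))
        for i in range(len(diff))
    ]
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


a, b = (1.0, 2.0), (4.0, 6.0)
inv_cov = [[0.25, 0.0], [0.0, 1.0]]  # assumed inverse covariance matrix

print(euclidean(a, b))          # 5.0
print(squared_euclidean(a, b))  # 25.0
print(manhattan(a, b))          # 7.0
print(maximum(a, b))            # 4.0
print(mahalanobis(a, b, inv_cov))
```

With the identity matrix as `inv_cov`, the Mahalanobis distance reduces to the Euclidean distance, which is a quick sanity check for the implementation.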

Difference Between K Means and Hierarchical Clustering

  • Hierarchical clustering can’t handle big data well, but K Means can. This is because the time complexity of K Means is linear in the number of data points, i.e., O(n), while that of hierarchical clustering is at least quadratic, i.e., O(n²).
  • Since we start with a random choice of clusters, the results produced by running K Means multiple times might differ, whereas the results of hierarchical clustering are reproducible.
  • K Means is found to work well when the shape of the clusters is hyperspherical (like a circle in 2D or a sphere in 3D).
  • K Means clustering requires prior knowledge of K, i.e., no. of clusters you want to divide your data into. But, you can stop at whatever number of clusters you find appropriate in hierarchical clustering by interpreting the dendrogram.

Applications of Clustering

Clustering has a large no. of applications spread across various domains. Some of the most popular applications of clustering are recommendation engines, market segmentation, social network analysis, search result grouping, medical imaging, image segmentation, and anomaly detection.

Improving Supervised Learning Algorithms With Clustering

Clustering is an unsupervised machine learning approach, but can it also be used to improve the accuracy of supervised machine learning algorithms? The idea is to cluster the data points into similar groups and use the cluster labels as an additional independent variable in the supervised algorithm. Let’s find out.

Let’s check out the impact of clustering on the accuracy of our model for the classification problem using 3000 observations with 100 predictors of stock data to predict whether the stock will go up or down using R. This dataset contains 100 independent variables from X1 to X100 representing the profile of a stock and one outcome variable Y with two levels: 1 for the rise in stock price and -1 for drop in stock price.

The dataset is available here: Download

Let’s first try applying random forest without clustering in R.

#loading required libraries
library('randomForest')
library('Metrics')

#set random seed
set.seed(101)

#loading dataset
data<-read.csv("train.csv",stringsAsFactors= T)

#checking dimensions of data
dim(data)

## [1] 3000  101

#specifying outcome variable as factor
data$Y<-as.factor(data$Y)

#dividing the dataset into train and test
train<-data[1:2000,]
test<-data[2001:3000,]

#applying randomForest
model_rf<-randomForest(Y~.,data=train)

preds<-predict(object=model_rf,test[,-101])

table(preds)

## preds
##  -1   1
## 453 547

#checking AUC
auc(preds,test$Y)

## [1] 0.4522703

So, the score we get (an AUC, as computed by auc()) is 0.45. Now let’s create five clusters based on the values of the independent variables using k-means and reapply random forest.

#combining test and train
all<-rbind(train,test)

#creating 5 clusters using K-means clustering
Cluster <- kmeans(all[,-101], 5)

#adding clusters as independent variable to the dataset
all$cluster<-as.factor(Cluster$cluster)

#dividing the dataset into train and test
train<-all[1:2000,]
test<-all[2001:3000,]

#applying randomForest
model_rf<-randomForest(Y~.,data=train)

preds2<-predict(object=model_rf,test[,-101])

table(preds2)

## preds2
##  -1   1
## 548 452

#checking AUC
auc(preds2,test$Y)

## [1] 0.5345908

Whoo! In the above example, even though the final score is still poor, clustering has given our model a noticeable boost, from 0.45 to slightly above 0.53.

This suggests that clustering can indeed be helpful for supervised machine learning tasks.
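The core of the idea, turning cluster membership into an extra input column, can be sketched in a few lines of Python. The centroids and rows below are made-up values and `add_cluster_feature` is a hypothetical helper. Unlike the R code above, which clusters train and test rows together, this sketch assumes the centroids were fitted on the training rows only and are then reused for any new rows.

```python
def nearest_centroid(row, centroids):
    """Index of the centroid closest to the row (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centroids[c])),
    )


def add_cluster_feature(rows, centroids):
    """Return copies of the rows with the cluster index appended as a last column."""
    return [list(row) + [nearest_centroid(row, centroids)] for row in rows]


# Assumed centroids, e.g. obtained from k-means on the training rows.
centroids = [(0.0, 0.0), (10.0, 10.0)]
test_rows = [(1.0, 2.0), (9.0, 8.5), (4.0, 4.0)]

augmented = add_cluster_feature(test_rows, centroids)
print(augmented)  # [[1.0, 2.0, 0], [9.0, 8.5, 1], [4.0, 4.0, 0]]
```

The appended index is a category, not a quantity, so a model should treat it as a categorical variable, just as the R code does with as.factor.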

Conclusion

In this article, we have discussed the various ways of performing clustering. We came across applications for unsupervised learning in a large no. of domains and also saw how to improve the accuracy of a supervised machine learning algorithm using clustering.

Although clustering is easy to implement, you need to take care of some important aspects, like treating outliers in your data and making sure each cluster has a sufficient population. These aspects of clustering are dealt with in great detail in this article.

Key Takeaways

  • Clustering helps to identify patterns in data and is useful for exploratory data analysis, customer segmentation, anomaly detection, pattern recognition, and image segmentation.
  • It is a powerful tool for understanding data and can help to reveal insights that may not be apparent through other methods of analysis.
  • Its types include partition-based, hierarchical, density-based, and grid-based clustering.
  • The choice of clustering algorithm and the number of clusters to use depend on the nature of the data and the specific problem at hand.

Frequently Asked Questions

Q1. What is agglomerative clustering, and how does it work?

A. Agglomerative clustering is a popular data mining technique that groups data points based on their similarity, using a distance metric such as Euclidean distance. Different distance measures can be used depending on the type of data being analyzed. It is a bottom-up approach that merges similar clusters iteratively, and the resulting hierarchy can be represented as a dendrogram.

Q2. What are the advantages of DBSCAN?

A. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) has several advantages over other clustering algorithms, such as its ability to find clusters of arbitrary shape, to handle noise, and to determine the number of clusters automatically. It is also computationally efficient and can scale to large datasets.

Q3. What is a hierarchical clustering structure?

A. A hierarchical clustering structure is a tree-like arrangement of clusters, with the individual data points at the bottom and the root node at the top. Hierarchical clustering can be further divided into agglomerative and divisive approaches.

Saurav is a Data Science enthusiast, currently in the final year of his graduation at MAIT, New Delhi. He loves to use machine learning and analytics to solve complex data problems.
