Hierarchical clustering is a widely used method for grouping data into clusters. In its agglomerative (bottom-up) form, every data point starts as its own cluster and the two most similar clusters are merged repeatedly, so that similar data points end up in the same cluster and dissimilar data points in different clusters.
Two key decisions in hierarchical clustering are how to measure the similarity between clusters and which clusters are allowed to merge. In scikit-learn, Ward linkage is a common answer to the first question, and it can be run in two ways: structured Ward (with a connectivity constraint) and unstructured Ward (without one).
Structured Ward Clustering:
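- Structured Ward is a hierarchical clustering method that combines the Ward linkage criterion with a connectivity constraint: a graph of neighboring samples (for example, a k-nearest-neighbors graph) is passed through the connectivity parameter of AgglomerativeClustering, and only clusters that are adjacent in that graph may be merged.
- At each step, the Ward criterion merges the pair of clusters whose union gives the smallest increase in the within-cluster sum of squared distances to the cluster centroid.
- The goal of this method is to minimize the total within-cluster variance, which favors compact clusters; the connectivity constraint additionally makes the clusters follow the local structure of the data.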
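The Ward criterion can be checked by hand on a toy example. The sketch below is not scikit-learn's implementation; the helper names (within_ss, ward_merge_cost, ward_merge_cost_closed_form) and the three small clusters are made up for illustration. It computes how much the within-cluster sum of squares grows when two clusters are merged, both directly and with the standard closed form n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2, where c_a and c_b are the cluster centroids.

```python
import numpy as np


def within_ss(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    points = np.asarray(points, dtype=float)
    return float(((points - points.mean(axis=0)) ** 2).sum())


def ward_merge_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    merged = np.vstack([a, b])
    return within_ss(merged) - within_ss(a) - within_ss(b)


def ward_merge_cost_closed_form(a, b):
    """Same quantity, computed from cluster sizes and centroids only."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a.mean(axis=0) - b.mean(axis=0)
    return float(len(a) * len(b) / (len(a) + len(b)) * (diff @ diff))


# Three hypothetical two-point clusters
cluster_a = np.array([[0.0, 0.0], [0.0, 2.0]])
cluster_b = np.array([[4.0, 0.0], [4.0, 2.0]])
cluster_c = np.array([[0.0, 10.0], [0.0, 12.0]])

print(ward_merge_cost(cluster_a, cluster_b))  # 16.0
print(ward_merge_cost(cluster_a, cluster_c))  # 100.0
```

Merging cluster_a with cluster_b is cheaper than merging it with cluster_c, so a Ward step would pick the first pair, provided the connectivity constraint (if any) allows it.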
Unstructured Ward Clustering:
- Unstructured Ward is a hierarchical clustering method that uses the same Ward linkage criterion but without any connectivity constraint.
- Any two clusters may be merged, so the hierarchy depends only on the Euclidean distances between data points (Ward linkage in scikit-learn supports only the Euclidean metric, in both variants).
- This tends to give compact, roughly spherical clusters, which can cut across elongated or curved structures in the data.
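In scikit-learn, the only difference between the two variants is whether a connectivity matrix is passed to AgglomerativeClustering. The following is a minimal sketch on six hand-picked one-dimensional points; the data, the variable names, and n_neighbors=3 are illustrative choices rather than values from this article.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.neighbors import kneighbors_graph

# Two obvious groups on a line: {0, 1, 2} and {5, 6, 7}
X_toy = np.array([[0.0], [1.0], [2.0], [5.0], [6.0], [7.0]])

# Sparse graph linking every sample to its 3 nearest neighbours
toy_connectivity = kneighbors_graph(X_toy, n_neighbors=3, include_self=False)

# Unstructured Ward: no connectivity, any two clusters may merge
unstructured = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X_toy)

# Structured Ward: merges restricted to clusters adjacent in the graph
structured = AgglomerativeClustering(
    n_clusters=2, linkage="ward", connectivity=toy_connectivity
).fit(X_toy)

# Row i marks the samples that sample i is linked to
print(toy_connectivity.toarray().astype(int))
print("unstructured:", unstructured.labels_)
print("structured:  ", structured.labels_)

# An adjusted Rand index of 1.0 means identical partitions up to label names
toy_agreement = adjusted_rand_score(unstructured.labels_, structured.labels_)
print(toy_agreement)
```

On data this cleanly separated the constraint does not change the result; it starts to matter when points that are close in Euclidean distance are not neighbors in the graph.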
Use cases:
- Overall, both structured and unstructured Ward are effective methods for hierarchical clustering in scikit-learn.
- The choice between the two methods depends on the characteristics of the data and the desired properties of the clusters.
- Structured Ward is generally better suited when neighboring samples are known or expected to belong together, such as pixels on an image grid or points lying on a curved manifold; the sparse connectivity graph also usually makes the algorithm faster on large datasets.
- Unstructured Ward is better suited when no such neighborhood structure is known and the clusters are expected to be compact and well separated.
- Let’s see it in the example below.
- The silhouette score, which ranges from -1 to 1 with higher values indicating tighter and better-separated clusters, gives one way to compare the two variants on a dataset.
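To make the silhouette score concrete, here is a small sketch on four made-up points. For each sample it compares a, the mean distance to the other members of its own cluster, with b, the smallest mean distance to the members of any other cluster, and returns (b - a) / max(a, b). The helper silhouette_of_point is a hypothetical name written for this illustration; only silhouette_score comes from scikit-learn.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Four points on a line forming two tight pairs
points = np.array([[0.0], [1.0], [10.0], [11.0]])

good_labels = np.array([0, 0, 1, 1])  # each pair kept together
bad_labels = np.array([0, 1, 0, 1])   # each pair split apart


def silhouette_of_point(points, labels, i):
    """Silhouette of sample i: (b - a) / max(a, b)."""
    dists = np.linalg.norm(points - points[i], axis=1)
    same = labels == labels[i]
    same[i] = False  # a sample is not compared with itself
    a = dists[same].mean()  # mean distance within its own cluster
    # smallest mean distance to the members of any other cluster
    b = min(dists[labels == k].mean() for k in set(labels) if k != labels[i])
    return (b - a) / max(a, b)


manual_good = np.mean(
    [silhouette_of_point(points, good_labels, i) for i in range(len(points))]
)
sklearn_good = silhouette_score(points, good_labels)
sklearn_bad = silhouette_score(points, bad_labels)

print(round(float(manual_good), 4), round(float(sklearn_good), 4))  # both about 0.90
print(round(float(sklearn_bad), 4))  # negative: the pairs were split apart
```

The hand computation and silhouette_score agree on the good labeling, and the score drops below zero when each point is assigned to the cluster of the far-away pair.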
Here is an example of how to use hierarchical clustering with structured and unstructured Ward in scikit-learn:
Python code for Structured Ward clustering:
Python3
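from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.neighbors import kneighbors_graph

# Generate some sample data (unseeded, so every run draws new points)
X, y = make_blobs(n_samples=10000, n_features=8, centers=5)

# Connectivity graph linking each sample to its nearest neighbours
# (n_neighbors=10 is an illustrative choice)
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)

# Create a structured Ward hierarchical clustering object:
# only clusters that are adjacent in the graph may be merged
structured_ward = AgglomerativeClustering(
    n_clusters=5, linkage='ward', connectivity=connectivity)
structured_ward.fit(X)

# Print the labels for each data point and the silhouette score
print("Structured Ward labels:", structured_ward.labels_)
print(silhouette_score(X, structured_ward.labels_))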
Output (from one sample run; make_blobs is unseeded here, so the labels and score vary between runs):
Structured Ward labels: [2 4 3 ... 3 4 0]
0.6958103589455868
Python code for Unstructured Ward clustering:
Python3
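from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Generate some sample data (unseeded, so every run draws new points)
X, y = make_blobs(n_samples=10000, n_features=8, centers=5)

# Create an unstructured Ward hierarchical clustering object:
# no connectivity matrix, so any two clusters may be merged.
# Ward linkage only supports Euclidean distance, which is also the default;
# scikit-learn >= 1.2 names this parameter `metric` (formerly `affinity`).
unstructured_ward = AgglomerativeClustering(
    n_clusters=5, linkage='ward', metric='euclidean')
unstructured_ward.fit(X)

# Print the labels for each data point and the silhouette score
print("Unstructured Ward labels:", unstructured_ward.labels_)
print(silhouette_score(X, unstructured_ward.labels_))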
Output (from one sample run; make_blobs is unseeded here, so the labels and score vary between runs):
Unstructured Ward labels: [3 0 2 ... 1 4 0]
0.7733847795088261
- This code generates sample data with the make_blobs function and then uses the AgglomerativeClustering class to perform hierarchical clustering with Ward linkage.
- Structured Ward is obtained by passing a connectivity matrix to AgglomerativeClustering; without one, the estimator performs unstructured Ward.
- Ward linkage always uses the Euclidean distance metric, so specifying it explicitly does not change the result.
- The labels for each data point and the silhouette score are then printed for each clustering method.
- Because each snippet draws its own random sample from make_blobs, the two silhouette scores above are measured on different data. The gap between them therefore does not show that one variant is better; a fair comparison requires fitting both on the same X.
You can try running this code yourself to see the results and compare the clusters produced by structured and unstructured Ward. You can also try experimenting with different settings and parameters to see how they affect the clustering results.
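One way to set up such a comparison is to fit both variants on a single seeded dataset. The sketch below makes several assumptions that are not from the original example: random_state=0, a smaller n_samples=2000 so it runs quickly, and a 10-nearest-neighbors graph as the connectivity constraint. The adjusted Rand index is added as a measure of how similar the two labelings are.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.neighbors import kneighbors_graph

# One seeded dataset shared by both variants
X_cmp, _ = make_blobs(n_samples=2000, n_features=8, centers=5, random_state=0)

# Unstructured Ward: no connectivity constraint
unstructured_cmp = AgglomerativeClustering(n_clusters=5, linkage="ward").fit(X_cmp)

# Structured Ward: merges limited to a 10-nearest-neighbour graph
knn_graph = kneighbors_graph(X_cmp, n_neighbors=10, include_self=False)
structured_cmp = AgglomerativeClustering(
    n_clusters=5, linkage="ward", connectivity=knn_graph
).fit(X_cmp)

# Silhouette scores are now comparable because the data is the same
scores = {
    "unstructured": silhouette_score(X_cmp, unstructured_cmp.labels_),
    "structured": silhouette_score(X_cmp, structured_cmp.labels_),
}
agreement = adjusted_rand_score(unstructured_cmp.labels_, structured_cmp.labels_)

for name, score in scores.items():
    print(f"{name:>12} Ward silhouette: {score:.4f}")
print(f"agreement between the two labelings (ARI): {agreement:.4f}")
```

scikit-learn may warn that the neighbor graph has several connected components and complete it automatically; that is expected when the blobs are far apart. Differences between the two variants are more likely to show up on data with elongated or curved structure than on well-separated blobs.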