In this article, we will see how to cut a hierarchical dendrogram into clusters via a threshold value using SciPy in Python.
A dendrogram is a tree diagram that shows the result of hierarchical clustering, i.e. the relationships between similar sets of data. It is used to analyze the hierarchical relationship between the different classes. The scipy.cluster package provides the tools needed for hierarchical clustering and dendrogram plotting, so it has to be imported into the environment.
Let us first create some sample data and plot it. We take a handful of arbitrary data points as our input and will plot their dendrogram later.
Example: Sample program for creating and visualizing data
Python3
# Importing the libraries
from scipy.cluster import hierarchy
from scipy.cluster.hierarchy import dendrogram
import numpy as np
import matplotlib.pyplot as plt

# The data points are given as a list of lists
data = np.array([
    [1, 4], [2, 2], [3, 7], [4, 6],
    [5, 1], [6, 3], [8, 10], [9, 11]
])

# Taking the transpose to split the x and y coordinates
x, y = data.T

# Plot the points in X,Y coordinates
plt.scatter(x, y)
plt.show()
Output: [Figure: scatter plot of the sample data points]
The dendrogram can be plotted easily using the linkage matrix. A linkage matrix is created via the linkage() function. This matrix encodes the hierarchical clustering, i.e. the sequence of merges and their distances, which is what gets rendered as a dendrogram.
Syntax:
hierarchy.linkage(y, method='single', metric='euclidean', optimal_ordering=False)
Parameters:
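- y: the input, given either as a condensed (1D) distance matrix or as a 2D array of observation vectors with one row per observation
- method: the method for calculating the distance between a newly formed cluster and the other clusters or points. Supported values are 'single', 'complete', 'average', 'weighted', 'centroid', 'median' and 'ward'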
- metric: distance metric to use in the case that input is a collection of observation vectors
- optimal_ordering: If True, the linkage matrix will be reordered so that the distance between successive leaves is minimal
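To make the return value more concrete, here is a minimal sketch that builds the linkage matrix for the eight sample points used in this article and inspects it. Each row records one merge: the indices of the two clusters that were joined, the distance between them, and the number of original points in the new cluster. The variable names (final_distance, final_count) are only illustrative.

```python
import numpy as np
from scipy.cluster import hierarchy

# Same sample points as used elsewhere in this article
data = np.array([[1, 4], [2, 2], [3, 7], [4, 6],
                 [5, 1], [6, 3], [8, 10], [9, 11]])

Z = hierarchy.linkage(data, method='average', metric='euclidean')

# One row per merge: n points give n - 1 rows, each with 4 columns
print(Z.shape)

# Columns: cluster index 1, cluster index 2, merge distance, points in new cluster
for left, right, dist, count in Z:
    print(int(left), int(right), round(float(dist), 3), int(count))

# The last row is the final merge, which joins all points into one cluster
final_distance = Z[-1, 2]
final_count = int(Z[-1, 3])
```

The merge distances in the third column are the heights at which the links are drawn in the dendrogram, so they are the values a cut-off is compared against.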
Example: Creating a Dendrogram for our data
Python3
# Creating a dendrogram for our data
# Z = linkage matrix
Z = hierarchy.linkage(data, method='average')

plt.figure()
plt.title("Dendrograms")

# Dendrogram plotting using the linkage matrix
# (named dn so that it does not shadow the imported dendrogram function)
dn = hierarchy.dendrogram(Z)
plt.show()
Output: [Figure: dendrogram of the sample data]
Now, let's cut the dendrogram at a threshold value. We have chosen a cut-off (threshold) value of 4. At this height, a horizontal line can be drawn across the dendrogram to mark the cut.
For a chosen cut-off value, the number of clusters is simply the number of vertical lines of the dendrogram that the horizontal cut line crosses. With max_d = 4 the line crosses four vertical lines, giving 4 clusters. If we instead chose a cut-off of max_d = 6, we would get 2 final clusters.
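Counting intersections on the plot gives the number of clusters, but not which point belongs to which cluster. One common way to obtain the actual labels is SciPy's fcluster() function with criterion='distance'. It is not part of the example in this article, so treat the following as a minimal sketch on the same sample data, with illustrative variable names.

```python
import numpy as np
from scipy.cluster import hierarchy

# Same sample points as used elsewhere in this article
data = np.array([[1, 4], [2, 2], [3, 7], [4, 6],
                 [5, 1], [6, 3], [8, 10], [9, 11]])

Z = hierarchy.linkage(data, method='average')

# criterion='distance' cuts the tree at height t: points end up in the
# same flat cluster only if they are merged at or below that distance
labels_4 = hierarchy.fcluster(Z, t=4, criterion='distance')
labels_6 = hierarchy.fcluster(Z, t=6, criterion='distance')

# Number of distinct cluster labels for each cut-off
n_clusters_4 = len(np.unique(labels_4))
n_clusters_6 = len(np.unique(labels_6))

print(labels_4, n_clusters_4)
print(labels_6, n_clusters_6)
```

The returned array holds one integer label per input point, in the same order as the rows of data, so it can be passed directly to plt.scatter(x, y, c=labels_4) to color the original scatter plot by cluster.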
Example: Cutting dendrogram at the threshold value
Python3
# Creating a dendrogram for our data
# max_d = cut-off / threshold value
max_d = 4

Z = hierarchy.linkage(data, method='average')

plt.figure()
plt.title("Dendrograms")
dn = hierarchy.dendrogram(Z)

# Marking the cut at max_d with a horizontal black line
plt.axhline(y=max_d, c='k')
plt.show()
Output: [Figure: dendrogram with a horizontal cut line at height 4]
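As a complement to drawing the line, the dendrogram() function also accepts a color_threshold argument, which colors each subtree below the threshold separately and gives all links above it a single color. The sketch below is an assumption about how one might use it here, not part of the original example. It passes no_plot=True so that only the layout is computed; removing that argument draws the colored figure. The 'grey' color and the variable names are illustrative choices.

```python
import numpy as np
from scipy.cluster import hierarchy

# Same sample points as used elsewhere in this article
data = np.array([[1, 4], [2, 2], [3, 7], [4, 6],
                 [5, 1], [6, 3], [8, 10], [9, 11]])

Z = hierarchy.linkage(data, method='average')
max_d = 4

# Links at or above max_d get above_threshold_color; each subtree
# below max_d gets its own color from the default palette
dn = hierarchy.dendrogram(Z, color_threshold=max_d,
                          above_threshold_color='grey', no_plot=True)

# 'color_list' holds one color per link in the tree
links_above = dn['color_list'].count('grey')
links_below = len(dn['color_list']) - links_above
print(links_above, links_below)
```

Each differently colored subtree then corresponds to one cluster at the chosen cut-off, which makes the result of the cut easier to read than counting line crossings.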