Before applying any transformations to the features of a dataset, it is often necessary to seek answers to questions like the following:
- Are the values primarily clustered around the median?
- Alternatively, do they exhibit clustering at the extremes with a dearth of values in the middle range?
These questions go beyond what the median and mean alone can tell us, and answering them is essential for a thorough understanding of the dataset. A violin plot can answer them.
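To illustrate why a single summary value cannot answer these questions, here is a minimal sketch with two synthetic samples. The sample sizes, seed, and distribution parameters are assumptions chosen only for illustration. Both samples have a mean close to zero, yet one is clustered around the centre and the other at the extremes.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for this illustration

# Hypothetical sample A: values clustered around the centre
clustered = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Hypothetical sample B: values clustered at two extremes, few in the middle
bimodal = np.concatenate([
    rng.normal(loc=-3.0, scale=0.5, size=5_000),
    rng.normal(loc=3.0, scale=0.5, size=5_000),
])


def share_near_centre(values, half_width=1.0):
    """Fraction of values within half_width of the sample mean."""
    return np.mean(np.abs(values - values.mean()) <= half_width)


for name, sample in [("clustered", clustered), ("bimodal", bimodal)]:
    print(f"{name:>9}: mean={sample.mean():+.2f}, "
          f"share within 1 unit of mean={share_near_centre(sample):.2f}")
```

The means are nearly identical, but the share of values near the centre differs sharply. That difference in shape is what a violin plot makes visible.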
What is a Violin Plot
A violin plot is a method for visualizing the distribution of numerical data for one or more variables. It is similar to a box plot, but adds a rotated kernel density plot on each side, which shows how dense the data is at each value along the y-axis. The density curve is mirrored and the resulting shape is filled in, creating an image that resembles a violin. The advantage of a violin plot is that it can show nuances in the distribution that are not perceptible in a box plot. On the other hand, a box plot shows the outliers in the data more clearly. Although violin plots hold more information than box plots, they are less popular, so readers who are unfamiliar with them may find them harder to interpret.
How to Understand a Violin Plot
The violin plot uses kernel density estimation to determine the outline of the plot. Kernel density estimation (KDE) is a statistical technique for estimating the probability density function (PDF) of a random variable from a set of observed data points. It provides a smooth, continuous estimate of the underlying distribution from which the data is assumed to have been generated.
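As a rough illustration of the idea, the sketch below evaluates a Gaussian KDE by hand with NumPy. The helper name `gaussian_kde_1d`, the synthetic sample, and the use of Scott's rule of thumb for the bandwidth are all assumptions made for this example; plotting libraries may choose the kernel and bandwidth differently.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance between every grid point and every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Average the kernels and rescale so the estimate integrates to 1
    return kernel.mean(axis=1) / bandwidth


rng = np.random.default_rng(1)
samples = rng.normal(loc=0.0, scale=1.0, size=200)  # hypothetical data

# Scott's rule of thumb for the bandwidth (an assumption; other rules exist)
bandwidth = samples.std(ddof=1) * len(samples) ** (-1 / 5)

grid = np.linspace(samples.min() - 3 * bandwidth,
                   samples.max() + 3 * bandwidth, 400)
density = gaussian_kde_1d(samples, grid, bandwidth)

# Riemann-sum approximation of the area under the estimated curve
area = density.sum() * (grid[1] - grid[0])
print(f"bandwidth={bandwidth:.3f}, area under curve={area:.3f}")
```

The width of a violin at a given value corresponds to the height of a curve like `density` at that value, drawn once on each side of the central axis.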
A violin plot consists of four components.
- A white dot at the centre of the plot – The white dot marks the median of the distribution.
- A thin gray bar inside the plot – The bar represents the interquartile range (IQR) of the distribution, from the first quartile (Q1) to the third quartile (Q3).
- A long thin line extending from the bar – The thin line represents the rest of the distribution. It reaches the most extreme data points that still lie within Q1 − 1.5 × IQR at the lower end and Q3 + 1.5 × IQR at the upper end. Points lying beyond these limits are considered outliers.
- The outline of the plot – A KDE curve defines the boundary of the violin. Its width at each value represents the density of data points there.
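The quantities behind the first three components can be computed directly. The following is a minimal sketch on a synthetic sample (the data and variable names are assumptions for illustration), using the 1.5 × IQR rule described above.

```python
import numpy as np

rng = np.random.default_rng(2)
values = rng.normal(loc=50.0, scale=10.0, size=500)  # hypothetical sample

# Median (white dot) and quartiles (ends of the thin gray bar)
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Limits given by the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# The thin line stops at the most extreme observations inside the limits
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the limits is treated as an outlier
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"median={median:.2f}, Q1={q1:.2f}, Q3={q3:.2f}, IQR={iqr:.2f}")
print(f"thin line from {whisker_low:.2f} to {whisker_high:.2f}")
print(f"number of outliers: {len(outliers)}")
```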
Types of Violin Plot
Violin plots can be used for two types of analysis.
- Univariate Analysis – In univariate analysis, violin plots are used to visualize the distribution of a single continuous variable. The plot displays a kernel density estimate of the variable’s values, mirrored about a central axis. The width of the violin represents the density of data points at different values, with wider sections indicating higher density.
Code for Univariate violin plot
Python3
import matplotlib.pyplot as plt
import numpy as np

# Generate random data
np.random.seed(1)
data = np.random.randn(100)

# Create a violin plot
plt.figure()
plt.violinplot(data, showmedians=True)

# Set plot labels and title
plt.xlabel('Variable')
plt.ylabel('Value')
plt.title('Univariate Violin Plot')

# Show the plot
plt.show()
Output:
- Bivariate Analysis – In bivariate analysis, violin plots are utilized to examine the relationship between a continuous variable and a categorical variable. The categorical variable is represented on the x-axis, while the y-axis represents the values of the continuous variable. By creating separate violins for each category, the plot visualizes the distribution of the continuous variable for different categories.
Python3
import matplotlib.pyplot as plt
import numpy as np

# Generate random data
np.random.seed(2)
data1 = np.random.normal(0, 1, 100)
data2 = np.random.normal(2, 1.5, 100)
data3 = np.random.normal(-2, 0.5, 100)
categories = ['Category 1', 'Category 2', 'Category 3']
all_data = [data1, data2, data3]

# Create a violin plot
plt.figure()
plt.violinplot(all_data, showmedians=True)

# Set plot labels and title
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bivariate Violin Plot')

# Set x-axis tick labels
plt.xticks(np.arange(1, len(categories) + 1), categories)

# Show the plot
plt.show()
Output:
Python Implementation of Violin Plot on Custom Dataset
Loading Libraries
Python3
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot
from sklearn.datasets import load_iris
Loading Data
Python3
# Load the Iris dataset
iris = load_iris()

# Create a DataFrame from the features (X) with column names
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)

# Add the target variable (y) to the DataFrame
df['target'] = iris.target

# Display the first five rows of the DataFrame
print(df.head(5))
Output:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0
Description of the dataset
Python3
df.describe()
Output:
Information About the Dataset
Python3
df.info()
Output:
Describing the ‘sepal length (cm)’ feature of the Iris dataset.
Python3
# The DataFrame built from load_iris() names this column 'sepal length (cm)'
df["sepal length (cm)"].describe()
Output:
count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: sepal length (cm), dtype: float64
Univariate Violin Plot for the ‘sepal length (cm)’ feature.
Python3
fig, ax = pyplot.subplots(figsize=(9, 7))
# Use the df DataFrame created above and its actual column name
sns.violinplot(ax=ax, y=df["sepal length (cm)"])
Output:
As the plot shows, the density is highest between 5 and 6. This agrees with the summary statistics for sepal length above, where the median is 5.8 and the mean is about 5.84.
Univariate Violin Plot for the ‘sepal width (cm)’ feature.
Python3
fig, ax = pyplot.subplots(figsize=(9, 7))
# Use the df DataFrame created above and its actual column name
sns.violinplot(ax=ax, y=df["sepal width (cm)"])
Output:
Here too, the density is highest near the mean, which is about 3.05.
Bivariate Violin Plot comparing ‘sepal length (cm)’ and ‘sepal width (cm)’.
Python3
fig, ax = pyplot.subplots(figsize=(9, 7))
# Select the two sepal columns by name; seaborn draws one violin per column
sns.violinplot(ax=ax, data=df[["sepal length (cm)", "sepal width (cm)"]])
Output:
Bivariate Violin Plot comparing ‘sepal length (cm)’ species-wise.
Python3
fig, ax = pyplot.subplots(figsize=(9, 7))
# 'target' encodes the species: 0 = setosa, 1 = versicolor, 2 = virginica
sns.violinplot(ax=ax, x=df["target"], y=df["sepal length (cm)"])
Output: