Violin Plot for Data Analysis

27 July 2024

Before applying any transformations to the features of a dataset, it is often necessary to seek answers to questions like the following:

Are the values primarily clustered around the median?
Alternatively, do they exhibit clustering at the extremes with a dearth of values in the middle range?

These inquiries go beyond median and mean values alone and are essential for obtaining a comprehensive understanding of the dataset. We can use a Violin plot for answering these questions.
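To make the point concrete, here is a minimal sketch using only the Python standard library. The two samples and the helper `share_near_median` are made up for illustration: both samples have a mean and median close to zero, yet one is clustered around the median and the other is clustered at the extremes.

```python
import random
import statistics

random.seed(0)

# Sample 1: values clustered around the median
centered = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Sample 2: two clusters at the extremes, few values in the middle
extremes = ([random.gauss(-2.0, 0.5) for _ in range(500)]
            + [random.gauss(2.0, 0.5) for _ in range(500)])


def share_near_median(values, half_width=0.5):
    """Fraction of values lying within half_width of the median."""
    med = statistics.median(values)
    return sum(abs(v - med) <= half_width for v in values) / len(values)


for name, sample in (("centered", centered), ("extremes", extremes)):
    print(name,
          "mean:", round(statistics.mean(sample), 2),
          "median:", round(statistics.median(sample), 2),
          "share near median:", round(share_near_median(sample), 2))
```

The mean and median alone cannot tell these two samples apart, while the share of values near the median differs sharply. That difference in shape is what a violin plot makes visible.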

What is Violin Plot

A violin plot is a method for visualizing the distribution of numerical data for one or more variables. It is similar to a box plot, but with a rotated density plot on each side, which adds information about the density estimate along the y-axis. The density curve is mirrored and the resulting shape is filled in, creating an image resembling a violin. The advantage of a violin plot is that it can show nuances in the distribution that are not perceptible in a box plot. On the other hand, the box plot shows the outliers in the data more clearly. Although violin plots hold more information than box plots, they are less popular. As a result, readers who are not familiar with the representation may find them harder to interpret.

How to Understand Violin Plot

The violin plot uses kernel density estimation to determine the boundary of the plot. Kernel density estimation (KDE) is a statistical technique for estimating the probability density function (PDF) of a random variable from a set of observed data points. It provides a smooth, continuous estimate of the underlying distribution from which the data is assumed to be generated.
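The idea behind KDE can be sketched in a few lines of standard-library Python. This is an illustrative implementation with a Gaussian kernel, not the code any plotting library actually runs; the helper name `gaussian_kde` and the bandwidth choice (Scott's rule of thumb) are assumptions made for the example.

```python
import math
import random
import statistics


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centered on that observation
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


random.seed(1)
samples = [random.gauss(0.0, 1.0) for _ in range(200)]

# Scott's rule of thumb for the bandwidth (an assumption; other rules exist)
bandwidth = statistics.stdev(samples) * len(samples) ** (-1 / 5)
kde = gaussian_kde(samples, bandwidth)

# Evaluate the estimate on a grid from -5 to 5
step = 0.05
grid = [-5.0 + step * i for i in range(201)]
densities = [kde(x) for x in grid]

# A density should integrate to roughly 1 (simple Riemann sum)
area = sum(d * step for d in densities)
print("approximate area under the curve:", round(area, 3))
```

In a violin plot, the width of the shape at a given value is proportional to this estimated density, drawn symmetrically on both sides of the central axis.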

Violin plot Distribution Explanation

A violin plot consists of four components.

A white dot at the center of the plot – The white dot marks the median of the distribution.
A thin gray bar inside the plot – The bar represents the interquartile range (IQR) of the distribution, from the first quartile (Q1) to the third quartile (Q3).
A long thin line extending out from the bar – The thin line represents the rest of the distribution, bounded by Q1 − 1.5 × IQR at the lower end and Q3 + 1.5 × IQR at the upper end, where IQR = Q3 − Q1. Points lying beyond this line are considered outliers.
An outer boundary enclosing the plot – A KDE curve defines the boundary of the violin plot; it represents the distribution of the data points.
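The quartile-based components above can be computed directly. The following is a small sketch on made-up values, using only the standard library; the `inclusive` quantile method is chosen here as an assumption, and other quantile conventions give slightly different quartiles.

```python
import statistics

# Hypothetical sample values, already sorted for readability
values = [4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.4, 6.7, 7.9, 9.9]

# White dot: the median
median = statistics.median(values)

# Gray bar: from the first quartile (Q1) to the third quartile (Q3)
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Thin line: limits at Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points beyond the limits are treated as outliers
outliers = [v for v in values if v < lower_limit or v > upper_limit]

print("median:", median)
print("Q1, Q3, IQR:", round(q1, 2), round(q3, 2), round(iqr, 2))
print("limits:", round(lower_limit, 2), round(upper_limit, 2))
print("outliers:", outliers)
```

Only the value 9.9 falls outside the limits in this sample, so it is the only point flagged as an outlier.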

Types of Violin Plot

Violin plots can be used for two types of analysis.

Univariate Analysis – In univariate analysis, violin plots are used to visualize the distribution of a single continuous variable. The plot displays the density estimate of the variable's values as a kernel density curve mirrored about a central axis. The width of the violin represents the density of data points at different values, with wider sections indicating higher density.

Code for Univariate violin plot

Python3

import matplotlib.pyplot as plt
import numpy as np
 
# Generate random data
np.random.seed(1)
data = np.random.randn(100)
 
# Create a violin plot
plt.figure()
plt.violinplot(data, showmedians=True)
 
# Set plot labels and title
plt.xlabel('Variable')
plt.ylabel('Value')
plt.title('Univariate Violin Plot')
 
# Show the plot
plt.show()

Output:

Univariate Violin plot

Bivariate Analysis – In bivariate analysis, violin plots are utilized to examine the relationship between a continuous variable and a categorical variable. The categorical variable is represented on the x-axis, while the y-axis represents the values of the continuous variable. By creating separate violins for each category, the plot visualizes the distribution of the continuous variable for different categories.

Python3

import matplotlib.pyplot as plt
import numpy as np
 
# Generate random data
np.random.seed(2)
data1 = np.random.normal(0, 1, 100)
data2 = np.random.normal(2, 1.5, 100)
data3 = np.random.normal(-2, 0.5, 100)
categories = ['Category 1', 'Category 2', 'Category 3']
all_data = [data1, data2, data3]
 
# Create a violin plot
plt.figure()
plt.violinplot(all_data, showmedians=True)
 
# Set plot labels and title
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bivariate Violin Plot')
 
# Set x-axis tick labels
plt.xticks(np.arange(1, len(categories) + 1), categories)
 
# Show the plot
plt.show()

Output:

Bivariate Violin plot

Python Implementation of Violin Plot on Custom Dataset

Loading Libraries

Python3

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot
from sklearn.datasets import load_iris

Loading Data

Python3

# Load the Iris dataset
iris = load_iris()
 
# Create a DataFrame from the features (X) with column names
df = pd.DataFrame(data=iris.data,
                  columns=iris.feature_names)
 
# Add the target variable (y) to the DataFrame
df['target'] = iris.target
 
# Display the first five rows of the DataFrame
print(df.head(5))

Output:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                5.1               3.5                1.4               0.2       0
1                4.9               3.0                1.4               0.2       0
2                4.7               3.2                1.3               0.2       0
3                4.6               3.1                1.5               0.2       0
4                5.0               3.6                1.4               0.2       0

Description of the dataset

Python3

df.describe()

Output:

dataset description

Information About the Dataset

Python3

df.info()

Output:

Dataset description

Describing the ‘sepal length (cm)’ feature of the Iris dataset.

Python3

# Column names come from iris.feature_names
df["sepal length (cm)"].describe()

Output:

count    150.000000
mean       5.843333
std        0.828066
min        4.300000
25%        5.100000
50%        5.800000
75%        6.400000
max        7.900000
Name: sepal length (cm), dtype: float64

Univariate Violin Plot for the ‘sepal length (cm)’ feature.

Python3

fig, ax = pyplot.subplots(figsize=(9, 7))
# Plot the distribution of a single column of df
sns.violinplot(ax=ax, y=df["sepal length (cm)"])

Output:

Violin Plot for ‘sepal length (cm)’

As you can see, the density is highest between 5 and 6, which is consistent with the summary statistics above: the mean is about 5.84 and the median is 5.8.

Univariate Violin Plot for the ‘sepal width (cm)’ feature.

Python3

fig, ax = pyplot.subplots(figsize=(9, 7))
# Plot the distribution of a single column of df
sns.violinplot(ax=ax, y=df["sepal width (cm)"])

Output:

Violin Plot for the ‘sepal width (cm)’ feature

Here too, the density is highest around the mean, approximately 3.05.

Bivariate Violin Plot comparing ‘sepal length (cm)’ and ‘sepal width (cm)’.

Python3

fig, ax = pyplot.subplots(figsize=(9, 7))
# The first two columns of df are sepal length and sepal width
sns.violinplot(ax=ax, data=df.iloc[:, 0:2])

Output:

Violin Plot comparing ‘sepal length (cm)’ and ‘sepal width (cm)’

Bivariate Violin Plot comparing ‘sepal length (cm)’ species-wise.

Python3

fig, ax = pyplot.subplots(figsize=(9, 7))
# Map the numeric target codes (0, 1, 2) to species names
species = pd.Series(iris.target_names[iris.target], name="species")
sns.violinplot(ax=ax, x=species,
               y=df["sepal length (cm)"])

Output:

Violin Plot comparing ‘sepal length (cm)’ species-wise
