Anscombe’s quartet comprises a set of four dataset, having identical descriptive statistical properties in terms of means, variance, R-Squared, correlations, and linear regression lines but having different representations when we scatter plot on graph. The datasets were created by the statistician Francis Anscombe in 1973 to demonstrate the importance of visualizing data and to show that summary statistics alone can be misleading.
The four datasets that make up Anscombe’s quartet each include 11 x-y pairs of data. When plotted, each dataset seems to have a unique connection between x and y, with unique variability patterns and distinctive correlation strengths. Despite these variations, each dataset has the same summary statistics, such as the same x and y mean and variance, x and y correlation coefficient, and linear regression line.
Anscombe’s quartet is used to illustrate the importance of exploratory data analysis and the drawbacks of depending only on summary statistics. It also emphasizes the importance of using data visualization to spot trends, outliers, and other crucial details that might not be obvious from summary statistics alone.
The four datasets of Anscombe’s quartet.
The above dataset can be downloaded from the link [ https://query.data.world/s/6p2ntncvkzj5mnvbpkaswfilryvnrk ]
Anscombe’s quartet Implementations
Import the necessary libraries
Python3
# Import the necessary libraries import numpy as np import pandas as pd import matplotlib.pyplot as plt |
Load the dataset
Python3
print (df) |
output:
x1 x2 x3 x4 y1 y2 y3 y4 0 10 10 10 8 8.04 9.14 7.46 6.58 1 8 8 8 8 6.95 8.14 6.77 5.76 2 13 13 13 8 7.58 8.74 12.74 7.71 3 9 9 9 8 8.81 8.77 7.11 8.84 4 11 11 11 8 8.33 9.26 7.81 8.47 5 14 14 14 8 9.96 8.10 8.84 7.04 6 6 6 6 8 7.24 6.13 6.08 5.25 7 4 4 4 19 4.26 3.10 5.39 12.50 8 12 12 12 8 10.84 9.13 8.15 5.56 9 7 7 7 8 4.82 7.26 6.42 7.91 10 5 5 5 8 5.68 4.74 5.73 6.89
Find the descriptive statistical properties for the all four dataset
- Find mean for x and y for all four datasets.
- Find standard deviations for x and y for all four datasets.
- Find correlations with their corresponding pair of each datasets.
- Find slope and intercept for each datasets.
- Find R-square for each datasets.
- To find R-square first find residual sum of square error and Total sum of square error
- Create a statistical summary by using all these data and print it.
Python3
# mean values (x-bar) x1_mean = df[ 'x1' ].mean() x2_mean = df[ 'x2' ].mean() x3_mean = df[ 'x3' ].mean() x4_mean = df[ 'x4' ].mean() # y-bar y1_mean = df[ 'y1' ].mean() y2_mean = df[ 'y2' ].mean() y3_mean = df[ 'y3' ].mean() y4_mean = df[ 'y4' ].mean() # Standard deviation values (x-bar) x1_std = df[ 'x1' ].std() x2_std = df[ 'x2' ].std() x3_std = df[ 'x3' ].std() x4_std = df[ 'x4' ].std() # Standard deviation values (y-bar) y1_std = df[ 'y1' ].std() y2_std = df[ 'y2' ].std() y3_std = df[ 'y3' ].std() y4_std = df[ 'y4' ].std() # Correlation correlation_x1y1 = np.corrcoef(df[ 'x1' ],df[ 'y1' ])[ 0 , 1 ] correlation_x2y2 = np.corrcoef(df[ 'x2' ],df[ 'y2' ])[ 0 , 1 ] correlation_x3y3 = np.corrcoef(df[ 'x3' ],df[ 'y3' ])[ 0 , 1 ] correlation_x4y4 = np.corrcoef(df[ 'x4' ],df[ 'y4' ])[ 0 , 1 ] # Linear Regression slope and intercept m1,c1 = np.polyfit(df[ 'x1' ], df[ 'y1' ], 1 ) m2,c2 = np.polyfit(df[ 'x2' ], df[ 'y2' ], 1 ) m3,c3 = np.polyfit(df[ 'x3' ], df[ 'y3' ], 1 ) m4,c4 = np.polyfit(df[ 'x4' ], df[ 'y4' ], 1 ) # Residual sum of squares error RSSY_1 = ((df[ 'y1' ] - (m1 * df[ 'x1' ] + c1)) * * 2 ). sum () RSSY_2 = ((df[ 'y2' ] - (m2 * df[ 'x2' ] + c2)) * * 2 ). sum () RSSY_3 = ((df[ 'y3' ] - (m3 * df[ 'x3' ] + c3)) * * 2 ). sum () RSSY_4 = ((df[ 'y4' ] - (m4 * df[ 'x4' ] + c4)) * * 2 ). sum () # Total sum of squares TSS_1 = ((df[ 'y1' ] - y1_mean) * * 2 ). sum () TSS_2 = ((df[ 'y2' ] - y2_mean) * * 2 ). sum () TSS_3 = ((df[ 'y3' ] - y3_mean) * * 2 ). sum () TSS_4 = ((df[ 'y4' ] - y4_mean) * * 2 ). sum () # R squared (coefficient of determination) R2_1 = 1 - (RSSY_1 / TSS_1) R2_2 = 1 - (RSSY_2 / TSS_2) R2_3 = 1 - (RSSY_3 / TSS_3) R2_4 = 1 - (RSSY_4 / TSS_4) # Create a pandas dataframe to represent the summary statistics summary_stats = pd.DataFrame({ 'Mean_x' : [x1_mean, x2_mean, x3_mean, x4_mean], 'Variance_x' : [x1_std * * 2 , x2_std * * 2 , x3_std * * 2 , x4_std * * 2 ], 'Mean_y' : [y1_mean, y2_mean, y3_mean, y4_mean], 'Variance_y' : [y1_std * * 2 , y2_std * * 2 , y3_std * * 2 , y4_std * * 2 ], 'Correlation' : [correlation_x1y1, correlation_x2y2, correlation_x3y3, correlation_x4y4], 'Linear Regression slope' : [m1, m2, m3, m4], 'Linear Regression intercept' : [c1, c2, c3, c4]}, index = [ 'I' , 'II' , 'III' , 'IV' ]) print (summary_stats.T) |
Output:
I II III IV Mean_x 9.000000 9.000000 9.000000 9.000000 Variance_x 11.000000 11.000000 11.000000 11.000000 Mean_y 7.500909 7.500909 7.500000 7.500909 Variance_y 4.127269 4.127629 4.122620 4.123249 Correlation 0.816421 0.816237 0.816287 0.816521 Linear Regression slope 0.500091 0.500000 0.499727 0.499909 Linear Regression intercept 3.000091 3.000909 3.002455 3.001727
Plot the scatter plot and linear regression line for each datasets
Python3
# plot all four plots fig, axs = plt.subplots( 2 , 2 , figsize = ( 18 , 12 ), dpi = 500 ) axs[ 0 , 0 ].set_title( 'Dataset I' , fontsize = 20 ) axs[ 0 , 0 ].set_xlabel( 'X' , fontsize = 13 ) axs[ 0 , 0 ].set_ylabel( 'Y' , fontsize = 13 ) axs[ 0 , 0 ].plot(df[ 'x1' ], df[ 'y1' ], 'go' ) axs[ 0 , 0 ].plot(df[ 'x1' ], m1 * df[ 'x1' ] + c1, 'r' ,label = 'Y=' + str ( round (m1, 2 )) + 'x +' + str ( round (c1, 2 ))) axs[ 0 , 0 ].legend(loc = 'best' ,fontsize = 16 ) axs[ 0 , 1 ].set_title( 'Dataset II' ,fontsize = 20 ) axs[ 0 , 1 ].set_xlabel( 'X' , fontsize = 13 ) axs[ 0 , 1 ].set_ylabel( 'Y' , fontsize = 13 ) axs[ 0 , 1 ].plot(df[ 'x2' ], df[ 'y2' ], 'go' ) axs[ 0 , 1 ].plot(df[ 'x2' ], m2 * df[ 'x2' ] + c2, 'r' ,label = 'Y=' + str ( round (m2, 2 )) + 'x +' + str ( round (c2, 2 ))) axs[ 0 , 1 ].legend(loc = 'best' ,fontsize = 16 ) axs[ 1 , 0 ].set_title( 'Dataset III' ,fontsize = 20 ) axs[ 1 , 0 ].set_xlabel( 'X' , fontsize = 13 ) axs[ 1 , 0 ].set_ylabel( 'Y' , fontsize = 13 ) axs[ 1 , 0 ].plot(df[ 'x3' ], df[ 'y3' ], 'go' ) axs[ 1 , 0 ].plot(df[ 'x3' ], m1 * df[ 'x3' ] + c1, 'r' ,label = 'Y=' + str ( round (m3, 2 )) + 'x +' + str ( round (c3, 2 ))) axs[ 1 , 0 ].legend(loc = 'best' ,fontsize = 16 ) axs[ 1 , 1 ].set_title( 'Dataset IV' ,fontsize = 20 ) axs[ 1 , 1 ].set_xlabel( 'X' , fontsize = 13 ) axs[ 1 , 1 ].set_ylabel( 'Y' , fontsize = 13 ) axs[ 1 , 1 ].plot(df[ 'x4' ], df[ 'y4' ], 'go' ) axs[ 1 , 1 ].plot(df[ 'x4' ], m4 * df[ 'x4' ] + c4, 'r' ,label = 'Y=' + str ( round (m4, 2 )) + 'x +' + str ( round (c4, 2 ))) axs[ 1 , 1 ].legend(loc = 'best' ,fontsize = 16 ) plt.show() |
Output:
Note: It is mentioned in the definition that Anscombe’s quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed.
Explanation of this output:
- In the first one(top left) if you look at the scatter plot you will see that there seems to be a linear relationship between x and y.
- In the second one(top right) if you look at this figure you can conclude that there is a non-linear relationship between x and y.
- In the third one(bottom left) you can say when there is a perfect linear relationship for all the data points except one which seems to be an outlier which is indicated be far away from that line.
- Finally, the fourth one(bottom right) shows an example when one high-leverage point is enough to produce a high correlation coefficient.
Application:
The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistic properties for describing realistic datasets.