A standard deviation plot is used to check if there is a deviation between different groups of data. These groups can be generated manually or can be decided based on some property of the dataset.
Standard deviation plots can be formed of :
- Vertical Axis: Group Standard deviation
- Horizontal Axis: Group Identifier/ Label of the groups.
A reference straight line is plotted among the overall standard deviation.
The standard deviation plot is used to answer the following questions:
- Is there any shift in the variation?
- What is the magnitude of the shift in the variation?
- Is there any distinct pattern in the shift of the variation?
A standard deviation plot is generally used to measure the scale, the same scale measure can also be used to find with mean absolute plot and average deviation plot. These plots also provide better accuracy in terms of identifying outliers.
Uses of Standard Deviation plot
- A standard deviation plot is generally used to measure the scale, the same scale measure can also be found with mean absolute plot and average deviation plot. These plots also provide better accuracy in terms of identifying outliers.
- A common assumption in many analyses such as 1-factor analysis that the variance is the same for different levels of factor variables. A standard deviation plot can be used to verify that.
- We can also verify the constant variance assumptions of univariate data by dividing the data into equal size partitions and plotting variance for each of the partitions.
Implementation
- In this implementation, we use the Delhi weather dataset from Kaggle. The link to the dataset can be found here
Python3
# import necessary modules import matplotlib.pyplot as plt import pandas as pd import seaborn as sns sns.set_style( 'darkgrid' ) % matplotlib inline sns.mpl.rcParams[ 'figure.figsize' ] = ( 10.0 , 8.0 ) # read weather dataset df = pd.read_csv( 'weather.csv' ) # remove the hours and minutes from time to keep date only df[ 'datetime_utc' ] = pd.to_datetime(df[ 'datetime_utc' ]).dt.date df.head() # group by dataframe into months, calculate standard deviation, # and sort them in chronological order month_Df = df.groupby(df[ 'datetime_utc' ].dt.strftime( '%B' ))[ " _tempm" ].std() new_order = [ 'January' , 'February' , 'March' , 'April' , 'May' , 'June' , 'July' , 'August' , 'September' , 'October' , 'November' , 'December' ] month_Df = month_Df.reindex(new_order) month_Df # plot scatterplot of the standard deviation (standard deviation plot) graph = sns.scatterplot(y = month_Df.values, x = month_Df.index) graph.axhline(df[ " _tempm" ].std(), color = 'red' ) plt.show() |
datetime_utc _conds _dewptm _fog _hail _heatindexm _hum _precipm _pressurem _rain _snow _tempm _thunder _tornado _vism _wdird _wdire _wgustm _windchillm _wspdm 0 1996-11-01 Smoke 9.0 0 0 NaN 27.0 NaN 1010.0 0 0 30.0 0 0 5.0 280.0 West NaN NaN 7.4 1 1996-11-01 Smoke 10.0 0 0 NaN 32.0 NaN -9999.0 0 0 28.0 0 0 NaN 0.0 North NaN NaN NaN 2 1996-11-01 Smoke 11.0 0 0 NaN 44.0 NaN -9999.0 0 0 24.0 0 0 NaN 0.0 North NaN NaN NaN 3 1996-11-01 Smoke 10.0 0 0 NaN 41.0 NaN 1010.0 0 0 24.0 0 0 2.0 0.0 North NaN NaN NaN 4 1996-11-01 Smoke 11.0 0 0 NaN 47.0 NaN 1011.0 0 0 23.0 0 0 1.2 0.0 North NaN NaN 0.0
datetime_utc April 5.817769 August 2.928722 December 5.288852 February 5.404892 January 4.646874 July 3.394908 June 4.520245 March 5.905230 May 5.441476 November 5.556417 October 4.930381 September 3.437260 Name: _tempm, dtype: float64
Implementation