Python is a great language for doing data analysis, primarily because of the fantastic ecosystem of data-centric Python packages. Pandas is one of those packages and makes importing and analyzing data much easier.
Pandas DataFrame describe()
Pandas describe() is used to view basic statistical details such as the count, mean, standard deviation, and percentiles of a DataFrame or a Series of numeric values. When the method is applied to a Series of strings, it returns a different set of statistics, as shown in the examples below.
Syntax: DataFrame.describe(percentiles=None, include=None, exclude=None)
Parameters:
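- percentiles: list-like of numbers between 0 and 1 giving the percentiles to return. Default is None, which returns the 25th, 50th, and 75th percentiles.
- include: list of data types to include when describing the DataFrame, or 'all' to include every column. Default is None, which summarizes only the numeric columns of a DataFrame that has any.
- exclude: list of data types to exclude when describing the DataFrame. Default is None, which excludes nothing.
Return type: Series or DataFrame containing the statistical summary.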
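To make the parameters concrete, here is a minimal sketch on a small made-up DataFrame. The column names and values are hypothetical and are not taken from the NBA data set used later in the article.

```python
import pandas as pd

# Hypothetical toy data, not taken from nba.csv.
df = pd.DataFrame(
    {
        "player": ["A", "B", "C", "D"],
        "age": [22, 25, 29, 31],
        "weight": [180.0, 205.0, 231.0, 240.0],
    }
)

# No arguments: numeric columns only, with 25%, 50% and 75% rows.
default_summary = df.describe()

# Custom percentiles: 10% and 90% rows appear instead of 25% and 75%.
custom_summary = df.describe(percentiles=[0.1, 0.9])

# include="all" adds the non-numeric column to the summary.
all_summary = df.describe(include="all")

print(default_summary)
print(custom_summary)
print(all_summary)
```

The same include and exclude values are accepted by DataFrame.select_dtypes(), so a list such as ["number"] can be used to pick numeric columns in either place.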
Creating DataFrame for demonstration:
The following examples use a data frame read from nba.csv, a data set containing data on NBA players. Let's have a look at the data by importing it.
Python3
    import pandas as pd

    # reading and printing csv file
    data = pd.read_csv('nba.csv')
    print(data.head())
Output:
            Name            Team  Number  Position   Age  Height  Weight            College     Salary
0  Avery Bradley  Boston Celtics     0.0        PG  25.0     6-2   180.0              Texas  7730337.0
1    Jae Crowder  Boston Celtics    99.0        SF  25.0     6-6   235.0          Marquette  6796117.0
2   John Holland  Boston Celtics    30.0        SG  27.0     6-5   205.0  Boston University        NaN
3    R.J. Hunter  Boston Celtics    28.0        SG  22.0     6-5   185.0      Georgia State  1148640.0
4  Jonas Jerebko  Boston Celtics     8.0        PF  29.0    6-10   231.0                NaN  5000000.0
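If nba.csv is not at hand, the five rows printed by data.head() can be typed in by hand to follow along. This is only a sketch: the variable name sample is an assumption, and statistics computed on five rows will not match the outputs shown for the full file.

```python
import numpy as np
import pandas as pd

# The five rows shown by data.head(), typed in by hand.
sample = pd.DataFrame(
    {
        "Name": ["Avery Bradley", "Jae Crowder", "John Holland",
                 "R.J. Hunter", "Jonas Jerebko"],
        "Team": ["Boston Celtics"] * 5,
        "Number": [0.0, 99.0, 30.0, 28.0, 8.0],
        "Position": ["PG", "SF", "SG", "SG", "PF"],
        "Age": [25.0, 25.0, 27.0, 22.0, 29.0],
        "Height": ["6-2", "6-6", "6-5", "6-5", "6-10"],
        "Weight": [180.0, 235.0, 205.0, 185.0, 231.0],
        "College": ["Texas", "Marquette", "Boston University",
                    "Georgia State", np.nan],
        "Salary": [7730337.0, 6796117.0, np.nan, 1148640.0, 5000000.0],
    }
)

# Column dtypes and missing values per column.
print(sample.dtypes)
print(sample.isna().sum())
```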
Using Describe function in Pandas
Calling describe() on a DataFrame returns several statistical measures at once, including the mean, the median (the 50% row), the standard deviation, and the quartiles. Without arguments, only the numeric columns are summarized here.
Python3
    print(data.describe())
Output:
           Number         Age      Weight        Salary
count  457.000000  457.000000  457.000000  4.460000e+02
mean    17.678337   26.938731  221.522976  4.842684e+06
std     15.966090    4.404016   26.368343  5.229238e+06
min      0.000000   19.000000  161.000000  3.088800e+04
25%      5.000000   24.000000  200.000000  1.044792e+06
50%     13.000000   26.000000  220.000000  2.839073e+06
75%     25.000000   30.000000  240.000000  6.500000e+06
max     99.000000   40.000000  307.000000  2.500000e+07
Explanation of the description of numerical columns:
count: Number of non-null values in the column
mean: Mean of the column values
std: Standard deviation of the column values
min: Minimum value in the column
25%: 25th percentile (first quartile)
50%: 50th percentile (median)
75%: 75th percentile (third quartile)
max: Maximum value in the column
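Each of these rows can be reproduced with a dedicated Series method, which is a handy way to check what describe() reports. The sketch below uses a small made-up Series with one missing value. The values are hypothetical, not the Age column of nba.csv.

```python
import pandas as pd

# Hypothetical values with one missing entry, not taken from nba.csv.
ages = pd.Series([25.0, 25.0, 27.0, 22.0, 29.0, None], name="Age")

summary = ages.describe()

# Rebuild every row of the summary with the matching method.
manual = pd.Series(
    {
        "count": ages.count(),        # non-null values only
        "mean": ages.mean(),
        "std": ages.std(),            # sample standard deviation (ddof=1)
        "min": ages.min(),
        "25%": ages.quantile(0.25),
        "50%": ages.quantile(0.50),   # same as ages.median()
        "75%": ages.quantile(0.75),
        "max": ages.max(),
    },
    name="Age",
)

print(summary)
print(manual)
```

Note that the missing value is skipped: count is 5 although the Series has six entries.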
Pandas describe() behavior for numeric dtypes
In this example, the data frame is described with the list ['object', 'float', 'int'] passed to the include parameter, so that the string (object) columns are summarized alongside the numeric ones. [.20, .40, .60, .80] is passed to the percentiles parameter to view the respective percentiles of the numeric columns.
Python3
    import pandas as pd

    data = pd.read_csv('nba.csv')

    # removing rows with null values
    data.dropna(inplace=True)

    # percentile list
    perc = [.20, .40, .60, .80]

    # list of dtypes to include
    include = ['object', 'float', 'int']

    # calling describe method
    desc = data.describe(percentiles=perc, include=include)

    # display
    print(desc)
Output:
                 Name                  Team      Number  Position         Age  \
count             364                   364  364.000000       364  364.000000
unique            364                    30         NaN         5         NaN
top     Avery Bradley  New Orleans Pelicans         NaN        SG         NaN
freq                1                    16         NaN        87         NaN
mean              NaN                   NaN   16.829670       NaN   26.615385
std               NaN                   NaN   14.994162       NaN    4.233591
min               NaN                   NaN    0.000000       NaN   19.000000
20%               NaN                   NaN    4.000000       NaN   23.000000
40%               NaN                   NaN    9.000000       NaN   25.000000
50%               NaN                   NaN   12.000000       NaN   26.000000
60%               NaN                   NaN   17.000000       NaN   27.000000
80%               NaN                   NaN   30.000000       NaN   30.000000
max               NaN                   NaN   99.000000       NaN   40.000000

        Height      Weight   College        Salary
count      364  364.000000       364  3.640000e+02
unique      17         NaN       115           NaN
top        6-9         NaN  Kentucky           NaN
freq        49         NaN        22           NaN
mean       NaN  219.785714       NaN  4.620311e+06
std        NaN   24.793099       NaN  5.119716e+06
min        NaN  161.000000       NaN  5.572200e+04
20%        NaN  195.000000       NaN  9.472760e+05
40%        NaN  212.000000       NaN  1.638754e+06
50%        NaN  220.000000       NaN  2.515440e+06
60%        NaN  228.000000       NaN  3.429934e+06
80%        NaN  242.400000       NaN  7.838202e+06
max        NaN  279.000000       NaN  2.287500e+07
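As shown in the output, the statistical description of the DataFrame was returned with the passed percentiles. For the string columns, NaN was returned for the numeric statistics (mean, std, min, the percentiles, and max); likewise, the numeric columns show NaN for unique, top, and freq. The output also contains a 50% row although 0.5 was not in the list, because the pandas version used here added the median automatically; pass 0.5 explicitly if you rely on it.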
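When the NaN padding in a mixed summary gets in the way, one option is to describe the numeric and non-numeric columns separately. The sketch below assumes a small made-up DataFrame shaped like the NBA data. The rows are hypothetical and not read from nba.csv.

```python
import pandas as pd

# Hypothetical rows shaped like the NBA data, not read from nba.csv.
df = pd.DataFrame(
    {
        "Team": ["Boston Celtics", "Boston Celtics", "Brooklyn Nets"],
        "Position": ["PG", "SF", "SG"],
        "Age": [25.0, 25.0, 27.0],
    }
)

# Mixed summary: every column gets every row, so some cells are NaN.
mixed = df.describe(include="all")

# Separate summaries avoid the NaN padding.
numeric_only = df.describe(include=["number"])
text_only = df.describe(exclude=["number"])

print(mixed)
print(numeric_only)
print(text_only)
```

Splitting the call this way keeps each table to the statistics that apply to its columns, at the cost of two outputs instead of one.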
Describing series of strings
In this example, the describe() method is called on the Name column to see its behavior with the object data type.
Python3
    # importing pandas module
    import pandas as pd

    # making data frame
    data = pd.read_csv("nba.csv")

    # removing rows with null values
    data.dropna(inplace=True)

    # calling describe method
    desc = data["Name"].describe()

    # display
    print(desc)
The behavior of describe() is different with a series of strings. In this case it returns the count of values, the number of unique values, the most frequent value (top), and its frequency of occurrence (freq).
Output:
count               364
unique              364
top       Avery Bradley
freq                  1
Name: Name, dtype: object
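Because every name in the Name column is distinct, top and freq carry little information there. A minimal sketch with a made-up Series that contains a repeated label and a missing value shows the four statistics more clearly. The team labels are hypothetical.

```python
import pandas as pd

# Hypothetical team labels with a repeat and a missing value.
teams = pd.Series(
    ["Celtics", "Celtics", "Nets", "Knicks", None],
    name="Team",
)

desc = teams.describe()
print(desc)

# The same figures via dedicated methods.
print(teams.count())                 # non-null entries
print(teams.nunique())               # distinct non-null values
print(teams.value_counts().head(1))  # most frequent value and its count
```

Here count is 4 because the missing entry is skipped, unique is 3, top is "Celtics", and freq is 2.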