A Quantile is where a sample is divided into equal-sized, adjacent, subgroups.
The median is a quantile; the median is placed in a probability distribution so that exactly half of the data is lower than the median and half of the data is above the median. The median cuts a distribution into two equal areas and so it is sometimes called 2-quantile.
Quartiles are also quantiles; they divide the distribution into four equal parts.
Percentiles are quantiles that divide a distribution into 100 equal parts and deciles are quantiles that divide a distribution into 10 equal parts.
We can use the following formula to estimate the ith observation:
ith observation = q (n + 1)
where q is the quantile, the proportion below the ith value that you are looking for
n is the number of items in a data set.
So for finding Quantile rank, q should be 0.25 as we want to divide our data set into 4 equal parts and rank the values from 0-3 based on which quartile they fall upon.
And similarly for Decile rank, q should be 0.1 as we want our data set to be divided into 10 equal parts.
Before moving to Pandas, lets us try the above concept on an example to understand how our Quantile and Decile Ranks are calculated.
Sample question : Find the number in the following set of data where 25 percent of values fall below it, and 75 percent fall above.
Data : 32, 47, 55, 62, 74, 77, 86
Step 1: Order the data from smallest to largest. The data in the question is already in ascending order.
Step 2: Count how many observations you have in your data set. this particular data set has 7 items.
Step 3: Convert any percentage to a decimal for “q”. We are looking for the number where 25 percent of the values fall below it, so convert that to .25.
Step 4: Insert your values into the formula:
Answer:
ith observation = q (n + 1)
ith observation = .25(7 + 1) = 2
The ith observation is at 2. The 2nd number in the set is 47, which is the number where 25 percent of the values fall below it. And then we can start ranking our numbers from 0-3 since we are finding Quantile Rank. Similar approach for finding Decile Rank, in this case its just that the value of q will be 0.1.
Now let us now look in Pandas how we can quickly achieve the same.
Code for Creating a DataFrame:
python3
# Import pandas import pandas as pd # Create a DataFrame df1 = { 'Name' :[ 'George' , 'Andrea' , 'John' , 'Helen' , 'Ravi' , 'Julia' , 'Justin' ], 'EnglishScore' :[ 62 , 47 , 55 , 74 , 32 , 77 , 86 ]} df1 = pd.DataFrame(df1, columns = [ 'Name' , '']) # Sorting the DataFrame in Ascending Order of English Score df1.sort_values(by = [ 'EnglishScore' ], inplace = True ) |
If we print the above dataframe we get the below result :
Now we can find the Quantile Rank using the pandas function qcut() by passing the column name which is to be considered for the Rank, the value for parameter q which signifies the Number of quantiles. 10 for deciles, 4 for quartiles, etc. and labels = False to return the bins as Integers.
Following is code for Quantile Rank
python3
# code df1[ 'QuantileRank' ] = pd.qcut(df1[ 'EnglishScore' ], q = 4 , labels = False ) |
And now if we print the dataframe we can see the new column QauntileRank ranking our data based on EnglishScore column.
Similarly to calculate Decile Rank we set q = 10
python3
# code df1[ 'DecileRank' ] = pd.qcut(df1[ 'EnglishScore' ], q = 10 , labels = False ) |
Now if we print our DataFrame we get the following output.
That is how we can use the Pandas qcut() method to calculate the various Quantiles on a column.
The entire code for the above example is given below.
python3
# code import pandas as pd # Create a DataFrame df1 = { 'Name' :[ 'George' , 'Andrea' , 'John' , 'Helen' , 'Ravi' , 'Julia' , 'Justin' ], 'EnglishScore' :[ 62 , 47 , 55 , 74 , 32 , 77 , 86 ]} df1 = pd.DataFrame(df1, columns = [ 'Name' , 'EnglishScore' ]) # Sorting the DataFrame in Ascending Order of English Score # Sorting just for the purpose of better data readability. df1.sort_values(by = [ 'EnglishScore' ], inplace = True ) # Calculating Quantile Rank df1[ 'QuantileRank' ] = pd.qcut(df1[ 'EnglishScore' ], q = 4 , labels = False ) # Calculating Decile Rank df1[ 'DecileRank' ] = pd.qcut(df1[ 'EnglishScore' ], q = 10 , labels = False ) # printing the dataframe print (df1) |