This article was published as a part of the Data Science Blogathon
Introduction
Feature Selection is the process of selecting the features which are relevant to a machine learning model. It means that you select only those attributes that have a significant effect on the model’s output.
Consider the case when you go to the departmental store to buy grocery items. A product has a lot of information on it, i.e., product, category, expiry date, MRP, ingredients, and manufacturing details. All this information is the features of the product. Normally, you check the brand, MRP, and expiry date before buying a product. However, the ingredient and manufacturing section is not your concern. Therefore, brand, MRP, expiry date are relevant features, and the ingredient, manufacturing details are irrelevant. This is how feature selection is done.
In the real world, a dataset can have thousands of features, some of which may be redundant, correlated with one another, or irrelevant to the model. If you use all of them, training takes longer and model accuracy may suffer. Therefore, feature selection becomes important in model building. There are many ways to select features, such as recursive feature elimination, genetic algorithms, and decision trees. However, I will tell you about the most basic and manual method: filtering using statistical tests.
Now that you have a basic understanding of feature selection, we will see how to implement various statistical tests on the data to select important features.
Objective
The main objective of this blog is to understand statistical tests and their implementation on real data in Python, which will help in feature selection.
Terminologies
Before going into the types of statistical tests and their implementation, it is necessary to understand the meanings of some terminologies.
Hypothesis Testing
Hypothesis Testing in statistics is a method to test the results of experiments or surveys to see if you have meaningful results. It is useful when you want to draw inferences about a population from a sample, or about the relationship between two or more samples.
Null Hypothesis
This hypothesis states that there is no significant difference between sample and population or among different populations. It is denoted by H0.
Ex – We assume that the mean of 2 samples is equal.
Alternate Hypothesis
The statement contrary to the null hypothesis comes under the alternate hypothesis. It is denoted by H1.
Ex – We assume that the mean of the 2 samples is unequal.
Critical Value
It is a point on the scale of the test statistic beyond which the null hypothesis is rejected. The further the test statistic falls beyond the critical value, the weaker the support for the 2 samples belonging to the same distribution. The critical value for any test can
p-value
p-value stands for ‘probability value’. It is the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true. The p-value is used in hypothesis testing to decide whether to reject the null hypothesis: the smaller the p-value, the stronger the evidence against the null hypothesis.
Degree of freedom
The degrees of freedom are the number of independent values that are free to vary when a statistic is calculated. This concept is used in calculating the t statistic and the chi-square statistic.
You may refer to statisticswho.com for more information regarding these terminologies.
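To make the terms above concrete, here is a minimal sketch that looks up critical values with scipy.stats instead of printed tables. The sample size of 20 and the 2 x 2 table shape are made-up values chosen only for illustration.

```python
from scipy.stats import norm, t, chi2

alpha = 0.05  # significance level (assumed here; a common default)

# Two-sided critical value for a Z statistic: reject H0 when |z| exceeds it
z_crit = norm.ppf(1 - alpha / 2)

# Two-sided critical value for a one-sample t statistic.
# n = 20 is a made-up sample size, so the degrees of freedom are n - 1 = 19
n = 20
t_crit = t.ppf(1 - alpha / 2, df=n - 1)

# Critical value for a chi-square test on a 2 x 2 table:
# degrees of freedom = (2 - 1) * (2 - 1) = 1
chi2_crit = chi2.ppf(1 - alpha, df=1)

print(round(z_crit, 3), round(t_crit, 3), round(chi2_crit, 3))
```

Note that the t critical value is larger than the Z critical value for the same significance level, which reflects the extra uncertainty of a small sample.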
Statistical Tests
A statistical test is a way to decide whether the data support the null hypothesis or the alternate hypothesis. It tells you whether a sample and a population, or two or more samples, differ significantly. You can compare various descriptive statistics such as the mean, median, mode, range, or standard deviation for this purpose. However, we generally use the mean. The statistical test gives you a test statistic, from which a p-value is calculated and compared with the significance level (usually 0.05). If the p-value is greater than the significance level you accept the null hypothesis, else you reject it.
The procedure for implementing each statistical test will be as follows:
- We calculate the statistic value using the mathematical formula
- We then find the critical value for the chosen significance level using statistical tables
- From the statistic value and its distribution, we calculate the p-value
- If p-value > 0.05 we accept the null hypothesis, else we reject it
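The steps above can be sketched in a few lines for a two-sided Z statistic. The helper name `decide` and the input value 1.828 are illustrative assumptions, and scipy.stats replaces the statistical tables.

```python
from scipy.stats import norm

def decide(z_statistic, alpha=0.05):
    """Two-sided decision for a Z statistic (hypothetical helper)."""
    # Critical value that corresponds to the chosen significance level
    critical_value = norm.ppf(1 - alpha / 2)
    # p-value: probability of a statistic at least this extreme if H0 is true
    p_value = 2 * norm.sf(abs(z_statistic))
    # Rejecting when p < alpha is equivalent to |z| > critical value
    reject = p_value < alpha
    return p_value, critical_value, reject

p_value, critical_value, reject = decide(1.828)
print('p =', p_value, 'critical value =', critical_value)
print('reject H0' if reject else 'accept H0')
```

Comparing the p-value with alpha and comparing the statistic with the critical value always lead to the same decision, so in practice only one of the two comparisons is needed.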
Now that you have an understanding of feature selection and statistical tests, we can move towards the implementation of various statistical tests along with their meaning. Before that, I will show you the dataset that will be used to perform all the tests.
Dataset
The dataset I will be using is a loan prediction dataset taken from an Analytics Vidhya contest. You can also participate in the contest and download the dataset here.
First, I imported all the necessary Python modules and the dataset.
There are many features in the dataset such as Gender, Dependents, Education, Applicant Income, Loan Amount, and Credit History. We will use these features to check whether one feature affects another using several tests, i.e., the Z-test, correlation test, ANOVA test, and Chi-square test.
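The import step is not shown in the article, so the following is only a sketch of what it might look like. The helper `load_loan_data` and the file name "train.csv" are assumptions, not the contest's actual file name. The plotting call later in the article additionally needs seaborn (`import seaborn as sns`).

```python
from math import sqrt

import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import norm

def load_loan_data(path):
    """Read the loan prediction CSV into a DataFrame (hypothetical helper)."""
    return pd.read_csv(path)

# Example usage; "train.csv" is a placeholder for wherever you saved the file:
# df = load_loan_data("train.csv")
# print(df.shape)           # number of rows and columns
# print(df.isnull().sum())  # missing values per feature
```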
Z-Test
A Z-test is used to compare the mean of two given samples and infer whether they are from the same distribution or not. We do not implement Z-test when the sample size is less than 30.
A Z-Test may be a one-sample Z test or a two-sample Z test.
The one-sample Z-test determines whether the sample mean is statistically different from a known or hypothesized population mean. The two-sample Z-test compares the means of 2 independent samples.
We will implement a two-sample Z test.
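The Z statistic for two samples is given by
$$ z = \frac{(\bar{X}_1 - \bar{X}_2) - \Delta}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} $$
where $\bar{X}_1$ and $\bar{X}_2$ are the sample means, $\sigma_1$ and $\sigma_2$ the standard deviations, $n_1$ and $n_2$ the sample sizes, and $\Delta$ the hypothesized difference between the means (0 under the null hypothesis of equal means).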
Implementation
Please note that we will implement a two-sample Z-test in which one variable is categorical with two categories and the other is continuous.
Here we will be using the Gender categorical variable and the ApplicantIncome continuous variable. Gender has 2 groups: male and female. Therefore the hypothesis will be:
Null Hypothesis: There is no significant difference between the mean Income of males and females.
Alternate Hypothesis: There is a significant difference between the mean Income of males and females.
Code
M_mean=df.loc[df['Gender']=='Male','ApplicantIncome'].mean()
F_mean=df.loc[df['Gender']=='Female','ApplicantIncome'].mean()
M_std=df.loc[df['Gender']=='Male','ApplicantIncome'].std()
F_std=df.loc[df['Gender']=='Female','ApplicantIncome'].std()
no_of_M=df.loc[df['Gender']=='Male','ApplicantIncome'].count()
no_of_F=df.loc[df['Gender']=='Female','ApplicantIncome'].count()
The above code calculates the mean applicant income of males, the mean applicant income of females, their standard deviations, and the number of samples of males and females.
The twoSampZ function will calculate the z statistic and p-value by passing the input parameters calculated above.
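from math import sqrt
from scipy.stats import norm

def twoSampZ(X1, X2, mudiff, sd1, sd2, n1, n2):
    # standard error of the difference between the two sample means
    pooledSE = sqrt(sd1**2/n1 + sd2**2/n2)
    z = ((X1 - X2) - mudiff)/pooledSE
    # two-sided p-value from the standard normal distribution
    pval = 2*(1 - norm.cdf(abs(z)))
    return round(z,3), pval

z,p= twoSampZ(M_mean,F_mean,0,M_std,F_std,no_of_M,no_of_F)
print('Z =', z)
print('p =', p)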
Z = 1.828
p = 0.06759726635832197
if p<0.05:
    print("we reject the null hypothesis")
else:
    print("we accept the null hypothesis")
we accept the null hypothesis
Since the p-value is greater than 0.05, we cannot reject the null hypothesis. Therefore, we conclude that there is no statistically significant difference between the mean income of males and females.
T-Test
A t-test is also used to compare the mean of two given samples, like the Z-test. However, it is implemented when the sample size is less than 30. It assumes a normal distribution of the sample. It can also be one-sample or two-sample. For a one-sample test, the degrees of freedom are calculated as n-1, where n is the number of observations in the sample.
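For a one-sample test, the t statistic is given by
$$ t = \frac{\bar{X} - \mu}{s/\sqrt{n}} $$
where $\bar{X}$ is the sample mean, $\mu$ the hypothesized population mean, $s$ the sample standard deviation, and $n$ the sample size.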
Implementation
It is implemented in the same way as the Z-test. The only condition is that the sample size should be less than 30. I have shown you the Z-test implementation. Now, you can try your hand at the T-test.
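As a starting point, here is a minimal sketch of a two-sample t-test using scipy.stats.ttest_ind. The two income samples are made-up numbers, not taken from the loan dataset, and `equal_var=False` selects Welch's variant, which is an assumption on my part because it uses the same standard error as the Z-test code above.

```python
import numpy as np
from scipy import stats

# Made-up incomes for two small groups (fewer than 30 observations each)
group_a = np.array([5200, 4800, 6100, 5500, 4900, 5300, 6000, 5100])
group_b = np.array([4500, 4700, 5200, 4300, 4900, 4600, 5000])

# Welch's two-sample t-test: does not assume equal variances
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# The same statistic by hand, mirroring the two-sample Z formula
se = np.sqrt(group_a.var(ddof=1) / len(group_a)
             + group_b.var(ddof=1) / len(group_b))
t_manual = (group_a.mean() - group_b.mean()) / se

print('t =', round(t_stat, 3), 'p =', round(p_value, 4))
if p_value < 0.05:
    print("we reject the null hypothesis")
else:
    print("we accept the null hypothesis")
```

The statistic has the same form as the Z statistic. The difference is that the p-value comes from the t distribution, which has heavier tails for small samples.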
Correlation Test
A correlation test is a metric to evaluate the extent to which variables are associated with one another.
Please note that the variables must be continuous to apply the correlation test.
There are several measures of correlation, e.g. covariance, the Pearson correlation coefficient, the Spearman rank correlation coefficient, etc.
We will use the Pearson correlation coefficient since it does not depend on the scale of the variables.
Pearson Correlation Coefficient
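It is used to measure the linear correlation between 2 variables. It is given by
$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} $$
where $\bar{x}$ and $\bar{y}$ are the means of X and Y.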
Its values lie between -1 and 1.
If the value of r is 0, it means there is no linear relationship between variables X and Y.
If the value of r is between 0 and 1, it means there is a positive relation between X and Y, and its strength increases from 0 to 1. Positive relation means if the value of X increases, the value of Y also increases.
If the value of r is between -1 and 0, it means there is a negative relation between X and Y, and its strength increases as r moves from 0 towards -1. Negative relation means if the value of X increases, the value of Y decreases.
Implementation
Here we will be using two continuous variables or features: Loan Amount and Applicant Income. We will use the value of the Pearson correlation coefficient to conclude whether there is a linear relation between Loan Amount and Applicant Income, and also draw a chart of the two.
Code
There are some missing values in the LoanAmount column, so we first filled them with the mean value. Then we calculated the correlation coefficient.
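The code for this step is not shown, so here is a minimal sketch of the two operations just described. The small frame `df_demo` holds made-up numbers and only stands in for the loan dataset. Using np.corrcoef is an assumption based on the 2 x 2 matrix in the output.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the loan dataset
df_demo = pd.DataFrame({
    'ApplicantIncome': [5849, 4583, 3000, 2583, 6000, 5417],
    'LoanAmount': [np.nan, 128.0, 66.0, 120.0, 141.0, 267.0],
})

# Fill missing loan amounts with the column mean
mean_loan = df_demo['LoanAmount'].mean()
df_demo['LoanAmount'] = df_demo['LoanAmount'].fillna(mean_loan)

# 2 x 2 correlation matrix; the off-diagonal entry is Pearson's r
corr_matrix = np.corrcoef(df_demo['LoanAmount'], df_demo['ApplicantIncome'])
r = corr_matrix[0, 1]
print(corr_matrix)
```

With the real data you would replace `df_demo` with the DataFrame loaded from the contest file.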
[[1.         0.56562046]
 [0.56562046 1.        ]]
The values on the diagonal are the correlation of each feature with itself. The value 0.56 indicates a moderate positive correlation between the two features.
We can also draw the chart as follows:
sns.lineplot(data=df,x='LoanAmount',y='ApplicantIncome')
ANOVA Test
ANOVA stands for Analysis of Variance. As the name suggests, it uses variance to compare the means of multiple independent groups. ANOVA can be one-way or two-way. One-way ANOVA is applied when a variable has three or more independent groups. We will implement it in Python.
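The F statistic is the ratio of the between-group variance to the within-group variance:
$$ F = \frac{\sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2 / (k-1)}{\sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2 / (N-k)} $$
where $k$ is the number of groups, $n_j$ the size of group $j$, $\bar{X}_j$ its mean, $\bar{X}$ the grand mean, and $N$ the total number of observations.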
Implementation
Here we will be using the Dependents categorical variable and the ApplicantIncome continuous variable. Dependents has 4 groups: 0, 1, 2, 3+. Therefore the hypothesis will be:
Null Hypothesis: There is no significant difference between the mean Income among different groups of dependents.
Alternate Hypothesis: There is a significant difference between the mean Income among different groups of dependents.
Code
First, we handled the missing values in the Dependents feature.
df['Dependents'].isnull().sum()
df['Dependents']=df['Dependents'].fillna('0')
After this, we created a data frame with the features Dependents and ApplicantIncome. Then with the help of scipy.stats library we calculated the F statistic and p-value.
from scipy import stats

df_anova = df[['ApplicantIncome','Dependents']]
grps = pd.unique(df.Dependents.values)
# one series of incomes per Dependents group
d_data = {grp:df_anova['ApplicantIncome'][df_anova.Dependents == grp] for grp in grps}
F, p = stats.f_oneway(d_data['0'], d_data['1'], d_data['2'], d_data['3+'])
print('F ={},p={}'.format(F,p))
F =5.955112389949444,p=0.0005260114222572804
Reject null hypothesis.
Since the p-value is less than 0.05, we reject the null hypothesis. Therefore, we conclude that there is a significant difference between the mean income of the different groups of Dependents.
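To see what stats.f_oneway computes, the following sketch calculates the F statistic by hand as the ratio of between-group to within-group variance and compares it with SciPy's result. The three groups contain made-up numbers, not values from the loan dataset.

```python
import numpy as np
from scipy import stats

# Made-up values for three independent groups
groups = [
    np.array([4.0, 5.0, 6.0, 5.5]),
    np.array([6.5, 7.0, 8.0, 7.5]),
    np.array([5.0, 5.5, 6.5, 6.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Sum of squares between groups and within groups
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = between-group variance / within-group variance
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
# p-value from the F distribution with (k - 1, N - k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print('manual:', f_manual, p_manual)
print('scipy :', f_scipy, p_scipy)
```

Both lines should print the same F statistic and p-value, up to floating-point rounding.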
Chi-Square Test
This test is applied when you have two categorical variables from a population. It is used to determine whether there is a significant association or relationship between the two variables.
There are 2 types of chi-square tests: the chi-square goodness of fit test and the chi-square test for independence. We will implement the latter.
The degree of freedom in the chi-square test is calculated by (n-1)*(m-1) where n and m are numbers of rows and columns respectively.
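The chi-square statistic is given by:
$$ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} $$
where $O_i$ are the observed frequencies and $E_i$ the expected frequencies.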
Implementation
We will be using two categorical features Gender and Loan Status and find whether there is an association between them using the chi-square test.
Null Hypothesis: There is no significant association between Gender and Loan Status features.
Alternate Hypothesis: There is a significant association between Gender and Loan Status features.
Code
First, we retrieve the Gender and Loan_Status columns and form a contingency table.
dataset_table=pd.crosstab(df['Gender'],df['Loan_Status'])
dataset_table
Loan_Status   N    Y
Gender
Female       37   75
Male         33  339
Then, we calculate observed and expected values using the above table.
observed=dataset_table.values
val2=stats.chi2_contingency(dataset_table)
expected=val2[3]
Then we calculate the chi-square statistic and p-value using the following code:
from scipy.stats import chi2

alpha=0.05  # significance level
# degrees of freedom: (rows-1)*(columns-1)
ddof=(observed.shape[0]-1)*(observed.shape[1]-1)
chi_square=sum([(o-e)**2./e for o,e in zip(observed,expected)])
chi_square_statistic=chi_square[0]+chi_square[1]
p_value=1-chi2.cdf(x=chi_square_statistic,df=ddof)
print("chi-square statistic:-",chi_square_statistic)
print('Significance level: ',alpha)
print('Degree of Freedom: ',ddof)
print('p-value:',p_value)
chi-square statistic:- 0.23697508750826923
Significance level: 0.05
Degree of Freedom: 1
p-value: 0.6263994534115932
if p_value<=alpha:
    print("Reject Null Hypothesis")
else:
    print("Accept Null Hypothesis")
Accept Null Hypothesis
Since the p-value is greater than 0.05, we accept the null hypothesis. We conclude that there is no significant association between the two features.
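As a cross-check on the manual calculation, scipy.stats.chi2_contingency can return the statistic, p-value, and degrees of freedom directly. The sketch below uses a made-up 2 x 2 table rather than the loan data. Passing `correction=False` is my choice here so that the result matches the plain formula, because SciPy applies Yates' continuity correction to 2 x 2 tables by default.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Made-up 2 x 2 table of counts (rows: two groups, columns: N / Y)
observed = np.array([[30, 70],
                     [45, 155]])

# correction=False turns off Yates' continuity correction, so the
# statistic equals the plain sum of (O - E)^2 / E
stat, p_value, dof, expected = chi2_contingency(observed, correction=False)

# The same quantities by hand
chi_square_manual = ((observed - expected) ** 2 / expected).sum()
p_manual = chi2.sf(chi_square_manual, dof)

print('statistic:', stat, 'p-value:', p_value, 'dof:', dof)
print('manual   :', chi_square_manual, 'p-value:', p_manual)
```

If you leave the correction at its default, the statistic for a 2 x 2 table will be somewhat smaller than the hand-computed value.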
Summary
First, we discussed feature selection. Then we moved to statistical tests and the terminology related to them. Lastly, we saw the application of statistical tests, i.e., the Z-test, T-test, correlation test, ANOVA test, and Chi-square test, along with their implementation in Python.
References
Featured Image – Google Image
Statistics – statisticswho.com
About Me
Hi! I am Ashish Choudhary. I am pursuing a B.Tech at the JC Bose University of Science & Technology. Data Science is my passion, and I feel proud to write interesting blogs related to it. Feel free to contact me on LinkedIn.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.