This article was published as a part of the Data Science Blogathon.
Hello there! Data science is a vastly growing and rapidly improving field. New algorithms and new models are being created more frequently than ever before. However, model implementation and machine learning algorithms come into play only after data preprocessing. Below is a sneak peek at some important things to keep in mind.
The first question that arises is: what is data preprocessing? In layman's terms: before grinding wheat, we clean and wash it so that all the garbage is removed. Data preprocessing does the same with data. It cleans the data and removes the garbage so that, in the end, only quality data remains.
That’s why this is a very important step when creating a model.
Below are the 5 things to keep in mind during Data Preprocessing:
1. KNOW YOUR DATA
It is a good habit to gather some information about the data, or at least about the parameters present in it, either by looking up the parameters on the internet or by talking to an expert in that field. For example, while building a predictive model for air-quality forecasting, we can talk to a chemistry expert who can tell us what the parameters are and how they behave. This step clears up many things about the dataset and lets us understand what is going on in it.
In a predictive model, we are generally trying to predict one parameter by finding out the effect of the other parameters present in the dataset. The variable being predicted is called the dependent variable, and the rest are called independent variables. For example, suppose we are building a model that predicts the concentration of ozone for the next 24 hours using previously recorded data. Ozone is our dependent variable, and the rest of the parameters are our independent variables.
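The split described above can be sketched in pandas. This is a minimal illustration with a made-up air-quality dataset; the column names and values are assumptions, not real measurements.

```python
import pandas as pd

# Hypothetical air-quality dataset; column names and values are made up
df = pd.DataFrame({
    "carbon_monoxide": [0.4, 0.6, 0.5, 0.8],
    "wind_speed":      [3.1, 2.4, 4.0, 1.8],
    "ozone":           [31.0, 45.0, 38.0, 52.0],
})

X = df.drop(columns=["ozone"])   # independent variables
y = df["ozone"]                  # dependent (target) variable
print(X.columns.tolist(), y.name)
```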
2. IDENTIFY THE VARIABLE/PARAMETERS
There are two types of variables:
- Categorical Variables
- Continuous Variables
Categorical Variables
Categorical variables are variables that define two or more categories. For example, gender is a categorical variable, as it has categories such as male and female. Categorical variables are treated differently from continuous variables.
Continuous Variables
Continuous variables are variables that can take any continuous value. For example, weight, height, and age are continuous variables.
Once we have identified these two types of variables, we can proceed accordingly.
For example, for a categorical variable we can use a OneHotEncoder, which converts each category into a numeric indicator column so that it can be fed to the model. For example, a gender column containing Male and Female becomes two binary columns: one marking the Male rows and one marking the Female rows.
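A minimal sketch of this with scikit-learn's OneHotEncoder; the gender values are made up for illustration:

```python
from sklearn.preprocessing import OneHotEncoder

# Example categorical data (values are made up for illustration)
genders = [["Male"], ["Female"], ["Female"], ["Male"]]

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix; .toarray() makes it dense
encoded = encoder.fit_transform(genders).toarray()

print(encoder.categories_)  # categories are sorted alphabetically
print(encoded)              # 'Female' -> [1, 0], 'Male' -> [0, 1]
```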
3. FINDING CORRELATION BETWEEN PARAMETERS
Finding correlation values between parameters (generally between the independent and dependent variables) is a great way to discover the relations between them. A correlation value lies between -1 and 1; if the value for a pair of parameters is near -1 or 1, those parameters are strongly correlated.
If it is 1, the two parameters are directly proportional to each other; if it is -1, they are inversely proportional.
If the value lies near 0, they are weakly correlated, which means they hardly depend on each other.
From this analysis, we can remove the parameters that barely affect our dependent variable. For example, in the ozone prediction model, carbon monoxide and ozone have a correlation value of -0.6, while wind speed and ozone have a correlation value of -0.1. We can therefore remove the wind speed parameter from our dataset and improve the accuracy of the model, which would otherwise have been diluted by including it.
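This analysis can be sketched with pandas' built-in `corr()`. The dataset below is hypothetical, with values chosen only so that carbon monoxide comes out strongly correlated with ozone and wind speed weakly correlated; the 0.3 cutoff is an arbitrary illustrative threshold, not a universal rule.

```python
import pandas as pd

# Hypothetical ozone dataset; values chosen only to illustrate .corr()
df = pd.DataFrame({
    "ozone":           [30, 42, 25, 50, 38],
    "carbon_monoxide": [0.9, 0.5, 1.1, 0.3, 0.6],
    "wind_speed":      [2.0, 3.5, 2.2, 2.1, 3.4],
})

# Correlation of every independent variable with the target
correlations = df.corr()["ozone"].drop("ozone")
print(correlations)

# Drop parameters whose |correlation| falls below a chosen threshold
weak = correlations[correlations.abs() < 0.3].index
df_reduced = df.drop(columns=weak)
print(df_reduced.columns.tolist())
```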
4. TREATING MISSING VALUES AND EXCEPTIONS RIGHT
All data has to be recorded, so it is very likely that some errors crept in during recording. There might be missing values, NaN values, or absurd values: for example, a value of 200 in a human-age parameter is not possible in practice, so it is a mistake in the dataset. All these mistakes affect the integrity of our dataset, so they need to be cleaned up before the dataset is fed to the model.
For treating missing values, we can do a couple of things depending on the dataset. We can replace a missing value with the mean or the mode of that parameter; generally, we use the mean.
Sometimes we instead need to delete the rows where the missing values are present, or, if a parameter is missing for most of its entries, delete that entire column.
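Both treatments can be sketched in a few lines of pandas. The ages below are made up, with 200 standing in for the kind of impossible recording error mentioned above (the 120 cutoff is an assumed sanity bound):

```python
import numpy as np
import pandas as pd

# Made-up age data; 200 represents an impossible recording error
df = pd.DataFrame({"age": [23, np.nan, 31, 200, 27]})

# Treat the absurd value as missing too (120 is an assumed sanity bound)
df.loc[df["age"] > 120, "age"] = np.nan

# Option 1: replace missing values with the mean of the parameter
df_mean = df.fillna(df["age"].mean())

# Option 2: drop the rows that contain missing values
df_dropped = df.dropna()

print(df_mean)
print(df_dropped)
```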
5. SCALING THE DATA
Whether scaling is needed depends on the type of model we are using; even so, it is a very important concept in data preprocessing. Suppose we have a dataset that contains parameters A, B, C, and D.
Parameter A contains values between 0-10
Parameter B contains values between 10-1000
Parameter C contains values between 1000-10,000
Parameter D contains values between 0-1
We can see that the ranges of these parameters differ enormously, so one parameter might totally dominate the others. To reduce this, increase accuracy, and create a balanced dataset where all the parameters are standardized, meaning they lie in the same range, we scale the values.
The most commonly used scaler in Python is the StandardScaler, which transforms each parameter to have a mean of 0 and a standard deviation of 1, so most values end up roughly between -3 and +3.
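A minimal sketch with scikit-learn's StandardScaler. The two columns below mimic parameters with very different ranges, like A (0-10) and C (1000-10,000) above; the numbers are made up:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up parameters with very different ranges,
# mimicking parameters A (0-10) and C (1000-10,000) above
X = np.array([
    [1.0, 1000.0],
    [5.0, 4000.0],
    [9.0, 9000.0],
])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # each column now has mean ~0
print(X_scaled.std(axis=0))   # and standard deviation ~1
```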
Hola! That’s the end of it. I hope everyone learned something from it. This was just a small and very brief set of tips about data preprocessing.
This blog was written for the Blogathon competition that is happening right now, and it is actually my first ever blog. I am currently a college student in the final year of my MSc, aspiring to become a data scientist. Here is my LinkedIn profile: https://www.linkedin.com/in/harsh-kumar-jangir-5545a0174/
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.