Missing values in a dataset can cause problems when building an estimator. Scikit-learn provides several ways to handle missing data, including imputation. Imputation fills in missing values with estimates based on the other available data in the dataset.
Related concepts:
- Missing Data: Missing data refers to the absence of values in a dataset. It can occur for several reasons, such as human error, technical error, or data corruption.
- Imputation: Imputation is the process of filling in missing values with estimates based on patterns in the available data.
- Scikit-learn: Scikit-learn is a popular machine learning library for Python that provides tools for data preprocessing, feature selection, and model building.
- Estimator: In machine learning, an estimator is an algorithm or model that learns from the data and is used to make predictions on new data.
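To make the idea of imputation concrete, here is a minimal sketch that fits SimpleImputer with two of its strategies and prints the per-column fill value each one learns. The toy array and its values are assumptions made up for illustration. They are not taken from a real dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (illustrative values); np.nan marks the missing entries.
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [np.nan, 30.0],
                  [6.0, 50.0]])

fill_values = {}
for strategy in ("mean", "median"):
    imputer = SimpleImputer(strategy=strategy)
    X_filled = imputer.fit_transform(X_toy)
    # statistics_ holds the value used to fill each column,
    # computed from the non-missing entries of that column.
    fill_values[strategy] = imputer.statistics_
    print(strategy, "fill values per column:", imputer.statistics_)
```

The mean is sensitive to outliers, so the median is often a safer choice for skewed columns. Which strategy fits best depends on the data.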
Steps needed:
The following steps are required to impute missing values before building an estimator in scikit-learn:
- Import the required libraries: First, import the required libraries, including scikit-learn and NumPy.
- Load the dataset: Load the dataset that contains missing values.
- Identify missing values: Identify the missing values in the dataset.
- Impute missing values: Use scikit-learn’s SimpleImputer class to impute the missing values in the dataset.
- Build the estimator: Build the estimator on the imputed data. Here we use the linear regression algorithm.
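The "identify missing values" step can be sketched with NumPy alone. The snippet below assumes missing entries are encoded as np.nan in a small illustrative array, and summarizes how many values are missing per column and which rows are affected.

```python
import numpy as np

# Small illustrative array; np.nan marks the missing entries.
X_check = np.array([[1, 2, np.nan],
                    [3, np.nan, 4],
                    [5, 6, np.nan],
                    [7, 8, 9]])

# Boolean mask: True where a value is missing.
missing_mask = np.isnan(X_check)

# Count of missing values in each column.
missing_per_column = missing_mask.sum(axis=0)

# True for every row that has at least one missing value.
rows_with_missing = missing_mask.any(axis=1)

print("Missing per column:", missing_per_column)
print("Rows with missing values:", rows_with_missing)
```

A per-column count like this helps decide whether imputing is reasonable, since a column that is mostly missing carries little information to estimate from.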
Examples
Let’s consider a dataset containing missing values. The following code imputes the missing values using scikit-learn’s SimpleImputer class and then fits a linear regression model:
Python
# Import the required libraries
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
import numpy as np

# Load the dataset (np.nan marks the missing values)
X = np.array([[1, 2, np.nan],
              [3, np.nan, 4],
              [5, 6, np.nan],
              [7, 8, 9]])
Y = np.array([14, 20, 29, 40])

# Identify missing values
print('Check Null values \n', np.isnan(X))

# Impute missing values with the mean of each column
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Build the estimator
regressor = LinearRegression()
regressor.fit(X_imputed, Y)
print('\nCoefficient :', regressor.coef_)
print('Intercept :', regressor.intercept_)

# Prediction (equivalent to regressor.predict(X_imputed))
Y_pred = X_imputed @ regressor.coef_ + regressor.intercept_
print("Prediction :", Y_pred)
Output:
Check Null values 
 [[False False  True]
 [False  True False]
 [False False  True]
 [False False False]]

Coefficient : [2.25 1.5  1.4 ]
Intercept : -0.3499999999999943
Prediction : [14. 20. 29. 40.]
In the above example, we first loaded a dataset containing missing values. We then identified the missing values using NumPy’s np.isnan function, and used scikit-learn’s SimpleImputer class to replace each missing value with the mean of its column. Finally, we built a linear regression estimator on the imputed dataset.
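As a possible variation, the imputation and regression steps can be chained with scikit-learn’s make_pipeline, so that the column means learned from the training data are reused automatically when predicting on new data. This is a minimal sketch under that assumption. The new sample X_new is a made-up row for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Same training data as in the example; np.nan marks missing values.
X = np.array([[1, 2, np.nan],
              [3, np.nan, 4],
              [5, 6, np.nan],
              [7, 8, 9]])
Y = np.array([14, 20, 29, 40])

# Chain mean imputation and linear regression into one estimator.
model = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
model.fit(X, Y)

# Hypothetical new sample with a missing second feature.
# The pipeline fills it with the column mean learned during fit.
X_new = np.array([[2.0, np.nan, 5.0]])
pred_new = model.predict(X_new)
print("Prediction for new sample:", pred_new)
```

Keeping the imputer inside the pipeline also helps avoid leaking information from test data into the fill values when the model is evaluated with cross-validation.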