Introduction
This beginner's tutorial is about interpolation, a technique for estimating unknown data points that lie between known ones. In Python, it is mostly used to impute missing values in a DataFrame or Series while preprocessing data. You can use it to estimate missing data points whether you run Python inside Power BI or prepare data for machine learning algorithms. Interpolation is also used in image processing: when an image is enlarged, the value of each new pixel is estimated from its neighboring pixels.
Learning Objectives
- In this tutorial on data science and machine learning, we will learn to handle missing data and preprocess it before feeding it to a machine learning model.
- We will also learn how to handle missing data in Python with the pandas library (pandas interpolate) and the SciPy library.
This article was published as a part of the Data Science Blogathon.
When to Use Interpolation?
We can use interpolation to estimate a missing (null) value from its neighbors. When imputing missing values with the average is a poor fit, we need a different technique, and interpolation is the one most people reach for.
Interpolation is mostly used with time-series data, where we prefer to fill a missing value from the values closest to it in time. Take temperature as an example: we would rather fill in today's missing temperature with the mean of the last two days than with the mean of the whole month. We can also use interpolation when calculating moving averages.
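The temperature example above can be sketched as follows. This is only an illustration under assumed data: the `temps` values are made up, and combining `shift`, `rolling`, and `fillna` is one possible way to express "mean of the last two days", not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperatures with one missing reading.
temps = pd.Series([20.0, 22.0, np.nan, 25.0])

# For every row, the mean of the two observations that precede it.
prev_two_mean = temps.shift(1).rolling(window=2).mean()

# Only the missing positions take the rolling value; known readings are kept.
filled = temps.fillna(prev_two_mean)
print(filled.tolist())
```

The missing third reading becomes 21.0, the mean of 20.0 and 22.0, instead of a value pulled toward the overall average.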
Using Interpolation to Fill Missing Values in Series Data
A pandas Series is a one-dimensional labeled array that can hold elements of various data types. We can easily create one from a list, tuple, or dictionary. To try out the interpolation methods, we will create a pandas Series with a NaN value and fill it using interpolate() and a few related methods.
import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7])
Linear Interpolation
Linear interpolation estimates a missing value by drawing a straight line between the known points on either side of it, taken in row order. Linear is the default method of interpolate(), so we need not specify it.
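As a minimal sketch of the default behavior, the snippet below rebuilds the series from this section and calls `interpolate()` without arguments. The variable name `filled` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

a = pd.Series([0, 1, np.nan, 3, 4, 5, 7])

# No method given, so pandas falls back to method="linear".
filled = a.interpolate()
print(filled.tolist())
```

Passing `method="linear"` explicitly gives the same result.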
Applying a.interpolate() to the series above fills the missing value at index 2 with 2.0, the midpoint of its neighbors 1 and 3.
Hence, linear interpolation follows the order of the rows. Remember that it does not use the index when interpolating; it treats the points as equally spaced and connects them with straight lines.
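To see what ignoring the index means in practice, here is a small sketch with a deliberately uneven index. The data is hypothetical; `method="index"` is the pandas option that does take the numeric index values into account.

```python
import numpy as np
import pandas as pd

# Hypothetical series whose index labels are unevenly spaced: 0, 1, 10.
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# "linear" treats the rows as equally spaced, so the gap lands halfway.
by_position = s.interpolate(method="linear")

# "index" uses the index values, so label 1 sits close to label 0.
by_index = s.interpolate(method="index")

print(by_position.tolist())
print(by_index.tolist())
```

With `method="linear"` the gap is filled with 5.0, while `method="index"` fills it with a value of about 1.0 because label 1 is one tenth of the way from 0 to 10.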
Polynomial Interpolation
In polynomial interpolation, you need to specify an order. Pandas then fits a polynomial of that order through the available data points and reads the missing values off the resulting curve, which bends like a parabola for order 2 instead of running straight between neighbors. This method relies on SciPy, so the scipy library must be installed.
a.interpolate(method="polynomial", order=2)
If you pass an order of 1, the output matches the linear result because a polynomial of order 1 is a straight line.
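The difference between a straight line and a curved fit is easier to see on data that actually curves. The sketch below is an illustration only: it uses NumPy's `polyfit` and `polyval` in place of pandas' SciPy-backed `method="polynomial"`, and the series of squares is made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical data following y = x**2, with the value at x = 2 missing.
s = pd.Series([0.0, 1.0, np.nan, 9.0, 16.0])

# Fit a degree-2 polynomial through the known points (NumPy stand-in).
known = s.dropna()
coeffs = np.polyfit(known.index.to_numpy(dtype=float), known.to_numpy(), deg=2)

# Evaluate the fitted curve at the missing positions only.
missing = s.isna().to_numpy()
quadratic_fill = s.copy()
quadratic_fill.loc[missing] = np.polyval(coeffs, s.index[missing].to_numpy(dtype=float))

# Linear interpolation for comparison.
linear_fill = s.interpolate()

print(round(quadratic_fill.iloc[2], 6), linear_fill.iloc[2])
```

The quadratic fit recovers a value of about 4.0, which is what the curve suggests, whereas the straight line between 1 and 9 gives 5.0.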
Interpolation Through Padding
Padding means filling each missing value with the last valid value above it in the dataset. If the missing value is in the first row, this method cannot fill it because there is no earlier value. You can optionally pass limit, the maximum number of consecutive NaN values to fill.
If you are working on a real-world project and want every missing value filled with the previous value, leave limit at its default of None; there is no need to set it to the number of rows in the dataset.
# Recent pandas versions deprecate interpolate(method="pad") in favor of ffill()
a.ffill(limit=2)
The missing value at index 2 is replaced by 1.0, the value immediately before it.
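A short sketch of how `limit` and a leading NaN behave under padding. It is written with `ffill()`, which newer pandas versions recommend over `interpolate(method="pad")`, and the series is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical series: a NaN in the first row and a run of three NaNs.
s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 5.0])

# No limit: every NaN that has an earlier value gets filled.
all_filled = s.ffill()

# limit=2: at most two consecutive NaNs are filled, the third stays.
limited = s.ffill(limit=2)

print(all_filled.tolist())
print(limited.tolist())
```

In both results the first row stays NaN, since nothing precedes it. With `limit=2`, the third NaN in the run is also left unfilled.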
Using Interpolation to Fill Missing Values in Pandas DataFrame
DataFrame is a widely used pandas data structure that stores data in rows and columns. When performing data analysis, we usually keep the data in such a table. The dropna() function drops the rows (or columns) that contain null values, but that discards data. A data frame can have many missing values across many columns, so let us see how to use interpolation to fill them instead.
(Note: interpolate() returns a new object. To keep the result, assign it back to a variable or pass inplace=True.)
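The note above can be checked with a tiny sketch. The `scores` frame is hypothetical, and assigning the result back is shown as one way to keep the filled values.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({"A": [1.0, np.nan, 3.0]})  # hypothetical data

# Calling interpolate() without keeping the result leaves `scores` untouched.
scores.interpolate()
still_missing = int(scores["A"].isna().sum())

# Assigning the result back is what saves the filled values.
scores = scores.interpolate()
print(still_missing, scores["A"].tolist())
```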
import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A": [12, 4, 7, None, 2],
                   "B": [None, 3, 57, 3, None],
                   "C": [20, 16, None, 3, 8],
                   "D": [14, 3, None, None, 6]})
Linear Interpolation in the Forwarding Direction
The linear method ignores the index, treats the values as equally spaced, and places each missing value on the straight line between its nearest known neighbors. With limit_direction='forward', a missing value in the first row is left as NaN because no earlier value exists. Let's apply DataFrame.interpolate to our data frame.
df.interpolate(method ='linear', limit_direction ='forward')
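In the output, the interior gaps are filled: A at row 3 becomes 4.5, C at row 2 becomes 9.5, and D at rows 2 and 3 becomes 4.0 and 5.0. The NaN in the first row of B stays NaN, while the NaN in its last row is filled with the last valid value, 3.0.
If you only want to interpolate a single column, select it first: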
df['C'].interpolate(method="linear")
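The forward-direction behavior can be verified with a self-contained sketch that rebuilds the data frame from this section (using `np.nan` in place of `None`, which pandas stores the same way in numeric columns). The names `forward` and `col_c` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [12, 4, 7, np.nan, 2],
                   "B": [np.nan, 3, 57, 3, np.nan],
                   "C": [20, 16, np.nan, 3, 8],
                   "D": [14, 3, np.nan, np.nan, 6]})

# Whole frame, filling downward along each column.
forward = df.interpolate(method="linear", limit_direction="forward")

# A single column works the same way.
col_c = df["C"].interpolate(method="linear")

print(forward)
print(col_c.tolist())
```

Note how column B keeps its NaN in the first row, since a forward fill has nothing before it to work from.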
Linear Interpolation in Backward Direction (bfill)
The method is the same; only the direction of filling changes. With limit_direction='backward', pandas fills from the bottom of the data frame toward the top, so a NaN in the last row is left alone while a NaN in the first row gets filled.
df.interpolate(method ='linear', limit_direction ='backward')
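Interior gaps get the same values as in the forward direction. In column B, the NaN in the first row is now filled with 3.0, the first valid value, and the NaN in the last row remains.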
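A brief sketch comparing the backward direction with `limit_direction="both"`, which fills the edges on either side. Only column B of the data frame is rebuilt here, since it has a NaN at each end.

```python
import numpy as np
import pandas as pd

b = pd.Series([np.nan, 3, 57, 3, np.nan], name="B")

# Backward: the leading NaN is filled, the trailing one is not.
backward = b.interpolate(method="linear", limit_direction="backward")

# Both: leading and trailing NaNs are filled from the nearest valid value.
both = b.interpolate(method="linear", limit_direction="both")

print(backward.tolist())
print(both.tolist())
```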
Interpolation With Padding
Padding works the same way on a data frame. Our data frame has at most 2 consecutive NaN values in any column, so a limit of 2 is enough to fill them all. Recent pandas versions deprecate method="pad" in interpolate() in favor of ffill(), which is used here.
df.ffill(limit=2)
Each missing value is replaced by the last valid value above it in the same column; for example, both NaN values in column D become 3.0. The NaN in the first row of column B stays as it is because there is no earlier value to copy.
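To show why the limit matters for this data, the sketch below pads column D of the data frame with two different limits. The column is rebuilt as a standalone series so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

d = pd.Series([14, 3, np.nan, np.nan, 6], name="D")

# limit=1 fills only the first NaN of the run.
pad_one = d.ffill(limit=1)

# limit=2 covers the whole run of two NaNs.
pad_two = d.ffill(limit=2)

print(pad_one.tolist())
print(pad_two.tolist())
```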
Filling Missing Values in Time-Series Data
Time-series (datetime) data is data recorded over time, and it often shows a trend or seasonality. For a missing value, it therefore makes sense to interpolate from the observations just before and after its timestamp. Analyzing time-series data is a little different from working with ordinary data frames. A global mean ignores the trend, so mean imputation is usually a poor fit for time-series data. Interpolation is a powerful method to fill in missing values in this setting.
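# Hourly timestamps; newer pandas versions prefer the lowercase alias 'h'
df = pd.DataFrame({'Date': pd.date_range(start='2021-07-01', periods=10, freq='H'), 'Value': np.arange(10, dtype=float)})
# Blank out two consecutive readings (row labels 2 and 3)
df.loc[2:3, 'Value'] = np.nan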
Syntax for Filling Missing Values in Forwarding and Backward Methods
The simplest way to fill values using interpolation is the same call we apply to any column of a data frame.
df['Value'].interpolate(method="linear")
However, when the data has a date column, it makes more sense to fill missing values according to the timestamps. Set the date column as the index and use method="time", which weights each estimate by the actual time gaps; method="linear" would ignore the index and treat the rows as equally spaced.
df.set_index('Date')['Value'].interpolate(method="time")
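On evenly spaced timestamps, position-based and time-based interpolation agree, so the difference only shows up when the gaps are irregular. The sketch below uses hypothetical readings at 00:00, 01:00, and 04:00 to illustrate this.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with uneven spacing: 1 hour, then 3 hours.
idx = pd.to_datetime(["2021-07-01 00:00", "2021-07-01 01:00", "2021-07-01 04:00"])
ts = pd.Series([0.0, np.nan, 4.0], index=idx)

# "linear" ignores the timestamps and lands halfway between 0 and 4.
by_position = ts.interpolate(method="linear")

# "time" uses the DatetimeIndex: 01:00 is a quarter of the way to 04:00.
by_time = ts.interpolate(method="time")

print(by_position.iloc[1], round(by_time.iloc[1], 6))
```

The position-based estimate is 2.0, while the time-based estimate is about 1.0, which better reflects that the missing reading was taken only one hour after the first.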
To fill missing values in the backward direction instead, so that each gap takes the next observed value, use bfill(). Recent pandas versions deprecate passing method="backfill" to fillna() in favor of this call.
df.set_index('Date')['Value'].bfill()
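As a self-contained sketch, the snippet below rebuilds the hourly data from this section and compares backward and forward filling. The frequency is passed as a `pd.Timedelta` here only to avoid depending on a particular spelling of the hourly alias; the names `back` and `fwd` are introduced for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2021-07-01", periods=10, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"Date": dates, "Value": np.arange(10, dtype=float)})
df.loc[2:3, "Value"] = np.nan  # two consecutive missing readings

series = df.set_index("Date")["Value"]

# Backward fill: each gap takes the next observed value (4.0).
back = series.bfill()

# Forward fill: each gap takes the previous observed value (1.0).
fwd = series.ffill()

print(back.iloc[2:4].tolist(), fwd.iloc[2:4].tolist())
```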
Conclusion
We have learned several ways to use the interpolate function in Python to fill in missing values in a Series as well as in a DataFrame. Knowing how to use it matters for data scientists and analysts, as handling missing values is a crucial part of their everyday job. In many cases, especially with ordered data such as time series, interpolation is a better way to fill in missing values than a simple average. I hope you now know the power of interpolation and understand how to use it.
Key Takeaways
- We can read Excel and CSV files into pandas and apply the interpolate function to the resulting data.
- We can fill in missing values in both forward and backward directions.
Frequently Asked Questions
Q1. What is the interpolation method for missing data?
A. There are multiple methods to interpolate missing data, like linear and polynomial interpolation.
Q2. How do you fill in missing values in a time series in Python?
A. We can impute missing values in time-series data by filling them with the last or the next observed value, or by interpolating between the two.
Q3. What are the advantages of interpolation in Python?
A. Interpolation determines unknown values that lie between known data points. Its advantage is that each estimate follows the neighboring values instead of a single overall average.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.