Create a Pipeline in Pandas

27 July 2024

1

Pipelines play a useful role in transforming and manipulating tons of data. Pipeline are a sequence of data processing mechanisms. Pandas pipeline feature allows us to string together various user-defined Python functions in order to build a pipeline of data processing. There are two ways to create a Pipeline in pandas. By calling .pipe() function and by importing pdpipe package.

Through pandas pipeline function i.e. pipe() function we can call more than one function at a time and in a single line for data processing. Let’s understand and create a pipeline by using the pipe() function.

Below are various examples that depict how to create a pipeline using pandas.

Example 1:

Python3

# importing pandas library
import pandas as pd
 
# Create empty dataframe
df = pd.DataFrame()
 
# Creating a simple dataframe
df['name'] = ['Reema', 'Shyam', 'Jai', 
              'Nimisha', 'Rohit', 'Riya']
df['gender'] = ['Female', 'Male', 'Male', 
                'Female', 'Male', 'Female']
df['age'] = [31, 32, 19, 23, 28, 33]
 
# View dataframe
df

Output:

Now, creating functions for data processing.

Python3

# function to find mean
def mean_age_by_group(dataframe, col):
   
    # groups the data by a column and 
    # returns the mean age per group
    return dataframe.groupby(col).mean()
   
# function to convert to uppercase
def uppercase_column_name(dataframe):
   
    # Converts all the column names into uppercase
    dataframe.columns = dataframe.columns.str.upper()
     
    # And returns them
    return dataframe  

Now, creating a pipeline using .pipe() function.

Python3

# Create a pipeline that applies both the functions created above
pipeline = df.pipe(mean_age_by_group, col='gender').pipe(uppercase_column_name)
 
# calling pipeline
pipeline

Output:

Now, let’s understand and create a pipeline by importing pdpipe package.

The pdpipe Python package provides a concise interface for building pandas pipelines that have pre-conditions. The pdpipe is a pre-processing pipeline package for Python’s panda data frame. The pdpipe API helps to easily break down or compose complex-ed panda processing pipelines with few lines of codes.

We can install this package by simply writing:

pip install pdpipe

Example 2:

Python3

# importing the package
import pdpipe as pdp
import pandas as pd
 
# creating a empty dataframe named dataset
dataset = pd.DataFrame()
 
# Creating a simple dataframe
dataset['name'] = ['Reema', 'Shyam', 'Jai',
                   'Nimisha', 'Rohit', 'Riya']
 
dataset['gender'] = ['Female', 'Male', 'Male',
                     'Female', 'Male', 'Female']
 
dataset['age'] = [31, 32, 19, 23, 28, 33]
 
dataset['department'] = ['Accounts', 'Management',
                         'IT', 'IT', 'Management',
                         'Advertising']
 
dataset['index'] = [1, 2, 3, 4, 5, 6]
 
# View dataframe
dataset

Output:

Removing a column from dataframe using pdpipe.

Python3

# creating a pipeline and 
# dropping the unwanted column
dropCol = pdp.ColDrop("index").apply(dataset)
 
# display the new dataframe 
# after column drop
dropCol

Output:

There is another way to drop columns through pdpipe.

Python3

# creating a pipeline and 
# dropping the unwanted column
dropCol2 = pdp.ColDrop("index")
 
# applying the ColDrop to dataframe
df2 = dropCol2(dataset)
 
# display dataframe
df2

Output:

Here, the column is dropped in two steps. In the first step, we created a pipeline and in the second step, we applied it to the dataframe.

Example 3:

Now we are adding one column to dataframe using pdpipe.

Python3

# importing the package
import pdpipe as pdp
import pandas as pd
 
# creating a empty dataframe named dataset
dataset = pd.DataFrame()
 
# Creating a simple dataframe
dataset['name'] = ['Reema', 'Shyam', 'Jai',
                   'Nimisha', 'Rohit', 'Riya']
 
dataset['gender'] = ['Female', 'Male', 'Male',
                     'Female', 'Male', 'Female']
 
dataset['age'] = [31, 32, 19, 23, 28, 33]
 
dataset['department'] = ['Accounts', 'Management',
                         'IT', 'IT', 'Management',
                         'Advertising']
 
dataset['index'] = [1, 2, 3, 4, 5, 6]
 
# View dataframe
dataset

Output:

Now, dropping the values from dataframe.

Python3

#dropping the values using ValDrop
df3 = pdp.ValDrop(['IT'],'department').apply(dataset)
 
#display dataframe
df3

Output:

The row containing ‘ IT ‘ value is dropped.

Create a Pipeline in Pandas

Python3

Python3

Python3

Python3

Python3

Python3

Example 3:

Python3

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

Vietnam’s Success in Software Outsourcing

Install Python 3 / Python 2.7 on Rocky Linux 8 |AlmaLinux 8

How To Manage Angular JS Projects using Angular CLI

How To Install PHP 8.2 on Ubuntu 22.04|20.04|18.04

Recent Comments

EDITOR PICKS

Vietnam’s Success in Software Outsourcing

Install Python 3 / Python 2.7 on Rocky Linux 8 |AlmaLinux 8

How To Manage Angular JS Projects using Angular CLI

POPULAR POSTS

Vietnam’s Success in Software Outsourcing

Install Python 3 / Python 2.7 on Rocky Linux 8 |AlmaLinux 8

How To Manage Angular JS Projects using Angular CLI

POPULAR CATEGORY

ABOUT US

FOLLOW US