Pipelines are useful for transforming and manipulating large amounts of data. A pipeline is a sequence of data processing steps, where the output of one step becomes the input of the next. The pandas pipeline feature lets us string together user-defined Python functions to build such a sequence. There are two ways to create a pipeline in pandas: by calling the .pipe() method, or by using the pdpipe package.
With the pandas .pipe() method we can chain several functions in a single expression, where each function receives the dataframe returned by the previous one. Let’s create a pipeline using the .pipe() method.
Below are various examples that depict how to create a pipeline using pandas.
Example 1:
Python3
# importing pandas library
import pandas as pd

# Create empty dataframe
df = pd.DataFrame()

# Creating a simple dataframe
df['name'] = ['Reema', 'Shyam', 'Jai', 'Nimisha', 'Rohit', 'Riya']
df['gender'] = ['Female', 'Male', 'Male', 'Female', 'Male', 'Female']
df['age'] = [31, 32, 19, 23, 28, 33]

# View dataframe
df
Output:
Now, create the functions for data processing.
Python3
# function to find mean
def mean_age_by_group(dataframe, col):
    # groups the data by a column and returns the mean age per group;
    # numeric_only=True skips the text column 'name', which recent
    # pandas versions would otherwise refuse to average
    return dataframe.groupby(col).mean(numeric_only=True)

# function to convert to uppercase
def uppercase_column_name(dataframe):
    # Converts all the column names into uppercase
    dataframe.columns = dataframe.columns.str.upper()
    # And returns the modified dataframe
    return dataframe
Now, create a pipeline using the .pipe() method.
Python3
# Create a pipeline that applies both the functions created above
pipeline = (
    df.pipe(mean_age_by_group, col='gender')
      .pipe(uppercase_column_name)
)

# Display the result; `pipeline` holds the processed dataframe
pipeline
Output:
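As a cross-check, the chained .pipe() call above should be equivalent to nesting the function calls directly. The sketch below rebuilds the same dataframe and compares both forms. It also shows the `(callable, keyword)` tuple form that .pipe() accepts when the dataframe is not the function's first argument. The helper `add_suffix_to_columns` is a hypothetical name introduced only for illustration, and `rename()` is used here instead of assigning to `columns` so that the input is not modified.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Reema", "Shyam", "Jai", "Nimisha", "Rohit", "Riya"],
    "gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
    "age": [31, 32, 19, 23, 28, 33],
})

def mean_age_by_group(dataframe, col):
    # numeric_only=True skips the non-numeric 'name' column
    return dataframe.groupby(col).mean(numeric_only=True)

def uppercase_column_name(dataframe):
    # rename() returns a new dataframe instead of mutating the input
    return dataframe.rename(columns=str.upper)

# Hypothetical helper whose dataframe argument is NOT the first one
def add_suffix_to_columns(suffix, dataframe):
    return dataframe.add_suffix(suffix)

# Chained form and nested form of the same processing
piped = df.pipe(mean_age_by_group, col="gender").pipe(uppercase_column_name)
nested = uppercase_column_name(mean_age_by_group(df, col="gender"))
print(piped.equals(nested))

# The tuple tells pipe() which keyword should receive the dataframe
suffixed = piped.pipe((add_suffix_to_columns, "dataframe"), "_MEAN")
print(suffixed.columns.tolist())
```

The chained form reads top to bottom in the order the steps run, which tends to be easier to follow than nested calls once there are more than two steps.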
Now, let’s understand and create a pipeline by importing pdpipe package.
The pdpipe Python package provides a concise interface for building pandas pipelines whose stages can have preconditions. It is a pre-processing pipeline package for pandas dataframes. The pdpipe API makes it easy to break down or compose complex pandas processing pipelines in a few lines of code.
We can install this package by simply writing:
pip install pdpipe
Example 2:
Python3
# importing the package
import pdpipe as pdp
import pandas as pd

# creating an empty dataframe named dataset
dataset = pd.DataFrame()

# Creating a simple dataframe
dataset['name'] = ['Reema', 'Shyam', 'Jai', 'Nimisha', 'Rohit', 'Riya']
dataset['gender'] = ['Female', 'Male', 'Male', 'Female', 'Male', 'Female']
dataset['age'] = [31, 32, 19, 23, 28, 33]
dataset['department'] = ['Accounts', 'Management', 'IT', 'IT', 'Management', 'Advertising']
dataset['index'] = [1, 2, 3, 4, 5, 6]

# View dataframe
dataset
Output:
Removing a column from dataframe using pdpipe.
Python3
# creating a pipeline stage and
# dropping the unwanted column in one step
dropCol = pdp.ColDrop("index").apply(dataset)

# display the new dataframe
# after column drop
dropCol
Output:
There is another way to drop columns through pdpipe.
Python3
# creating a pipeline stage that
# drops the unwanted column
dropCol2 = pdp.ColDrop("index")

# applying the ColDrop stage to the dataframe
df2 = dropCol2(dataset)

# display dataframe
df2
Output:
Here, the column is dropped in two steps. In the first step, we create the ColDrop pipeline stage, and in the second step, we apply it to the dataframe by calling it like a function.
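For comparison, the same "define the step first, apply it later" pattern can be sketched with plain pandas and .pipe(), without pdpipe. This is only an illustrative analogue, not how pdpipe is implemented: `drop_column` and `drop_index` are hypothetical names, and a smaller dataframe is used for brevity.

```python
from functools import partial

import pandas as pd

dataset = pd.DataFrame({
    "name": ["Reema", "Shyam", "Jai"],
    "age": [31, 32, 19],
    "index": [1, 2, 3],
})

def drop_column(dataframe, col):
    # drop() returns a new dataframe; the original is left unchanged
    return dataframe.drop(columns=col)

# Step 1: define a reusable step with the column name fixed in advance
drop_index = partial(drop_column, col="index")

# Step 2: apply the step to a dataframe
df2 = dataset.pipe(drop_index)
print(df2.columns.tolist())
```

Because the step is an ordinary callable, it can be reused on any other dataframe that has an `index` column.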
Example 3:
Now we drop rows from the dataframe based on a column value using pdpipe.
Python3
# importing the package
import pdpipe as pdp
import pandas as pd

# creating an empty dataframe named dataset
dataset = pd.DataFrame()

# Creating a simple dataframe
dataset['name'] = ['Reema', 'Shyam', 'Jai', 'Nimisha', 'Rohit', 'Riya']
dataset['gender'] = ['Female', 'Male', 'Male', 'Female', 'Male', 'Female']
dataset['age'] = [31, 32, 19, 23, 28, 33]
dataset['department'] = ['Accounts', 'Management', 'IT', 'IT', 'Management', 'Advertising']
dataset['index'] = [1, 2, 3, 4, 5, 6]

# View dataframe
dataset
Output:
Now, drop the rows whose department value is 'IT'.
Python3
# dropping rows whose 'department' value is 'IT' using ValDrop
df3 = pdp.ValDrop(['IT'], 'department').apply(dataset)

# display dataframe
df3
Output:
The two rows whose department is 'IT' (Jai and Nimisha) are dropped.
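The same row filtering can be sketched with plain pandas and .pipe(), which is handy for checking which rows should remain. This is a minimal analogue under the assumption that "dropping values" means removing every row whose column value is in the given list; `drop_rows_with_values` is a hypothetical helper name, not part of pandas or pdpipe.

```python
import pandas as pd

dataset = pd.DataFrame({
    "name": ["Reema", "Shyam", "Jai", "Nimisha", "Rohit", "Riya"],
    "department": ["Accounts", "Management", "IT", "IT",
                   "Management", "Advertising"],
})

def drop_rows_with_values(dataframe, values, col):
    # Keep only the rows whose value in `col` is NOT in `values`
    return dataframe[~dataframe[col].isin(values)]

df3 = dataset.pipe(drop_rows_with_values, values=["IT"], col="department")
print(df3["name"].tolist())  # ['Reema', 'Shyam', 'Rohit', 'Riya']
```

Note that the filtered dataframe keeps the original row labels (0, 1, 4, 5); calling `reset_index(drop=True)` afterwards would renumber them if needed.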