Saturday, November 16, 2024

Pandas – Strip whitespace from Entire DataFrame

“We can have data without information, but we cannot have information without data.” How beautiful this quote is. Data is the backbone of a data scientist's work, and according to a survey, data scientists spend approximately 60% of their time cleaning and organizing data, so it is worth becoming familiar with different techniques for organizing data well. In this article, we will learn different methods to strip extra whitespace from an entire DataFrame. The dataset used here is given below:

In the above figure, we can see that the data in the Name, Age, Blood Group, and Gender columns is irregular. Many cells contain extra whitespace, mostly at the start of the value and in some cases at the end. Our aim is to remove all the extra whitespace and organize the data systematically. We will use the following methods to remove the extra spaces from the cells:

Using Strip() function
Using Skipinitialspace
Using replace function
Using Converters

Different methods to remove extra whitespace

Strip whitespace from Entire DataFrame Using Strip() function

Pandas provides the predefined method “pandas.Series.str.strip()” to remove whitespace from strings. Using it, we can easily remove extra whitespace from the start and end of every string in a Series or Index. It takes a set of characters to remove from the head and tail of each string (leading and trailing characters). By default, this is None, and if we do not pass any characters, it removes leading and trailing whitespace. It returns a new Series or Index of objects and does not modify the original, so the result must be assigned back to the column for the change to take effect.

Syntax: pandas.Series.str.strip(to_strip = None)

Explanation: It removes the given set of characters from the head and tail of each string (leading and trailing characters).

Parameter: to_strip is None by default; if we do not pass any characters, leading and trailing whitespace is removed. The method returns a Series or Index of objects.
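As a small illustration of the to_strip parameter described above, the sketch below compares the default behaviour with a custom character set. The sample values are made up for this example and are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical sample values, padded with spaces and asterisks
blood_groups = pd.Series(['**A+**', ' B+ ', 'O-'])

# Default (to_strip=None): only leading and trailing whitespace is removed
default_stripped = blood_groups.str.strip()

# Custom set: any leading/trailing '*' or ' ' characters are removed
custom_stripped = blood_groups.str.strip('* ')

print(default_stripped.tolist())
print(custom_stripped.tolist())
```

With the default, the asterisks stay in place; passing '* ' removes both kinds of padding while leaving the inner characters untouched.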

Example : 

Python3




# importing library
import pandas as pd
 
# Creating dataframe
df = pd.DataFrame({'Names' : [' Sunny','Bunny','Ginny ',' Binny ',' Chinni','Minni'],
                    'Age' : [23,44,23,54,22,11],
                    'Blood Group' : [' A+',' B+','O+','O-',' A-','B-'],
                   'Gender' : [' M',' M','F','F','F',' F']
                  })
 
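# The dataset has a lot of extra spaces in its cells, so let's remove them using strip().
# str.strip() returns a new Series, so assign the result back to the column.
df['Names'] = df['Names'].str.strip()
df['Blood Group'] = df['Blood Group'].str.strip()
df['Gender'] = df['Gender'].str.strip()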
 
# Printing dataframe
print(df)


 Output: 

Strip whitespace from Entire DataFrame Using Skipinitialspace

skipinitialspace is not a method but one of the parameters of the read_csv() method in Pandas. With it, we can skip the spaces that follow each delimiter across the whole DataFrame while the file is being read. It is False by default; set it to True to remove the extra leading spaces. Trailing whitespace is not affected by this parameter.

Syntax : pandas.read_csv('path_of_csv_file', skipinitialspace = True)

# skipinitialspace is False by default; set it to True to use this parameter.
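To see the effect without needing a file on disk, here is a minimal sketch that reads the same text twice, once with and once without skipinitialspace. The in-memory CSV text is a hypothetical stand-in for the article's student_data.csv.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file on disk
csv_text = "Name,Age,Gender\nSunny,23, M\nGinny ,23, F\n"

# Default behaviour: spaces after the delimiter are kept
raw = pd.read_csv(io.StringIO(csv_text))

# skipinitialspace=True: spaces after the delimiter are skipped
cleaned = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

print(raw['Gender'].tolist())      # leading spaces kept
print(cleaned['Gender'].tolist())  # leading spaces skipped
print(cleaned['Name'].tolist())    # the trailing space in 'Ginny ' remains
```

Because only the spaces after a delimiter are skipped, values with trailing whitespace still need one of the other methods.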

Example : 

Python3




# importing library
import pandas as pd
 
# reading the csv file and, at the same time, using the skipinitialspace parameter, which removes the extra leading spaces
df = pd.read_csv('\\student_data.csv', skipinitialspace = True)
 
# printing dataset
print(df)


 
 Output: 

Strip whitespace from Entire DataFrame Using replace function

We can also remove extra whitespace from the DataFrame using the replace() function. Pandas provides the predefined method “pandas.Series.str.replace()” for this. The program is the same as the one for strip(); the only difference is that we use replace() in place of strip(). Note that replacing ' ' with '' removes every space in the string, including spaces between words, not only the leading and trailing ones.

Syntax : pandas.Series.str.replace(' ', '')
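The difference between removing every space and removing only the outer ones can be sketched as follows. The sample names and the anchored regular expression are assumptions added for illustration; the article itself only uses the plain ' ' pattern.

```python
import pandas as pd

# Hypothetical names, one of which contains an inner space
names = pd.Series([' Sunny Singh ', 'Bunny '])

# Literal replacement: every space is removed, including the inner one
all_spaces_removed = names.str.replace(' ', '', regex=False)

# Anchored regex: only whitespace at the start or end of each string is removed
edges_only = names.str.replace(r'^\s+|\s+$', '', regex=True)

print(all_spaces_removed.tolist())
print(edges_only.tolist())
```

If the values can contain spaces that should be kept, the anchored pattern (or str.strip()) is the safer choice.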

Example : 

Python3




# importing library
import pandas as pd
 
# Creating dataframe
df = pd.DataFrame({'Name' : [' Sunny','Bunny','Ginny ',' Binny ',' Chinni','Minni'],
                    'Age' : [23,44,23,54,22,11],
                    'Blood Group' : [' A+',' B+','O+','O-',' A-','B-'],
                   'Gender' : [' M',' M','F','F','F',' F']
                  })
 
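# The dataset has a lot of extra spaces in its cells, so let's remove them using replace().
# str.replace() returns a new Series, so assign the result back to the column.
df['Name'] = df['Name'].str.replace(' ', '')
df['Blood Group'] = df['Blood Group'].str.replace(' ', '')
df['Gender'] = df['Gender'].str.replace(' ', '')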
 
# Printing dataframe
print(df)


Output: 

Strip whitespace from Entire DataFrame Using Converters

Like skipinitialspace, converters is one of the parameters of the predefined Pandas method “read_csv”. It is used to apply functions to particular columns. We pass the functions in a dictionary. Here we pass the str.strip function itself (without calling it), which removes the extra whitespace while the CSV file is being read.

Syntax : pd.read_csv('path_of_file', converters={'column_name': function_name})

# Pass a dict of column names and functions, where the column names act as unique keys and the functions as values.
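A self-contained sketch of the converters approach, using an in-memory CSV so it can be run without a file. The CSV text and the variable name students are hypothetical and only stand in for the article's student_data.csv.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV with leading and trailing spaces in some cells
csv_text = "Name,Age,Gender\nSunny ,23, M\nGinny ,23,F \n"

# Each converter receives the raw cell text as a str and returns the cleaned value.
# Note that str.strip is passed as a function object, not called.
students = pd.read_csv(
    io.StringIO(csv_text),
    converters={'Name': str.strip, 'Gender': str.strip},
)

print(students)
```

Unlike skipinitialspace, a converter built on str.strip removes both leading and trailing whitespace, but it has to be listed for every column it should apply to.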

Example : 

Python3




# importing library
import pandas as pd
 
# reading the csv file and, at the same time, using the converters parameter, which removes the extra whitespace
# str.strip is passed without parentheses: read_csv calls it on each cell
df = pd.read_csv('\\student_data.csv', converters={'Name': str.strip,
                                                'Blood Group' : str.strip,
                                                'Gender' : str.strip } )
 
# printing dataset
print(df)


Output: 

Removing Extra Whitespace from Whole DataFrame by Creating a Function

Python3




# Importing required libraries
import pandas as pd
 
# Creating a DataFrame with 4 columns whose
# data is irregular.
df = pd.DataFrame({'Names': [' Sunny', 'Bunny', 'Ginny ',
                             ' Binny ', ' Chinni', 'Minni'],
                    
                   'Age': [23, 44, 23, 54, 22, 11],
                    
                   'Blood_Group': [' A+', ' B+', 'O+', 'O-',
                                   ' A-', 'B-'],
                    
                   'Gender': [' M', ' M', 'F', 'F', 'F', ' F']
                   })
 
 
# Creating a function which removes extra leading
# and trailing whitespace from the data.
# Pass the dataframe as a parameter here.
def whitespace_remover(dataframe):
   
    # iterating over the columns
    for i in dataframe.columns:
         
        # checking the datatype of each column
        if dataframe[i].dtype == 'object':
             
            # applying the strip function to the column
            dataframe[i] = dataframe[i].map(str.strip)
        else:
             
            # if the condition is False, do nothing.
            pass
 
# applying whitespace_remover function on dataframe
whitespace_remover(df)
 
# printing dataframe
print(df)


In the above code snippet, we first import the required library; pandas is used to read, write, and perform many other operations on data. We then create a DataFrame with 4 columns: ‘Names’, ‘Age’, ‘Blood_Group’ and ‘Gender’. Almost all columns contain irregular data. The major part begins here: we create a function that removes extra leading and trailing whitespace from the data. The function takes a DataFrame as a parameter and checks the datatype of each column; if the datatype is ‘object’, it applies Python's built-in str.strip function to every value in that column through map(), otherwise it does nothing. Finally, we call whitespace_remover() on the DataFrame, which modifies it in place and removes the extra whitespace from the columns.

Output: 
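The function above calls str.strip on every value of an object column, so it raises an error if such a column contains a missing value. A hedged variant that skips non-string cells and returns a copy instead of modifying the input might look like the sketch below; the function name strip_text_columns and the sample data are hypothetical and not part of the original article.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def strip_text_columns(dataframe):
    """Return a copy with leading/trailing whitespace removed from string cells."""
    result = dataframe.copy()
    for column in result.columns:
        # Only touch text-like columns; numeric columns are left alone
        if is_object_dtype(result[column]) or is_string_dtype(result[column]):
            # Strip real strings only, so missing values pass through unchanged
            result[column] = result[column].map(
                lambda value: value.strip() if isinstance(value, str) else value
            )
    return result


# Hypothetical sample with a missing name
sample = pd.DataFrame({'Names': [' Sunny', 'Ginny ', None],
                       'Age': [23, 44, 23]})
stripped = strip_text_columns(sample)
print(stripped)
```

Returning a copy keeps the original DataFrame available for comparison; assigning the result back to the same variable gives the same effect as the in-place version.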
