Removing Blank Strings from a PySpark Dataframe

25 July 2024

0

Cleaning and preprocessing data is a crucial step before it can be used for analysis or modeling. One of the common tasks in data preparation is removing empty strings from a Spark dataframe. A Spark dataframe is a distributed collection of data that is organized into rows and columns. It can be processed using parallel and distributed algorithms, making it an efficient and powerful tool for large-scale data processing and analysis. They are a fundamental part of the Apache Spark ecosystem and are widely used in big data processing and analytics. Removing empty strings ensures the data is accurate, consistent, and ready to be used for downstream tasks.

Procedure to Remove Blank Strings from a Spark Dataframe using Python

To remove blank strings from a Spark DataFrame, follow these steps:

To load data into a Spark dataframe, one can use the spark.read.csv() method or create an RDD and then convert it to a dataframe using the toDF() method.
Once the data is loaded, the next step is to identify the columns that have empty strings by using the df.columns attribute and the df.select() method.
Then, use the df.filter() method to remove rows that have empty strings in the relevant columns. For example, df.filter(df.Name != ”) can be used to filter out rows that have empty strings in the “Name” column.
Finally, use the df.show() method to view the resulting dataframe and confirm that it does not have any empty strings.

Example 1.

Creating dataframe for demonestration.

Python3

# import the necessary libraries 
from pyspark.sql import *
from pyspark.sql.functions import *
  
# create a SparkSession 
spark = SparkSession.builder.appName('my_app').getOrCreate() 
  
# create the dataframe 
df = spark.createDataFrame([ 
    ('John', 23, 'Male'), 
    ('', 25, 'Female'), 
    ('Jane', 28, 'Female'), 
    ('', 30, 'Male') 
], ['Name', 'Age', 'Gender']) 
  
# examine the database 
df.show()

Output:

To remove rows that contain blank strings in the “Name” column, you can use the following code:

Python3

# Filter out the blank rows 
# from 'Name' column of df 
df = df.filter(df.Name != '') 
  
# Examine df 
df.show()

Output:

Example 2.

Creating dataframe for demonestration.

Python3

# import the necessary libraries 
from pyspark.sql import *
from pyspark.sql.functions import *
  
# create a SparkSession 
spark = SparkSession.builder.appName('my_app').getOrCreate() 
  
# create the dataframe 
df = spark.createDataFrame([ 
    ('John', 23, 'Male', '123 Main St.'), 
    ('', 25, 'Female', '456 Market St.'), 
    ('Jane', 28, 'Female', '789 Park Ave.'), 
    ('', 30, 'Male', '') 
], ['Name', 'Age', 'Gender', 'Address']) 
  
# examine the dataframe 
df.show()

Output:

To remove rows that contain blank strings in the “Name” and “Address” column, you can use the following code:

Python3

# filter out rows with blank strings 
# in the "Name" and "Address" columns 
df = df.filter((df.Name != '') & (df.Address != '')) 
  
# examine the dataframe 
df.show() 

Output:

Example 3.

Creating dataframe for demonestration.

Python3

# import the necessary libraries 
from functools import reduce
from pyspark.sql import *
from pyspark.sql.functions import *
  
# create a SparkSession 
spark = SparkSession.builder.appName('my_app').getOrCreate() 
  
# create the dataframe 
df = spark.createDataFrame([ 
    ('John', 23, 'Male', '123 Main St.', '555-1234'), 
    ('', 25, 'Female', '456 Market St.', ''), 
    ('Jane', 28, 'Female', '789 Park Ave.', '555-9876'), 
    ('', 30, 'Male', '', '555-4321') 
], ['Name', 'Age', 'Gender', 'Address', 'Phone']) 
  
# examine the dataframe 
df.show()

Output:

All the rows with empty strings may be filtered out as follows:

Python3

# filter out rows with blank strings in all the columns 
df = df.filter(reduce(lambda x, y: x & y,  
                      [col(c) != '' for c in df.columns])) 
  
# examine the dataframe 
df.show() 
  
# examine the dataframe 
df.show() 

Output:

In conclusion, it is often necessary to remove rows or columns that contain blank or empty strings from a Spark dataframe. This can be done using the df.filter() method, as illustrated in the article.

Removing Blank Strings from a PySpark Dataframe

Procedure to Remove Blank Strings from a Spark Dataframe using Python

Example 1.

Python3

Python3

Example 2.

Python3

Python3

Example 3.

Python3

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

Is Microsoft Teams Secure? Use Teams Safely in 2024 by Tyler Cross

Interview With Willem Dewulf – CEO of ProBackup by Shauli Zacks

Recent Comments

EDITOR PICKS

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

Is Microsoft Teams Secure? Use Teams Safely in 2024 by Tyler Cross

POPULAR POSTS

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

Is Microsoft Teams Secure? Use Teams Safely in 2024 by Tyler Cross

POPULAR CATEGORY

ABOUT US

FOLLOW US