Datasets can contain corrupt records: rows that do not follow the format rules that the valid records follow. For example, a corrupt record may be delimited with a pipe ("|") character while the rest of the records are delimited with a comma (","), and the file is read with the comma separator. In this article, I will demonstrate different ways to identify corrupt records and get rid of them.
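For instance, a tiny file like the one created below reproduces the situation: most rows are pipe-delimited, matching the examples later in this article, while one row uses commas and therefore cannot be parsed against the expected schema. The file name and values are made up purely for illustration.

# Hypothetical sample file: pipe-delimited rows plus one comma-delimited (corrupt) row.
sample_rows = [
    "customer_id|customer_fname|customer_lname|customer_email|customer_password|"
    "customer_street|customer_city|customer_state|customer_zipcode",
    "1|Richard|Hernandez|XXXXXXXXX|XXXXXXXXX|6303 Heather Plaza|Brownsville|TX|78521",
    "2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126",  # corrupt record
]

with open("corrupted_customer_details.csv", "w") as f:
    f.write("\n".join(sample_rows))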
PySpark Read CSV file
- PERMISSIVE
- DROPMALFORMED
- FAILFAST
PERMISSIVE
PERMISSIVE is the default mode. In this mode, NULLs are inserted for the fields that could not be parsed correctly. If you want to retain the bad records in the DataFrame, use the "columnNameOfCorruptRecord" option to capture them in a dedicated column.
Example
This PySpark code reads a CSV file, identifies corrupted records, and counts the total number of records. It sets a schema for the data, reads the CSV with specified options (including handling corrupted records), filters and displays the corrupted records, and provides the total record count.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import getpass

username = getpass.getuser()

spark = SparkSession.builder.appName(
    "Identify corrupted records").getOrCreate()

customers_schema = '''
    customer_id int,
    customer_fname string,
    customer_lname string,
    customer_email string,
    customer_password string,
    customer_street string,
    customer_city string,
    customer_state string,
    customer_zipcode int,
    _corrupt_record string
'''

customers_df = spark.read.schema(customers_schema) \
    .format("csv") \
    .options(header=True,
             delimiter="|",
             mode="PERMISSIVE",
             columnNameOfCorruptRecord="_corrupt_record") \
    .load(f"C:\\Users\\{username}\\Desktop\\PYSPARK\\source\\corrupted_customer_details.csv")

customers_df.filter(col("_corrupt_record").isNotNull()).show()

print(f"Total number of records while reading in PERMISSIVE mode : {customers_df.count()}")
In the above code, the "_corrupt_record" column is used to store the raw text of the corrupted records; one way to act on that column is shown in the sketch after the output below.
Output
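Assuming the customers_df from the example above, a common follow-up is to split the DataFrame into clean rows and rejected rows using the "_corrupt_record" column. The output path below is hypothetical; caching first avoids Spark's restriction on queries that reference only the internal corrupt-record column of a raw CSV source.

# Sketch: split the PERMISSIVE-mode DataFrame into clean and corrupt rows.
# Caching avoids the restriction on querying only the corrupt-record column.
customers_df.cache()

bad_records = customers_df.filter(col("_corrupt_record").isNotNull())
good_records = (
    customers_df.filter(col("_corrupt_record").isNull())
                .drop("_corrupt_record")   # not needed for clean rows
)

# Hypothetical path: keep the raw rejected lines for later inspection.
bad_records.select("_corrupt_record").write.mode("overwrite").text("rejected_customer_records")
good_records.show()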
DROPMALFORMED
This mode drops corrupted records while reading from a given dataset.
Example
This PySpark code reads a CSV file and drops any malformed or corrupted records. It sets a schema for the data, reads the CSV with specified options (including dropping malformed records), displays the cleaned dataset, and provides the total record count.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import getpass

username = getpass.getuser()

spark = SparkSession.builder.appName(
    "Drop corrupted records").getOrCreate()

customers_schema = '''
    customer_id int,
    customer_fname string,
    customer_lname string,
    customer_email string,
    customer_password string,
    customer_street string,
    customer_city string,
    customer_state string,
    customer_zipcode int
'''

customers_df = spark.read.schema(customers_schema) \
    .format("csv") \
    .options(header=True,
             delimiter="|",
             mode="DROPMALFORMED") \
    .load(f"C:\\Users\\{username}\\Desktop\\PYSPARK\\source\\corrupted_customer_details.csv")

customers_df.show()

print(f"Total number of records while reading in DROPMALFORMED mode : {customers_df.count()}")
In the output you will find 9 records, because the corrupt record is dropped when the data is displayed. Note, however, that 'DROPMALFORMED' does not necessarily change the total returned by customers_df.count(), since the count can be computed without fully parsing every column; a workaround is sketched after the output below.
Output
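Because Spark can answer count() without parsing every column, the dropped rows may still be included in the total. A common workaround, sketched below under that assumption, is to cache the DataFrame and materialize it with an action before counting.

# Sketch: materialize the parsed rows so the count reflects what DROPMALFORMED kept.
customers_df.cache()   # parsed result is stored on the first action
customers_df.show()    # action that forces every column to be parsed
print(f"Records kept after DROPMALFORMED: {customers_df.count()}")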
FAILFAST
This mode throws an error if malformed records are detected while reading from a given dataset.
Example
This PySpark code reads a CSV file in “FAILFAST” mode, which means it will fail and raise an exception if it encounters any malformed records that do not adhere to the specified schema. It sets a schema for the data, reads the CSV with the specified options, displays the dataset, and provides the total record count. If any malformed records are encountered, it will raise an exception and print the error message.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import getpass

username = getpass.getuser()

spark = SparkSession.builder.appName(
    "Read data in FAILFAST mode").getOrCreate()

customers_schema = '''
    customer_id int,
    customer_fname string,
    customer_lname string,
    customer_email string,
    customer_password string,
    customer_street string,
    customer_city string,
    customer_state string,
    customer_zipcode int
'''

try:
    customers_df = spark.read.schema(customers_schema) \
        .format("csv") \
        .options(header=True,
                 delimiter="|",
                 mode="FAILFAST") \
        .load(f"C:\\Users\\{username}\\Desktop\\PYSPARK\\source\\corrupted_customer_details.csv")

    customers_df.show()
    print(f"Total number of records while reading in FAILFAST mode : {customers_df.count()}")
except Exception as e:
    print(e)
Since there is one corrupt record in the dataset, an exception is raised. The advantage of FAILFAST mode is that it does not let you proceed with working on a dataset that contains corrupted records.
Output:
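FAILFAST can also serve as a simple validation gate before any further processing. The helper below is only a sketch that reuses the spark session, username, and customers_schema from the example above: the function name and source_path variable are hypothetical, and collect() is used purely to force every row to be parsed (it pulls the whole dataset to the driver, so it is only suitable for small files).

def dataset_is_clean(spark, path, schema):
    """Return True if the CSV at `path` parses fully against `schema`."""
    try:
        df = (spark.read.schema(schema)
                   .format("csv")
                   .options(header=True, delimiter="|", mode="FAILFAST")
                   .load(path))
        df.collect()  # force full parsing; fine for small files only
        return True
    except Exception:
        return False

source_path = f"C:\\Users\\{username}\\Desktop\\PYSPARK\\source\\corrupted_customer_details.csv"
if dataset_is_clean(spark, source_path, customers_schema):
    customers_df = spark.read.schema(customers_schema).format("csv") \
        .options(header=True, delimiter="|").load(source_path)
    # safe to continue with downstream transformations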