PySpark partitionBy() is used to partition a DataFrame based on column values while writing it to disk/file system. When you write a DataFrame to disk using partitionBy(), PySpark splits the records based on the partition column and stores each partition's data in its own sub-directory.
A PySpark partition is a way to split a large dataset into smaller datasets based on one or more partition keys. You can also create a partition on multiple columns using partitionBy(); just pass the columns you want to partition by as arguments to this method.
Syntax: partitionBy(self, *cols)
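In practice, partitionBy() is called on the DataFrameWriter returned by df.write, as part of a write chain. A minimal sketch (the column names and output path here are placeholders):

df.write.partitionBy("col1", "col2") \
    .mode("overwrite") \
    .csv("/output/path")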
Let's create a DataFrame by reading a CSV file. You can find the dataset at this link: Cricket_data_set_odi.csv
Create a DataFrame for demonstration:
Python3
# importing module
from pyspark.sql import SparkSession

# creating a SparkSession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# create DataFrame by reading the CSV file
df = spark.read.option("header", True).csv("Cricket_data_set_odi.csv")

# display the schema
df.printSchema()
Output:
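The exact schema depends on the CSV file, and since we read it with header=True but no schema inference, every column is typed as a string. For this dataset the schema includes the Team and Speciality columns used in the examples below, along the lines of:

root
 |-- Team: string (nullable = true)
 |-- Speciality: string (nullable = true)
 ...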
PySpark partitionBy() with One Column:
From the above DataFrame, we will use Team as the partition key for the example below:
Python3
# partitionBy() with a single column
df.write.option("header", True) \
    .partitionBy("Team") \
    .mode("overwrite") \
    .csv("Team")

# Our DataFrame has a total of 9 different teams, so
# partitionBy() creates 9 sub-directories. Each
# sub-directory is named after the partition column
# and its value (partition column=value).

# In a shell, change into the output directory and
# list the partition folders:
#   cd Team
#   ls
Output:
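Since each partition value becomes its own sub-directory, you can read a single partition back by pointing the reader at that folder. A minimal sketch, assuming Team=Ind is one of the partition values; note that partitionBy() does not write the partition column into the data files themselves, so Team only reappears as a column when Spark rediscovers it from the directory names:

# read back a single partition; "Ind" is an assumed value
df_ind = spark.read.option("header", True) \
    .csv("Team/Team=Ind")

# reading the root directory instead lets Spark recover
# the Team column from the sub-directory names
df_all = spark.read.option("header", True).csv("Team")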
PySpark partitionBy() with Multiple Columns:
You can also create partitions on multiple columns using PySpark partitionBy(); just pass the columns you want to partition by as arguments to this method.
From the above DataFrame, we are using Team and Speciality as the partition keys for the example below.
Python3
# partitionBy() with multiple columns
df.write.option("header", True) \
    .partitionBy("Team", "Speciality") \
    .mode("overwrite") \
    .csv("Team-Speciality")

# This creates one sub-directory per Team value, each
# containing one sub-directory per Speciality value.

# In a shell, drill down into one partition:
#   cd Team-Speciality
#   cd Team=Ind
#   ls
Output:
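One benefit of partitioning on the columns you filter by is partition pruning: when you read the output back and filter on the partition columns, Spark only scans the matching sub-directories. A sketch, where "Ind" and "Batsman" are assumed partition values for illustration:

# Spark discovers Team and Speciality as columns
# from the directory names under the root path
df2 = spark.read.option("header", True).csv("Team-Speciality")

# this filter only reads the matching sub-directories
df2.filter((df2.Team == "Ind") & (df2.Speciality == "Batsman")).show()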
Control Number of Records per Partition File:
Use the option maxRecordsPerFile if you want to control the number of records written to each file within a partition. This is especially helpful when your data is skewed (some partitions have very few records while others have a very high number of records).
Python3
# partitionBy() with maxRecordsPerFile to control the
# number of records per output file
df.write.option("header", True) \
    .option("maxRecordsPerFile", 2) \
    .partitionBy("Team") \
    .mode("overwrite") \
    .csv("Team")

# In a shell, list the files in the output directory:
#   cd Team
#   ls
Output:
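To verify the effect, count the part files inside one partition's directory. A minimal sketch in plain Python, again assuming Team=Ind is one of the partition values:

import os

# count the part files written for one team;
# "Team=Ind" is an assumed partition directory
part_files = [f for f in os.listdir("Team/Team=Ind")
              if f.startswith("part-")]

# with maxRecordsPerFile set to 2, a team with N records
# is written as ceil(N / 2) part files
print(len(part_files))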