Have you ever needed to show the partitions of a PySpark RDD for a DataFrame you loaded, or to partition the data and check whether it was partitioned correctly? Not sure how to achieve this? You can do it with the getNumPartitions function of a PySpark RDD. Want to know more? Read on, where we will discuss exactly that.
Show partitions on a PySpark RDD in Python
PySpark: an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing, built as the Python API for Apache Spark. The module can be installed with the following command:
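pip install pyspark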
To get the number of partitions of a PySpark DataFrame, you first need to convert the DataFrame to an RDD. To show the partitions on the PySpark RDD, use:
data_frame_rdd.getNumPartitions()
First of all, import the required class, i.e. SparkSession from pyspark.sql; SparkSession is used to create the session. Now, create a Spark session using the getOrCreate function. Then, read the CSV file and display it to check that it was loaded correctly. Next, convert the DataFrame to an RDD. Finally, get the number of partitions using the getNumPartitions function.
Example 1:
In this example, we read the CSV file (link) and show the partitions on the PySpark RDD using the getNumPartitions function.
Python3
# Python program to show partitions on a PySpark RDD

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark.read.csv('california_housing_train.csv',
                            sep=',',
                            inferSchema=True,
                            header=True)

# Display the CSV file that was read
data_frame.show()

# Convert the DataFrame to an RDD
data_frame_rdd = data_frame.rdd

# Show partitions on the PySpark RDD using
# the getNumPartitions function
print(data_frame_rdd.getNumPartitions())
Output:
Example 2:
In this example, we read the CSV file (link) and show the partitions on the PySpark RDD using the getNumPartitions function. We then repartition the data and again show the partitions on the PySpark RDD of the newly partitioned data.
Python3
# Python program to show partitions on a PySpark RDD

# Import the SparkSession class
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame_1 = spark.read.csv('california_housing_train.csv',
                              sep=',',
                              inferSchema=True,
                              header=True)

# Display the CSV file that was read
data_frame_1.show()

# Convert the DataFrame to an RDD
data_frame_rdd_1 = data_frame_1.rdd

# Show partitions on the PySpark RDD
# using the getNumPartitions function
print(data_frame_rdd_1.getNumPartitions())

# Select the longitude, latitude, housing_median_age, and
# total_rooms columns and repartition the data into 4 partitions
data_frame_2 = data_frame_1.select(data_frame_1.longitude,
                                   data_frame_1.latitude,
                                   data_frame_1.housing_median_age,
                                   data_frame_1.total_rooms).repartition(4)

# Convert the DataFrame to an RDD
data_frame_rdd_2 = data_frame_2.rdd

# Show partitions on the PySpark RDD using the getNumPartitions function
print(data_frame_rdd_2.getNumPartitions())
Output:
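The examples above repartition into a fixed number of partitions. If you also want to partition by a column and check how the rows were actually spread across partitions, you can combine repartition with glom, which collects the elements of each partition into a list. Below is a minimal sketch of this idea; the file california_housing_train.csv and the housing_median_age column are assumed from the examples above.
Python3
# Repartition by a column and verify the row distribution.
# Sketch only: the CSV file and the housing_median_age column
# are assumed from the earlier examples.
from pyspark.sql import SparkSession

# Create a Spark session using the getOrCreate() function
spark = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark.read.csv('california_housing_train.csv',
                            sep=',',
                            inferSchema=True,
                            header=True)

# repartition(numPartitions, *cols) hash-partitions rows by the
# given column(s), so equal column values land in the same partition
data_frame_3 = data_frame.repartition(4, 'housing_median_age')

# glom() turns each partition into a list of its rows; mapping
# len over it gives the number of rows in every partition
rows_per_partition = data_frame_3.rdd.glom().map(len).collect()
print(rows_per_partition)
Each entry in the printed list is the row count of one partition, which lets you confirm that the data really was distributed across the four partitions.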