In this article, we are going to sort the rows of a PySpark RDD by the values in a chosen column.
Creating RDD for demonstration:
Python3
# importing module
from pyspark.sql import SparkSession, Row

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# create 3 Rows with 3 columns
data = [
    Row(First_name="Sravan", Last_name="Kumar", age=23),
    Row(First_name="Ojaswi", Last_name="Pinkey", age=16),
    Row(First_name="Rohith", Last_name="Devi", age=7),
]

# create an RDD from the rows
rdd = spark.sparkContext.parallelize(data)

# display data
print(rdd.collect())
Output:
[Row(First_name='Sravan', Last_name='Kumar', age=23), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7)]
Method 1: Using sortBy()
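sortBy() is an RDD method that returns a new RDD whose elements are sorted by the key that a function computes for each element.
Syntax: rdd.sortBy(keyfunc, ascending=True, numPartitions=None)
Here keyfunc is usually a lambda expression that picks the column to sort on. The sort is ascending unless ascending=False is passed.
lambda expression: lambda x: x[column_index], where column_index is zero-based (0 is the first column)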
Example 1: Sort the data by the first column (First_name, index 0)
Python3
# sort the data by the values in the first column (index 0)
rdd.sortBy(lambda x: x[0]).collect()
Output:
[Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Sravan', Last_name='Kumar', age=23)]
Example 2: Sort the data by the third column (age, index 2)
Python3
# sort the data by the values in the third column (index 2)
rdd.sortBy(lambda x: x[2]).collect()
Output:
[Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Sravan', Last_name='Kumar', age=23)]
Method 2: Using takeOrdered()
takeOrdered() is an RDD method that returns the first n elements in ascending order, or in the order defined by an optional key function. Unlike sortBy(), it returns a Python list to the driver rather than a new RDD.
Syntax: rdd.takeOrdered(n, key=lambda expression)
where n is the number of rows to return after sorting
Example: Sort values based on a particular column using takeOrdered()
Python3
# sort values based on
# column 1 using takeOrdered function
print(rdd.takeOrdered(3, lambda x: x[0]))

# sort values based on
# column 3 using takeOrdered function
print(rdd.takeOrdered(3, lambda x: x[2]))
Output:
[Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Sravan', Last_name='Kumar', age=23)]
[Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Sravan', Last_name='Kumar', age=23)]