Sunday, November 17, 2024

How to sort by value in PySpark?

In this article, we are going to sort the rows of a PySpark RDD by the value of a chosen column, using the sortBy() and takeOrdered() methods.

Creating an RDD for demonstration:

Python3




# importing module
from pyspark.sql import SparkSession, Row
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# create 3 Rows with 3 columns each;
# the list brackets are needed so that all three Rows end up in data
data = [Row(First_name="Sravan", Last_name="Kumar", age=23),
        Row(First_name="Ojaswi", Last_name="Pinkey", age=16),
        Row(First_name="Rohith", Last_name="Devi", age=7)]
  
# create an RDD from the list of Rows
rdd = spark.sparkContext.parallelize(data)
  
# display data
rdd.collect()


Output:

[Row(First_name='Sravan', Last_name='Kumar', age=23),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7)]

Method 1: Using sortBy()

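sortBy() is a method available on an RDD. It returns a new RDD whose elements are sorted by the key that a function computes for each element.

Syntax: rdd.sortBy(keyfunc, ascending=True, numPartitions=None)

keyfunc is usually a lambda expression that selects the column to sort on. The sort is ascending by default; pass ascending=False to reverse it.

lambda expression: lambda x: x[column_index]

column_index is zero-based, so x[0] is the first column and x[2] is the third.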

Example 1: Sort the data by the first column (First_name, index 0)

Python3




# sort the data by the first column (index 0, First_name)
rdd.sortBy(lambda x: x[0]).collect()


Output:

[Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Sravan', Last_name='Kumar', age=23)]

Example 2: Sort the data by the third column (age, index 2)

Python3




# sort the data by the third column (index 2, age)
rdd.sortBy(lambda x: x[2]).collect()


Output:

[Row(First_name='Rohith', Last_name='Devi', age=7),
Row(First_name='Ojaswi', Last_name='Pinkey', age=16),
Row(First_name='Sravan', Last_name='Kumar', age=23)]
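
The key function passed to sortBy() follows the same contract as the key argument of Python's built-in sorted(), so the ordering can be checked locally before running it on a cluster. The following is a minimal sketch under that assumption. Person is a hypothetical namedtuple standing in for Row (both are tuples that allow access by index and by field name), and sorted() stands in for sortBy().

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row so the sketch runs without Spark.
Person = namedtuple("Person", ["First_name", "Last_name", "age"])

people = [
    Person("Sravan", "Kumar", 23),
    Person("Ojaswi", "Pinkey", 16),
    Person("Rohith", "Devi", 7),
]

# Same key as rdd.sortBy(lambda x: x[2]): position 2 is the age field.
by_index = sorted(people, key=lambda x: x[2])

# Accessing the field by name gives the same order and is easier to read.
by_name = sorted(people, key=lambda x: x.age)

# Descending order; the sortBy() counterpart is ascending=False.
oldest_first = sorted(people, key=lambda x: x.age, reverse=True)

print([p.First_name for p in by_index])      # ['Rohith', 'Ojaswi', 'Sravan']
print([p.First_name for p in oldest_first])  # ['Sravan', 'Ojaswi', 'Rohith']
```

On the RDD from this article, the corresponding calls would be rdd.sortBy(lambda x: x.age) and rdd.sortBy(lambda x: x.age, ascending=False).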

Method 2: Using takeOrdered()

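takeOrdered() is a method available on an RDD. It returns the first n elements in ascending order of the key computed from a particular column. Unlike sortBy(), it returns a Python list to the driver rather than a new RDD.

Syntax: rdd.takeOrdered(n, lambda expression)

where n is the number of rows to return after sorting, and the lambda expression selects the column to sort on.

Sort values based on a particular column using the takeOrdered() function: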

Python3




# sort values based on
# column 1 using takeOrdered function
print(rdd.takeOrdered(3,lambda x: x[0]))
  
# sort values based on
# column 3 using takeOrdered function
print(rdd.takeOrdered(3,lambda x: x[2]))


Output:

[Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Sravan', Last_name='Kumar', age=23)]

[Row(First_name='Rohith', Last_name='Devi', age=7), Row(First_name='Ojaswi', Last_name='Pinkey', age=16), Row(First_name='Sravan', Last_name='Kumar', age=23)]
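
When n is smaller than the number of rows, takeOrdered() only has to keep track of the n smallest elements instead of sorting everything. The sketch below illustrates that idea in plain Python and is not PySpark's own code. The take_ordered helper, the Person namedtuple, and the split of the rows into two partitions are all assumptions made for illustration.

```python
import heapq
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.Row so the sketch runs without Spark.
Person = namedtuple("Person", ["First_name", "Last_name", "age"])

# Assumed layout: the three demo rows spread over two partitions.
partitions = [
    [Person("Sravan", "Kumar", 23), Person("Ojaswi", "Pinkey", 16)],
    [Person("Rohith", "Devi", 7)],
]


def take_ordered(partitions, n, key=None):
    """Return the n smallest elements by key, mimicking rdd.takeOrdered(n, key)."""
    # Step 1: each partition keeps only its own n smallest elements.
    local = [heapq.nsmallest(n, part, key=key) for part in partitions]
    # Step 2: merge the small per-partition results and keep the n smallest overall.
    merged = [row for part in local for row in part]
    return heapq.nsmallest(n, merged, key=key)


# Two youngest people (ascending by age).
youngest_two = take_ordered(partitions, 2, key=lambda x: x.age)

# Two oldest people: negating a numeric key reverses the order.
oldest_two = take_ordered(partitions, 2, key=lambda x: -x.age)

print([p.First_name for p in youngest_two])  # ['Rohith', 'Ojaswi']
print([p.First_name for p in oldest_two])    # ['Sravan', 'Ojaswi']
```

On the RDD from this article, rdd.takeOrdered(2, key=lambda x: -x.age) would likewise return the two oldest rows. Because the result is collected on the driver, n should stay small enough to fit in driver memory.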
