In this article, we will see how to sort a data frame by specified columns in PySpark. Both orderBy() and sort() can be used to sort a data frame in PySpark.
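orderBy() Method:
The orderBy() function sorts a data frame by one or more of its columns.
Syntax: DataFrame.orderBy(*cols, **kwargs)
Parameters :
- cols: Column names (strings), Column objects, or a list of either, to sort by
- ascending: Keyword argument giving the sort direction. Either a single Boolean, or a list of Booleans with one entry per column in cols. True (the default) means ascending and False means descending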
Return type: Returns a new DataFrame sorted by the specified columns.
Dataframe Creation: Create a new SparkSession object named spark, then create a data frame with the custom data.
# Importing necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

# Create a spark session
spark = SparkSession.builder.appName('pyspark - example join').getOrCreate()

# Define the data as a list of tuples
dataframe = [
    ("Sam", "Software Engineer", "IND", 10000),
    ("Raj", "Data Scientist", "US", 41000),
    ("Jonas", "Sales Person", "UK", 230000),
    ("Peter", "CTO", "Ireland", 50000),
    ("Hola", "Data Analyst", "Australia", 111000),
    ("Ram", "CEO", "Iran", 300000),
    ("Lekhana", "Advertising", "UK", 250000),
    ("Thanos", "Marketing", "UIND", 114000),
    ("Nick", "Data Engineer", "Ireland", 680000),
    ("Wade", "Data Engineer", "IND", 70000),
]

# Column names of dataframe
columns = ["Name", "Job", "Country", "salary"]

# Create the spark dataframe
df = spark.createDataFrame(data=dataframe, schema=columns)

# Printing the dataframe
df.show()
Example 1: Sorting the data frame by a single column
Sort the data frame in ascending order of the employees’ ‘salary’ column.
# Order the data by ascending order of salary
df.orderBy(['salary'], ascending=[True]).show()

# or
# df.orderBy(f.col("salary").asc()).show()

# or (ascending is the default)
# df.orderBy(['salary']).show()
Example 2: Sorting the data frame in descending order
# Order the data by descending order of salary
df.orderBy(['salary'], ascending=[False]).show()
Example 3: Sorting the data frame by more than one column
Sort the data frame in descending order of ‘Job’ and ascending order of ‘salary’. When two rows have the same ‘Job’, the tie is resolved by listing those rows in ascending order of ‘salary’.
# Sort the dataframe by descending order of 'Job';
# rows with the same 'Job' are ordered by
# ascending 'salary'
df.orderBy(f.col("Job").desc(), f.col("salary").asc()).show()

# or
# df.orderBy(["Job", "salary"], ascending=[False, True]).show()
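sort() Method:
The sort() function returns a new data frame sorted by the specified columns. orderBy() is an alias for sort(), so both accept the same arguments.
Syntax: DataFrame.sort(*cols, **kwargs)
Parameters :
- cols: Column names (strings), Column objects, or a list of either, to sort by
- ascending: Keyword argument. Either a single Boolean, or a list of Booleans with one entry per column in cols. True (the default) sorts in ascending order and False in descending order
The position of null values is not set through sort() itself; it is chosen per column with Column methods such as asc_nulls_last() or desc_nulls_first().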
Example 1: Sort the data frame by the ascending order of the “Name” of the employee.
# Sort the dataframe by ascending order of 'Name'
df.sort(["Name"], ascending=[True]).show()
Example 2: Sort the data frame by a column in descending order.
# Sort the dataframe by descending order of 'Name'
df.sort(["Name"], ascending=[False]).show()
Example 3: Sort multiple columns in ascending order.
# Sort the dataframe by ascending order of 'Name', then 'salary';
# the ascending list has one entry per column
df.sort(["Name", "salary"], ascending=[True, True]).show()