In this article, we are going to see how to order a PySpark DataFrame by multiple columns in Python.
Create the dataframe for demonstration:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

dataframe.show()
Output:
Ordering by multiple columns means sorting the DataFrame on those columns together, in ascending or descending order. We can do this using the following methods.
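Conceptually, a multi-column sort works like Python's tuple comparison: rows are compared on the first column, and later columns only break ties. A minimal plain-Python sketch of the same idea (hypothetical data, no Spark required):

```python
# Multi-column ordering behaves like sorting by a tuple key:
# the first field dominates, later fields only break ties.
rows = [("sravan", "1"), ("bobby", "5"), ("ojaswi", "2")]

# Sort by (name, id) ascending -- the same effect as
# dataframe.orderBy(['NAME', 'ID'], ascending=True)
ordered = sorted(rows, key=lambda r: (r[0], r[1]))
print(ordered)  # bobby first, then ojaswi, then sravan
```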
Method 1: Using orderBy()
This method returns the DataFrame ordered by the given columns. It sorts on the first column listed, then breaks ties using each subsequent column.
Syntax:
- Ascending order: dataframe.orderBy(['column1', 'column2', …, 'column n'], ascending=True).show()
- Descending order: dataframe.orderBy(['column1', 'column2', …, 'column n'], ascending=False).show()
where:
- dataframe is the input PySpark DataFrame
- ascending=True sorts the dataframe in ascending order
- ascending=False sorts the dataframe in descending order
Example 1: Sort the PySpark dataframe in ascending order with orderBy().
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# orderBy dataframe in ascending order
dataframe.orderBy(['NAME', 'ID', 'Company'], ascending=True).show()
Output:
Example 2: Sort the PySpark dataframe in descending order with orderBy().
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# orderBy dataframe in descending order
dataframe.orderBy(['NAME', 'ID', 'Company'], ascending=False).show()
Output:
Method 2: Using sort()
This method returns the DataFrame ordered by the given columns, in the same way as orderBy(): it sorts on the first column listed, then breaks ties using each subsequent column.
Syntax:
- Ascending order: dataframe.sort(['column1', 'column2', …, 'column n'], ascending=True).show()
- Descending order: dataframe.sort(['column1', 'column2', …, 'column n'], ascending=False).show()
where:
- dataframe is the input PySpark DataFrame
- ascending=True sorts the dataframe in ascending order
- ascending=False sorts the dataframe in descending order
Example 1: Sort PySpark dataframe in ascending order
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# sort dataframe in ascending order
dataframe.sort(['NAME', 'ID', 'Company'], ascending=True).show()
Output:
Example 2: Sort the PySpark dataframe in descending order
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["5", "bobby", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# sort dataframe in descending order
dataframe.sort(['NAME', 'ID', 'Company'], ascending=False).show()
Output: