In this article, we will discuss how to get a specific row from a PySpark DataFrame.
Creating Dataframe for demonstration:
Python3
# importing module
import pyspark

# importing sparksession
# from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession
# and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 5 row values
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display dataframe
dataframe.show()
Output:
Method 1: Using collect()
collect() returns all rows of the dataframe as a list of Row objects, so a specific row can be picked by indexing into that list.
Syntax: dataframe.collect()[index_position]
Where,
- dataframe is the pyspark dataframe
- index_position is the zero-based index of the row to retrieve (negative values count from the end)
Example: Python code to access rows
Python3
# get first row
print(dataframe.collect()[0])

# get second row
print(dataframe.collect()[1])

# get last row
print(dataframe.collect()[-1])

# get third row
print(dataframe.collect()[2])
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')
Method 2: Using show()
This function prints the top n rows of the pyspark dataframe to the console as a formatted table. It does not return the rows.
Syntax: dataframe.show(no_of_rows)
where, no_of_rows is the number of rows to display (20 if omitted)
Example: Python code to get the data using show() function
Python3
# display only the top 2 rows
# (show() prints the table itself and returns None,
# so it is not wrapped in print())
dataframe.show(2)

# display only the top 1 row
dataframe.show(1)

# display dataframe
dataframe.show()
Output:
Method 3: Using first()
This function is used to return only the first row in the dataframe.
Syntax: dataframe.first()
Example: Python code to select the first row in the dataframe.
Python3
# display first row of the dataframe
print(dataframe.first())
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Method 4: Using head()
This method returns the top n rows of the dataframe as a list of Row objects.
Syntax: dataframe.head(n)
where, n is the number of rows to be returned
Example: Python code to get the top n rows.
Python3
# display only 1 row
print(dataframe.head(1))

# display only top 3 rows
print(dataframe.head(3))

# display only top 2 rows
print(dataframe.head(2))
Output:
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]
Method 5: Using tail()
This method returns the last n rows of the dataframe as a list of Row objects. It is available in Spark 3.0 and later.
Syntax: dataframe.tail(n)
where, n is the number of rows to be returned, counted from the end of the dataframe.
Example: Python code to get last n rows
Python3
# display only the last row
print(dataframe.tail(1))

# display only the last 3 rows
print(dataframe.tail(3))

# display only the last 2 rows
print(dataframe.tail(2))
Output:
[Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
[Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
[Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
Method 6: Using select() with collect() method
select() chooses which columns to keep. Combined with collect(), it returns a particular row restricted to those columns.
Syntax: dataframe.select([columns]).collect()[index]
where,
- dataframe is the pyspark dataframe
- columns is the list of columns to be included in each row
- index is the zero-based index of the row to be returned
Example: Python code to select the particular row.
Python3
# select first row
print(dataframe.select(['Employee ID', 'Employee NAME', 'Company Name']).collect()[0])

# select third row
print(dataframe.select(['Employee ID', 'Employee NAME', 'Company Name']).collect()[2])

# select fourth row
print(dataframe.select(['Employee ID', 'Employee NAME', 'Company Name']).collect()[3])
Output:
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')
Method 7: Using take() method
This method also returns the top n rows of the dataframe as a list of Row objects.
Syntax: dataframe.take(n)
where, n is the number of rows to be selected
Python3
# select top 2 rows
print(dataframe.take(2))

# select top 4 rows
print(dataframe.take(4))

# select top 1 row
print(dataframe.take(1))
Output:
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')]
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]