In this article, we are going to extract the first N rows and the last N rows from a dataframe using PySpark in Python. To do this, we first create a sample dataframe.
We have to create a Spark object with the help of SparkSession and give the app name by using the getOrCreate() method.
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
Finally, after building the data list and the column list, we pass both to the createDataFrame() method:
dataframe = spark.createDataFrame(data, columns)
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 5 row values
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print('Actual data in dataframe')
dataframe.show()
Output:
Extracting first N rows
We can extract the first N rows by using several methods which are discussed below with the help of some examples:
Method 1: Using head()
This function is used to extract the top N rows from the given dataframe.
Syntax: dataframe.head(n)
where,
- n specifies the number of rows to extract from the top
- dataframe is the dataframe name created from the nested lists using pyspark.
Python3
print ( "Top 2 rows " ) # extract top 2 rows a = dataframe.head( 2 ) print (a) print ( "Top 1 row " ) # extract top 1 row a = dataframe.head( 1 ) print (a) |
Output:
Top 2 rows
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]
Top 1 row
[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]
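Note that head(n) returns a list of Row objects rather than a dataframe, so individual column values can be read from each Row by name. The following is a minimal sketch of that idea, reusing the dataframe created above:
Python3
# head(2) returns a list of Row objects
rows = dataframe.head(2)

# read column values from each Row by name
for row in rows:
    print(row['Employee NAME'], row['Company Name'])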
Method 2: Using first()
This function is used to extract only the first row of the dataframe.
Syntax: dataframe.first()
- It doesn't take any parameters
- dataframe is the dataframe name created from the nested lists using pyspark
Python3
print ( "Top row " ) # extract top row a = dataframe.first() print (a) |
Output:
Top row
Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')
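Since first() returns a single Row object, its column values can be accessed directly. A small sketch, again using the dataframe created above:
Python3
# first() returns one Row object
row = dataframe.first()

# read individual column values by name
print(row['Employee ID'])
print(row['Employee NAME'])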
Method 3: Using show()
This function is used to display the rows of the dataframe, starting from the top.
Syntax: dataframe.show(n)
where,
- dataframe is the input dataframe
- n is the number of rows to be displayed from the top; if n is not specified, show() displays the first 20 rows by default
Python3
# show() function to get 2 rows
dataframe.show(2)
Output:
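Note that show(n) only prints the rows; it does not return them. If the first N rows are needed as a new dataframe for further processing, the limit() method (not used elsewhere in this article) can be combined with show(), as in this small sketch:
Python3
# limit(2) returns a new dataframe with the first 2 rows
top_two = dataframe.limit(2)

# display the smaller dataframe
top_two.show()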
Extracting Last N rows
Extracting the last rows means getting the last N rows from the given dataframe. For this, we use the tail() function.
Syntax: dataframe.tail(n)
where,
- n is the number of rows to extract from the end
- dataframe is the input dataframe
Example:
Python3
print ( "Last 2 rows " ) # extract last 2 rows a = dataframe.tail( 2 ) print (a) print ( "Last 1 row " ) # extract last 1 row a = dataframe.tail( 1 ) print (a) |
Output:
Last 2 rows
[Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
Last 1 row
[Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
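Like head(n), tail(n) returns a list of Row objects, and tail() moves those rows to the driver, so it is best used with small values of n. If the last rows are needed as a dataframe again, the list of Row objects can be passed back to spark.createDataFrame(), as in this sketch that reuses the spark session and dataframe from above:
Python3
# tail(2) returns a list of Row objects
last_rows = dataframe.tail(2)

# rebuild a dataframe from the returned rows
last_df = spark.createDataFrame(last_rows)
last_df.show()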