Saturday, November 16, 2024

Extract First and last N rows from PySpark DataFrame

In this article, we will extract the first N rows and the last N rows from a dataframe using PySpark in Python. To begin, we create a sample dataframe.

We first create a SparkSession object, giving it an app name through the builder and then calling the getOrCreate() method.

spark = SparkSession.builder.appName('sparkdf').getOrCreate()

Finally, we pass the list of data rows and the list of column names to the createDataFrame() method:

dataframe = spark.createDataFrame(data, columns)

Python3




# importing module
import pyspark
  
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
  
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
  
# list  of employee data with 5 row values
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]
  
# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']
  
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
  
print('Actual data in dataframe')
dataframe.show()


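Output:

Actual data in dataframe
+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
|          3|        bobby|   company 3|
|          4|       rohith|   company 2|
|          5|      gnanesh|   company 1|
+-----------+-------------+------------+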

Extracting first N rows

We can extract the first N rows using several methods, each discussed below with an example:

Method 1: Using head()

This function returns the top N rows of the given dataframe as a list of Row objects.

Syntax: dataframe.head(n)

where, 

  • n specifies the number of rows to extract, counting from the top
  • dataframe is the dataframe created from the nested lists using pyspark

Python3




print("Top 2 rows ")
  
# extract top 2 rows
a = dataframe.head(2)
print(a)
  
print("Top 1 row ")
  
# extract top 1 row
a = dataframe.head(1)
print(a)


Output:

Top 2 rows  

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'), 

Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]

Top 1 row  

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]

Method 2: Using first()

This function returns only the first row of the dataframe, as a single Row object rather than a list.

Syntax: dataframe.first()

  • It doesn’t take any parameter
  • dataframe is the dataframe name created from the nested lists using pyspark

Python3




print("Top row ")
  
# extract top  row
a = dataframe.first()
print(a)


Output:

Top row  

Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')

Method 3: Using show() 

This function prints the dataframe as a table, starting from the top row. It only displays the rows and does not return them.

Syntax: dataframe.show(n)

where,

  • dataframe is the input dataframe
  • n is the number of rows to display from the top; if n is not specified, show() prints the first 20 rows of the dataframe

Python3




# show() function to get 
# 2 rows
dataframe.show(2)


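Output:

+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
+-----------+-------------+------------+
only showing top 2 rows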

Extracting Last N rows

Extracting the last rows means getting the last N rows of the given dataframe. For this we use the tail() function, available from Spark 3.0 onward, which returns the last N rows as a list of Row objects.

Syntax: dataframe.tail(n)

where,

  • n is the number of rows to return, counting from the end
  • dataframe is the input dataframe

Example:

Python3




print("Last 2 rows ")
  
# extract last 2 rows
a = dataframe.tail(2)
print(a)
  
print("Last 1 row ")
  
# extract last 1 row
a = dataframe.tail(1)
print(a)


Output:

Last 2 rows  

[Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'), 

Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]

Last 1 row  

[Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]
