Friday, December 27, 2024

Get specific row from PySpark dataframe

In this article, we will discuss how to get a specific row from a PySpark dataframe.

Creating Dataframe for demonstration:

Python3




# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 5 rows
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 2"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME',
           'Company Name']

# creating a dataframe from the list of data
dataframe = spark.createDataFrame(data, columns)

# display dataframe
dataframe.show()


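Output:

+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          2|       ojaswi|   company 2|
|          3|        bobby|   company 3|
|          4|       rohith|   company 2|
|          5|      gnanesh|   company 1|
+-----------+-------------+------------+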

Method 1: Using collect()

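collect() returns all the rows of the dataframe as a list of Row objects, so a single row can be picked out by indexing that list. Because every row is brought back to the driver, this approach suits small dataframes.

Syntax: dataframe.collect()[index_position]

Where,

  • dataframe is the PySpark dataframe
  • index_position is the zero-based index of the row in the dataframe; negative indices count from the end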

Example: Python code to access rows

Python3




# collect the rows once and reuse the list,
# instead of running collect() for every lookup
rows = dataframe.collect()

# get first row
print(rows[0])

# get second row
print(rows[1])

# get last row
print(rows[-1])

# get third row
print(rows[2])


Output:

Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')

Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')

Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')

Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')

Method 2: Using show()

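show() prints the top n rows of the PySpark dataframe to the console as a table. It returns None, so it is meant for inspecting rows rather than retrieving them.

Syntax: dataframe.show(no_of_rows)

where no_of_rows is the number of rows to display (20 by default)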

Example: Python code to get the data using show() function

Python3




# show() prints the table itself and returns None,
# so it does not need to be wrapped in print()

# display only the top 2 rows
dataframe.show(2)

# display only the top 1 row
dataframe.show(1)

# display the dataframe (up to 20 rows by default)
dataframe.show()


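Output: each call prints a table of the requested rows: the first two rows, then the first row, then all five rows.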

Method 3: Using first()

first() returns only the first row of the dataframe, as a single Row object.

Syntax: dataframe.first()

Example: Python code to select the first row in the dataframe.

Python3




# display first row of the dataframe
print(dataframe.first())


Output:

Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')

Method 4: Using head()

head(n) returns the top n rows of the dataframe as a list of Row objects.

Syntax: dataframe.head(n)

where n is the number of rows to be returned

Example: Python code to get the top n rows.

Python3




# display only 1 row
print(dataframe.head(1))
  
# display only top 3 rows
print(dataframe.head(3))
  
# display only top 2 rows
print(dataframe.head(2))


Output:

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
 Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
 Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')]

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
 Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]

Method 5: Using tail()

tail(n) returns the last n rows of the dataframe as a list of Row objects. It is available from Spark 3.0 onwards.

Syntax: dataframe.tail(n)

where n is the number of rows to be returned from the end of the dataframe.

Example: Python code to get last n rows

Python3




# display only the last row
print(dataframe.tail(1))

# display the last 3 rows
print(dataframe.tail(3))

# display the last 2 rows
print(dataframe.tail(2))


Output:

[Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]

[Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
 Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
 Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]

[Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2'),
 Row(Employee ID='5', Employee NAME='gnanesh', Company Name='company 1')]

Method 6: Using select() with collect() method

select() chooses the columns to keep in each row. Combined with collect() and an index, it returns a particular row containing only those columns.

Syntax: dataframe.select([columns]).collect()[index]

where,

  • dataframe is the PySpark dataframe
  • columns is the list of columns to be included in each row
  • index is the zero-based index of the row to be returned

Example: Python code to select the particular row.

Python3




# select first row
print(dataframe.select(['Employee ID',
                        'Employee NAME',
                        'Company Name']).collect()[0])
  
# select third row
print(dataframe.select(['Employee ID',
                        'Employee NAME',
                        'Company Name']).collect()[2])
  
# select fourth row
print(dataframe.select(['Employee ID',
                        'Employee NAME',
                        'Company Name']).collect()[3])


Output:

Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')

Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3')

Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')

Method 7: Using take() method

take(n) also returns the top n rows, as a list of Row objects.

Syntax: dataframe.take(n)

where n is the number of rows to be selected

Example: Python code to select the top n rows.

Python3




# select top 2 rows
print(dataframe.take(2))
  
# select top 4 rows
print(dataframe.take(4))
  
# select top 1 row
print(dataframe.take(1))


Output:

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
 Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2')]

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1'),
 Row(Employee ID='2', Employee NAME='ojaswi', Company Name='company 2'),
 Row(Employee ID='3', Employee NAME='bobby', Company Name='company 3'),
 Row(Employee ID='4', Employee NAME='rohith', Company Name='company 2')]

[Row(Employee ID='1', Employee NAME='sravan', Company Name='company 1')]
