
Get number of rows and columns of PySpark dataframe

In this article, we will discuss how to get the number of rows and the number of columns of a PySpark DataFrame. To find the number of rows we will use the count() function, and to find the number of columns we will use the columns attribute together with the len() function.

  • df.count(): This function returns the number of rows in the DataFrame.
  • df.distinct().count(): This returns the number of distinct rows, i.e. rows that are not duplicated/repeated in the DataFrame.
  • df.columns: This attribute (not a function, so no parentheses) returns the list of column names present in the DataFrame.
  • len(df.columns): This counts the number of items in that list, i.e. the number of columns, as shown in the short sketch after this list.
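As a quick sketch of all four together (a minimal, hypothetical example that assumes an already created SparkSession named spark; the tiny DataFrame here is only a stand-in for the ones built in the examples below):

Python

# assuming `spark` is an existing SparkSession
df = spark.createDataFrame([(1, "a"), (2, "b"), (2, "b")], ["Id", "Value"])

print(df.count())             # total number of rows -> 3
print(df.distinct().count())  # number of distinct rows -> 2
print(df.columns)             # list of column names -> ['Id', 'Value']
print(len(df.columns))        # number of columns -> 2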

Example 1: Get the number of rows and number of columns of a DataFrame in PySpark.

Python

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Products.com") \
        .getOrCreate()
    return spk

# function to create DataFrame
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

# main function
if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [(1, "Direct-Cool Single Door Refrigerator", 12499),
                  (2, "Full HD Smart LED TV", 49999),
                  (3, "8.5 kg Washing Machine", 69999),
                  (4, "T-shirt", 1999),
                  (5, "Jeans", 3999),
                  (6, "Men's Running Shoes", 1499),
                  (7, "Combo Pack Face Mask", 999)]

    schm = ["Id", "Product Name", "Price"]

    # calling function to create DataFrame
    df = create_df(spark, input_data, schm)
    df.show()

    # extracting the number of rows from the DataFrame
    row = df.count()

    # extracting the number of columns from the DataFrame
    col = len(df.columns)

    # printing
    print(f'Dimension of the Dataframe is: {(row, col)}')
    print(f'Number of Rows are: {row}')
    print(f'Number of Columns are: {col}')


Output:
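Running this script prints the df.show() table followed by:

Dimension of the Dataframe is: (7, 3)
Number of Rows are: 7
Number of Columns are: 3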

Explanation:

  • To count the number of rows we use the count() function: df.count() returns the number of rows in the DataFrame, and we store the result in the variable named ‘row’.
  • To count the number of columns we use df.columns. Since this attribute returns the list of column names, we pass it to the len() function to count the items in the list; this gives us the total number of columns, which we store in the variable named ‘col’.

Example 2: Getting the number of distinct rows and the number of columns of the DataFrame.

Python

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Student_report.com") \
        .getOrCreate()
    return spk

# function to create DataFrame
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

# main function
if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [(1, "Shivansh", "Male", 20, 80),
                  (2, "Arpita", "Female", 18, 66),
                  (3, "Raj", "Male", 21, 90),
                  (4, "Swati", "Female", 19, 91),
                  (5, "Arpit", "Male", 20, 50),
                  (6, "Swaroop", "Male", 23, 65),
                  (6, "Swaroop", "Male", 23, 65),
                  (6, "Swaroop", "Male", 23, 65),
                  (7, "Reshabh", "Male", 19, 70),
                  (7, "Reshabh", "Male", 19, 70),
                  (8, "Dinesh", "Male", 20, 75),
                  (9, "Rohit", "Male", 21, 85),
                  (9, "Rohit", "Male", 21, 85),
                  (10, "Sanjana", "Female", 22, 87)]

    schm = ["Id", "Name", "Gender", "Age", "Percentage"]

    # calling function to create DataFrame
    df = create_df(spark, input_data, schm)
    df.show()

    # extracting the number of distinct rows from the DataFrame
    row = df.distinct().count()

    # extracting the total number of rows from the DataFrame
    all_rows = df.count()

    # extracting the number of columns from the DataFrame
    col = len(df.columns)

    # printing
    print(f'Dimension of the Dataframe is: {(row, col)}')
    print(f'Distinct Number of Rows are: {row}')
    print(f'Total Number of Rows are: {all_rows}')
    print(f'Number of Columns are: {col}')

Output:
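Running this script prints the df.show() table (all 14 rows, duplicates included) followed by:

Dimension of the Dataframe is: (10, 5)
Distinct Number of Rows are: 10
Total Number of Rows are: 14
Number of Columns are: 5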

Explanation:

  • To count the number of distinct rows we use df.distinct().count(), which returns the number of unique rows in the DataFrame, and we store the result in the variable named ‘row’.
  • To count the number of columns we again pass df.columns, the list of column names, to the len() function; this gives us the total number of columns, which we store in the variable named ‘col’.

 

Example 3: Getting the number of columns using the dtypes attribute.

In this example, after creating the DataFrame we count the number of rows using the count() function, and we count the number of columns using the dtypes attribute. dtypes returns a list of tuples, one per column, where each tuple contains the column name and its data type. Since there is exactly one tuple per column, the length of this list equals the number of columns, so len(df.dtypes) is another way to get the number of columns.
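For instance, for the student DataFrame built below, df.dtypes should look roughly like this (integer columns are inferred as bigint when the DataFrame is created from plain Python ints):

Python

# df.dtypes returns one (column name, data type) tuple per column, e.g.:
# [('Id', 'bigint'), ('Name', 'string'), ('Gender', 'string'),
#  ('Age', 'bigint'), ('Percentage', 'bigint')]
col = len(df.dtypes)  # one tuple per column -> 5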

 

Python

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Student_report.com") \
        .getOrCreate()
    return spk

# function to create DataFrame
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

# main function
if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [(1, "Shivansh", "Male", 20, 80),
                  (2, "Arpita", "Female", 18, 66),
                  (3, "Raj", "Male", 21, 90),
                  (4, "Swati", "Female", 19, 91),
                  (5, "Arpit", "Male", 20, 50),
                  (6, "Swaroop", "Male", 23, 65),
                  (7, "Reshabh", "Male", 19, 70),
                  (8, "Dinesh", "Male", 20, 75),
                  (9, "Rohit", "Male", 21, 85),
                  (10, "Sanjana", "Female", 22, 87)]

    schm = ["Id", "Name", "Gender", "Age", "Percentage"]

    # calling function to create DataFrame
    df = create_df(spark, input_data, schm)
    df.show()

    # extracting the number of rows from the DataFrame
    row = df.count()

    # extracting the number of columns from the DataFrame
    # using the dtypes attribute
    col = len(df.dtypes)

    # printing
    print(f'Dimension of the Dataframe is: {(row, col)}')
    print(f'Number of Rows are: {row}')
    print(f'Number of Columns are: {col}')

Output:
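Running this script prints the df.show() table followed by:

Dimension of the Dataframe is: (10, 5)
Number of Rows are: 10
Number of Columns are: 5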


Example 4: Getting the dimension of the PySpark DataFrame by converting it to a Pandas DataFrame.

In this example, after creating the DataFrame we convert the PySpark DataFrame to a Pandas DataFrame using the toPandas() function, i.e. by writing df.toPandas(). After the conversion we use the Pandas shape attribute to get the dimension of the DataFrame. shape returns a (rows, columns) tuple, so we can index it (or unpack it) to print the number of rows and the number of columns individually. Note that toPandas() collects the entire DataFrame onto the driver, so this approach is only practical for DataFrames small enough to fit in the driver's memory.
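The core of the idea is just two lines; here is a minimal sketch (assuming an existing PySpark DataFrame df, and that pandas is installed, which toPandas() requires):

Python

# collect the PySpark DataFrame to the driver as a Pandas DataFrame
pdf = df.toPandas()

# shape is a (rows, columns) tuple, so it can be unpacked directly
rows, cols = pdf.shape
print(rows, cols)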

 

Python

# importing necessary libraries
from pyspark.sql import SparkSession

# function to create SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Student_report.com") \
        .getOrCreate()
    return spk

# function to create DataFrame
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

# main function
if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [(1, "Shivansh", "Male", 20, 80),
                  (2, "Arpita", "Female", 18, 66),
                  (3, "Raj", "Male", 21, 90),
                  (4, "Swati", "Female", 19, 91),
                  (5, "Arpit", "Male", 20, 50),
                  (6, "Swaroop", "Male", 23, 65),
                  (7, "Reshabh", "Male", 19, 70),
                  (8, "Dinesh", "Male", 20, 75),
                  (9, "Rohit", "Male", 21, 85),
                  (10, "Sanjana", "Female", 22, 87)]

    schm = ["Id", "Name", "Gender", "Age", "Percentage"]

    # calling function to create DataFrame
    df = create_df(spark, input_data, schm)
    df.show()

    # converting the PySpark DataFrame to a Pandas DataFrame
    # using the toPandas() function
    new_df = df.toPandas()

    # using the Pandas shape attribute to get the
    # dimension of the DataFrame
    dimension = new_df.shape

    # printing
    print("Dimension of the Dataframe is: ", dimension)
    print(f'Number of Rows are: {dimension[0]}')
    print(f'Number of Columns are: {dimension[1]}')

Output:
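Running this script prints the df.show() table followed by:

Dimension of the Dataframe is:  (10, 5)
Number of Rows are: 10
Number of Columns are: 5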
