Convert PySpark DataFrame to Dictionary in Python

28 July 2024

This article shows how to convert a PySpark DataFrame to a Python dictionary in which the keys are column names and the values are lists of column values.

Before starting, we will create a sample DataFrame:

Python3

# Importing necessary libraries
from pyspark.sql import SparkSession
  
# Create a spark session
spark = SparkSession.builder.appName('DF_to_dict').getOrCreate()
  
# Create data in dataframe
data = [('Ram', '1991-04-01', 'M', 3000),
        ('Mike', '2000-05-19', 'M', 4000),
        ('Rohini', '1978-09-05', 'M', 4000),
        ('Maria', '1967-12-01', 'F', 4000),
        ('Jenis', '1980-02-17', 'F', 1200)]
  
# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]
  
# Create the spark dataframe
df = spark.createDataFrame(data=data,
                           schema=columns)
  
# Print the dataframe
df.show()

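Output :

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+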

Method 1: Using df.toPandas()

Convert the PySpark DataFrame to a pandas DataFrame using df.toPandas(). This collects all rows onto the driver, so it is suitable only when the data fits in driver memory.

Syntax: DataFrame.toPandas()

Return type: Returns a pandas DataFrame with the same content as the PySpark DataFrame.

Iterate over the columns and add each column's list of values to the dictionary, with the column name as the key.

Python3

# Declare an empty dictionary (named result so the
# built-in dict is not shadowed)
result = {}
  
# Convert the PySpark DataFrame to a pandas DataFrame;
# a separate name keeps the Spark DataFrame df available
pandas_df = df.toPandas()
  
# Traverse through each column
for column in pandas_df.columns:
  
    # Add key as column_name and
    # value as list of column values
    result[column] = pandas_df[column].tolist()
  
# Print the dictionary
print(result)

Output :

{'Name': ['Ram', 'Mike', 'Rohini', 'Maria', 'Jenis'],
 'DOB': ['1991-04-01', '2000-05-19', '1978-09-05', '1967-12-01', '1980-02-17'],
 'Gender': ['M', 'M', 'M', 'F', 'F'],
 'salary': [3000, 4000, 4000, 4000, 1200]}

Method 2: Using df.collect()

DataFrame.collect() returns all the records of the PySpark DataFrame to the driver as a list of Row objects. The values of each column can then be gathered from these rows.

Syntax: DataFrame.collect()

Return type: Returns all the records of the data frame as a list of rows.

Python3

# Collect all rows to the driver as a list of Row objects
rows = df.collect()
  
# Declare an empty dictionary (named result so the
# built-in dict is not shadowed)
result = {}
  
# Go through each column
for column in df.columns:
  
    # Add the column's values as a list with the column
    # name as the key. Reading the values straight from
    # the rows keeps their types; a NumPy array of mixed
    # strings and integers would turn every value into a string
    result[column] = [row[column] for row in rows]
  
# Print the dictionary
print(result)

Output :

{'Name': ['Ram', 'Mike', 'Rohini', 'Maria', 'Jenis'],
 'DOB': ['1991-04-01', '2000-05-19', '1978-09-05', '1967-12-01', '1980-02-17'],
 'Gender': ['M', 'M', 'M', 'F', 'F'],
 'salary': [3000, 4000, 4000, 4000, 1200]}
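The row-to-column step in Method 2 is a transposition, which can also be sketched with the built-in zip. The snippet below uses plain tuples as a stand-in for the result of df.collect(), an assumption made so that it runs without a Spark session; PySpark's Row is a tuple subclass, so the same pattern should apply to collected rows. The function name rows_to_column_dict and the variable names are illustrative and do not come from the original code.

```python
def rows_to_column_dict(rows, columns):
    """Transpose a list of row tuples into {column_name: [values]}."""
    if not rows:
        # No rows: every column maps to an empty list
        return {column: [] for column in columns}
    # zip(*rows) yields one tuple of values per column
    return {column: list(values)
            for column, values in zip(columns, zip(*rows))}


# Stand-ins for df.collect() and df.columns (assumed sample data)
rows = [('Ram', '1991-04-01', 'M', 3000),
        ('Mike', '2000-05-19', 'M', 4000)]
columns = ["Name", "DOB", "Gender", "salary"]

column_dict = rows_to_column_dict(rows, columns)
print(column_dict)
```

Because the values are never passed through an array, each one keeps its original Python type, so the salaries stay integers.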

Method 3: Using pandas.DataFrame.to_dict()

A pandas DataFrame can be converted directly into a dictionary using the to_dict() method.

Syntax: DataFrame.to_dict(orient='dict')

Parameters:

orient: Determines the layout of the returned dictionary. It takes values such as {'dict', 'list', 'series', 'split', 'records', 'index'}. With 'list', each column name maps to a list of that column's values.

Return type: Returns the dictionary corresponding to the data frame.

Code:

Python3

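# Convert the PySpark DataFrame to a pandas DataFrame;
# a separate name keeps the Spark DataFrame df available
pandas_df = df.toPandas()
  
# Convert the pandas DataFrame into a dictionary
# (named result so the built-in dict is not shadowed)
result = pandas_df.to_dict(orient='list')
  
# Print the dictionary
print(result)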

Output :

{'Name': ['Ram', 'Mike', 'Rohini', 'Maria', 'Jenis'],
 'DOB': ['1991-04-01', '2000-05-19', '1978-09-05', '1967-12-01', '1980-02-17'],
 'Gender': ['M', 'M', 'M', 'F', 'F'],
 'salary': [3000, 4000, 4000, 4000, 1200]}
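The orient parameter described above controls the shape of the result. As a rough illustration, the sketch below applies three of the listed values to a small pandas DataFrame. The DataFrame is built directly with pandas as a stand-in for the result of df.toPandas(), an assumption made so that the snippet runs without a Spark session, and the variable names are illustrative.

```python
import pandas as pd

# Small stand-in for the result of df.toPandas() (assumed sample data)
pandas_df = pd.DataFrame({"Name": ["Ram", "Mike"],
                          "salary": [3000, 4000]})

# 'list': column name -> list of column values
as_lists = pandas_df.to_dict(orient="list")

# 'records': one dictionary per row
as_records = pandas_df.to_dict(orient="records")

# 'dict' (the default): column name -> {index label -> value}
as_nested = pandas_df.to_dict(orient="dict")

print(as_lists)
print(as_records)
print(as_nested)
```

The 'list' layout matches the dictionaries built by hand in Methods 1 and 2, while 'records' is often more convenient when each row needs to be handled as a unit.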

The same approach works for a DataFrame with two columns. The example below creates a DataFrame with the columns 'Location' and 'House_price' and converts it to a dictionary.

Python3

# Importing necessary libraries
from pyspark.sql import SparkSession
  
# Create a spark session
spark = SparkSession.builder.appName('DF_to_dict').getOrCreate()
  
# Create data in dataframe
data = [('Hyderabad', 120000),
        ('Delhi', 124000),
        ('Mumbai', 344000),
        ('Guntur', 454000),
        ('Bandra', 111200)]
  
# Column names in dataframe
columns = ["Location", 'House_price']
  
# Create the spark dataframe
df = spark.createDataFrame(data=data, schema=columns)
  
# Print the dataframe
print('Dataframe : ')
df.show()
  
# Convert the PySpark DataFrame to a
# pandas DataFrame
pandas_df = df.toPandas()
  
# Convert the pandas DataFrame into a dictionary
# (named result so the built-in dict is not shadowed)
result = pandas_df.to_dict(orient='list')
  
# Print the dictionary
print('Dictionary :')
print(result)

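Output :

Dataframe : 
+---------+-----------+
| Location|House_price|
+---------+-----------+
|Hyderabad|     120000|
|    Delhi|     124000|
|   Mumbai|     344000|
|   Guntur|     454000|
|   Bandra|     111200|
+---------+-----------+

Dictionary :
{'Location': ['Hyderabad', 'Delhi', 'Mumbai', 'Guntur', 'Bandra'],
 'House_price': [120000, 124000, 344000, 454000, 111200]}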
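A two-column DataFrame is sometimes wanted as a direct key-to-value mapping instead of a dictionary of lists. The following is a minimal sketch of that variant and is not part of the original example. It assumes a pandas DataFrame built directly as a stand-in for the result of df.toPandas(), and price_by_location is an illustrative name.

```python
import pandas as pd

# Stand-in for the pandas DataFrame returned by df.toPandas()
# (assumed sample data)
pandas_df = pd.DataFrame({
    "Location": ["Hyderabad", "Delhi", "Mumbai"],
    "House_price": [120000, 124000, 344000],
})

# Pair each Location with its House_price; if a Location
# appears more than once, the last row wins
price_by_location = dict(zip(pandas_df["Location"].tolist(),
                             pandas_df["House_price"].tolist()))

print(price_by_location)
```

This layout only makes sense when the key column holds unique values, since duplicate keys silently overwrite each other.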
