
Convert PySpark DataFrame to Dictionary in Python

In this article, we are going to see how to convert a PySpark DataFrame to a dictionary, where the keys are the column names and the values are lists of column values.

Before starting, we will create a sample DataFrame:

Python3




# Importing necessary libraries
from pyspark.sql import SparkSession
  
# Create a spark session
spark = SparkSession.builder.appName('DF_to_dict').getOrCreate()
  
# Create data in dataframe
data = [('Ram', '1991-04-01', 'M', 3000),
        ('Mike', '2000-05-19', 'M', 4000),
        ('Rohini', '1978-09-05', 'M', 4000),
        ('Maria', '1967-12-01', 'F', 4000),
        ('Jenis', '1980-02-17', 'F', 1200)]
  
# Column names in dataframe
columns = ["Name", "DOB", "Gender", "salary"]
  
# Create the spark dataframe
df = spark.createDataFrame(data=data,
                           schema=columns)
  
# Print the dataframe
df.show()


Output:

+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+

Method 1: Using df.toPandas()

First, convert the PySpark DataFrame to a pandas DataFrame using df.toPandas(). Note that toPandas() collects the entire dataset to the driver, so it is only suitable when the data fits in driver memory.

Syntax: DataFrame.toPandas()

Return type: Returns a pandas DataFrame with the same contents as the PySpark DataFrame.

Then iterate over the columns and add the list of each column's values to the dictionary, with the column name as the key (a one-line equivalent is sketched after the output below).

Python3




# Declare an empty dictionary
df_dict = {}

# Convert the PySpark DataFrame to a
# pandas DataFrame (keep the original
# Spark DataFrame in df for later methods)
pandas_df = df.toPandas()

# Traverse each column
for column in pandas_df.columns:

    # Add the column name as the key and
    # the list of column values as the value
    df_dict[column] = pandas_df[column].values.tolist()

# Print the dictionary
print(df_dict)


Output:

{'Name': ['Ram', 'Mike', 'Rohini', 'Maria', 'Jenis'],
 'DOB': ['1991-04-01', '2000-05-19', '1978-09-05', '1967-12-01', '1980-02-17'],
 'Gender': ['M', 'M', 'M', 'F', 'F'],
 'salary': [3000, 4000, 4000, 4000, 1200]}
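
As a side note, the column loop above can be compressed into a single dictionary comprehension. This is just an equivalent sketch, using the pandas_df variable from the code above:

Python3

# Build the same column -> list-of-values dictionary
# with a dict comprehension over the pandas columns
df_dict = {column: pandas_df[column].tolist() for column in pandas_df.columns}
print(df_dict)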

Method 2: Using df.collect()

df.collect() converts the PySpark DataFrame into a list of rows by returning every record of the DataFrame to the driver. We then stack the rows into a NumPy array and build the dictionary column by column.

Syntax: DataFrame.collect()

Return type: Returns all the records of the DataFrame as a list of Row objects.
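
Before building the full dictionary, here is a quick sketch of what collect() returns for the sample DataFrame above; each Row behaves like a named tuple and also provides an asDict() method:

Python3

# collect() brings every record to the driver as a list of Row objects
first_row = df.collect()[0]
print(first_row)            # Row(Name='Ram', DOB='1991-04-01', Gender='M', salary=3000)
print(first_row['Name'])    # fields are accessible by name
print(first_row.asDict())   # a single Row can also be turned into a dict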

Python3




import numpy as np

# Convert the DataFrame into a list
# of rows
rows = [list(row) for row in df.collect()]

# Convert the list into a NumPy array
ar = np.array(rows)

# Declare an empty dictionary
df_dict = {}

# Go through each column
for i, column in enumerate(df.columns):

    # Add the ith column's values to the dict,
    # keyed by the ith column name
    df_dict[column] = list(ar[:, i])

# Print the dictionary
print(df_dict)


Output:

{'Name': ['Ram', 'Mike', 'Rohini', 'Maria', 'Jenis'],
 'DOB': ['1991-04-01', '2000-05-19', '1978-09-05', '1967-12-01', '1980-02-17'],
 'Gender': ['M', 'M', 'M', 'F', 'F'],
 'salary': ['3000', '4000', '4000', '4000', '1200']}
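
Because a NumPy array holds a single dtype, every value above (including salary) has been cast to a string. As a sketch of a type-preserving alternative, the dictionary can be built directly from the collected Row objects, skipping NumPy entirely:

Python3

# Build the column -> list-of-values dictionary straight from the
# collected rows; original types (e.g. integer salaries) are preserved
rows = df.collect()
df_dict = {column: [row[column] for row in rows] for column in df.columns}
print(df_dict)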

Method 3: Using pandas.DataFrame.to_dict()

A pandas DataFrame can be converted directly into a dictionary using the to_dict() method.

Syntax: DataFrame.to_dict(orient='dict')

Parameters:

  • orient: determines the structure of the values of the dictionary. It accepts values such as {'dict', 'list', 'series', 'split', 'records', 'index'}.

Return type: Returns the dictionary corresponding to the data frame.
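
The orient value controls the shape of the result. As a rough sketch (assuming pandas_df is the pandas DataFrame obtained with toPandas(), as in Method 1):

Python3

# orient='dict'    -> {column: {row_index: value}}   (the default)
# orient='list'    -> {column: [values]}             (used in this article)
# orient='records' -> [{column: value}, ...], one dictionary per row
print(pandas_df.to_dict(orient='dict'))
print(pandas_df.to_dict(orient='list'))
print(pandas_df.to_dict(orient='records'))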

Code:

Python3




# Convert the PySpark DataFrame to a
# pandas DataFrame
pandas_df = df.toPandas()

# Convert the pandas DataFrame into a
# dictionary of column -> list of values
df_dict = pandas_df.to_dict(orient='list')

# Print the dictionary
print(df_dict)


Output:

{'Name': ['Ram', 'Mike', 'Rohini', 'Maria', 'Jenis'],
 'DOB': ['1991-04-01', '2000-05-19', '1978-09-05', '1967-12-01', '1980-02-17'],
 'Gender': ['M', 'M', 'M', 'F', 'F'],
 'salary': [3000, 4000, 4000, 4000, 1200]}

To convert a DataFrame having two columns to a dictionary, first create a DataFrame with the two columns 'Location' and 'House_price':

Python3




# Importing necessary libraries
from pyspark.sql import SparkSession
  
# Create a spark session
spark = SparkSession.builder.appName('DF_to_dict').getOrCreate()
  
# Create data in dataframe
data = [('Hyderabad', 120000),
        ('Delhi', 124000),
        ('Mumbai', 344000),
        ('Guntur', 454000),
        ('Bandra', 111200)]
  
# Column names in dataframe
columns = ["Location", 'House_price']
  
# Create the spark dataframe
df = spark.createDataFrame(data=data, schema=columns)
  
# Print the dataframe
print('Dataframe : ')
df.show()
  
# Convert the PySpark DataFrame to a
# pandas DataFrame
pandas_df = df.toPandas()

# Convert the pandas DataFrame into a
# dictionary
df_dict = pandas_df.to_dict(orient='list')

# Print the dictionary
print('Dictionary :')
print(df_dict)


Output:

Dataframe :
+---------+-----------+
| Location|House_price|
+---------+-----------+
|Hyderabad|     120000|
|    Delhi|     124000|
|   Mumbai|     344000|
|   Guntur|     454000|
|   Bandra|     111200|
+---------+-----------+

Dictionary :
{'Location': ['Hyderabad', 'Delhi', 'Mumbai', 'Guntur', 'Bandra'],
 'House_price': [120000, 124000, 344000, 454000, 111200]}
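
With only two columns, it is often handier to map one column directly onto the other instead of keeping two parallel lists. As a small sketch (price_by_location is just an illustrative name, built from the pandas_df variable above):

Python3

# Map each location to its house price instead of keeping
# two parallel lists of equal length
price_by_location = dict(zip(pandas_df['Location'], pandas_df['House_price']))
print(price_by_location)
# {'Hyderabad': 120000, 'Delhi': 124000, 'Mumbai': 344000,
#  'Guntur': 454000, 'Bandra': 111200}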
