In this article, we will see how to convert a PySpark data frame to a dictionary in which the keys are the column names and the values are lists of the column values.
Before starting, we will create a sample Dataframe:
Python3
# Import the necessary library
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName('DF_to_dict').getOrCreate()

# Create the data for the data frame
data = [('Ram', '1991-04-01', 'M', 3000),
        ('Mike', '2000-05-19', 'M', 4000),
        ('Rohini', '1978-09-05', 'M', 4000),
        ('Maria', '1967-12-01', 'F', 4000),
        ('Jenis', '1980-02-17', 'F', 1200)]

# Column names of the data frame
columns = ["Name", "DOB", "Gender", "salary"]

# Create the Spark data frame
df = spark.createDataFrame(data=data, schema=columns)

# Print the data frame
df.show()
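Output :
+------+----------+------+------+
|  Name|       DOB|Gender|salary|
+------+----------+------+------+
|   Ram|1991-04-01|     M|  3000|
|  Mike|2000-05-19|     M|  4000|
|Rohini|1978-09-05|     M|  4000|
| Maria|1967-12-01|     F|  4000|
| Jenis|1980-02-17|     F|  1200|
+------+----------+------+------+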
Method 1: Using df.toPandas()
Convert the PySpark data frame to a pandas data frame using df.toPandas().
Syntax: DataFrame.toPandas()
Return type: Returns a pandas data frame with the same content as the PySpark data frame.
Then iterate over the columns and add each column's list of values to the dictionary, with the column name as the key.
Python3
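# Declare an empty dictionary
# (named result so the built-in dict is not shadowed)
result = {}

# Convert the PySpark DataFrame to a pandas DataFrame.
# A new name is used so that df still refers to the PySpark DataFrame.
pandas_df = df.toPandas()

# Traverse each column
for column in pandas_df.columns:
    # Add the column name as the key and
    # the list of column values as the value
    result[column] = pandas_df[column].values.tolist()

# Print the dictionary
print(result)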
Output :
{'Name': ['Ram', 'Mike', 'Rohini', 'Maria', 'Jenis'],
'DOB': ['1991-04-01', '2000-05-19', '1978-09-05', '1967-12-01', '1980-02-17'],
'Gender': ['M', 'M', 'M', 'F', 'F'],
'salary': [3000, 4000, 4000, 4000, 1200]}
Method 2: Using df.collect()
df.collect() returns all the records of the PySpark data frame as a list of rows. Convert each row to a list, then build the dictionary column by column.
Syntax: DataFrame.collect()
Return type: Returns all the records of the data frame as a list of rows.
Python3
import numpy as np

# Convert the data frame into a list of rows
rows = [list(row) for row in df.collect()]

# Convert the list into a NumPy array
# (mixed strings and integers are all stored as strings)
ar = np.array(rows)

# Declare an empty dictionary
# (named result so the built-in dict is not shadowed)
result = {}

# Go through each column
for i, column in enumerate(df.columns):
    # Add the ith column as the value,
    # with the ith column name as the key
    result[column] = ar[:, i].tolist()

# Print the dictionary
print(result)
Output :
{'Name': ['Ram', 'Mike', 'Rohini', 'Maria', 'Jenis'],
'DOB': ['1991-04-01', '2000-05-19', '1978-09-05', '1967-12-01', '1980-02-17'],
'Gender': ['M', 'M', 'M', 'F', 'F'],
'salary': ['3000', '4000', '4000', '4000', '1200']}
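In the output above, the salary values come back as strings, because a NumPy array holds a single data type and the mixed rows are stored as strings. If the original types matter, one possible alternative is to transpose the rows with zip() instead of NumPy. The sketch below is only an illustration: plain tuples stand in for the rows that df.collect() would return (an assumption made so that it runs without a Spark session).

```python
# Plain tuples stand in for the rows returned by df.collect()
# (an assumption, so the sketch runs without a Spark session).
rows = [
    ('Ram', '1991-04-01', 'M', 3000),
    ('Mike', '2000-05-19', 'M', 4000),
    ('Rohini', '1978-09-05', 'M', 4000),
    ('Maria', '1967-12-01', 'F', 4000),
    ('Jenis', '1980-02-17', 'F', 1200),
]
columns = ["Name", "DOB", "Gender", "salary"]

# zip(*rows) transposes the rows into one tuple per column;
# pairing each tuple with its column name builds the dictionary.
result = {column: list(values) for column, values in zip(columns, zip(*rows))}

print(result["salary"])  # [3000, 4000, 4000, 4000, 1200]
```

Because no array conversion is involved, the integers in the salary column stay integers.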
Method 3: Using pandas.DataFrame.to_dict()
A pandas data frame can be converted directly into a dictionary using the to_dict() method.
Syntax: DataFrame.to_dict(orient='dict')
Parameters:
- orient: Determines the structure of the returned dictionary and the type of its values. It takes values such as {'dict', 'list', 'series', 'split', 'records', 'index'}
Return type: Returns the dictionary corresponding to the data frame.
Code:
Python3
# Convert the PySpark DataFrame to a pandas DataFrame
pandas_df = df.toPandas()

# Convert the pandas DataFrame into a dictionary
# that maps each column name to a list of values
result = pandas_df.to_dict(orient='list')

# Print the dictionary
print(result)
Output :
{'Name': ['Ram', 'Mike', 'Rohini', 'Maria', 'Jenis'],
'DOB': ['1991-04-01', '2000-05-19', '1978-09-05', '1967-12-01', '1980-02-17'],
'Gender': ['M', 'M', 'M', 'F', 'F'],
'salary': [3000, 4000, 4000, 4000, 1200]}
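To see how the orient parameter changes the result, here is a small sketch that compares three of its values. It builds a two-row pandas data frame directly (a hypothetical subset of the sample data, so no Spark session is needed); the variable names are illustrative only.

```python
import pandas as pd

# A small hypothetical frame reusing two columns of the sample data
pdf = pd.DataFrame({"Name": ["Ram", "Mike"], "salary": [3000, 4000]})

# 'list': column name -> list of values
as_list = pdf.to_dict(orient="list")

# 'records': one dictionary per row
as_records = pdf.to_dict(orient="records")

# 'dict' (the default): column name -> {row index -> value}
as_dict = pdf.to_dict(orient="dict")

print(as_list)     # {'Name': ['Ram', 'Mike'], 'salary': [3000, 4000]}
print(as_records)  # [{'Name': 'Ram', 'salary': 3000}, {'Name': 'Mike', 'salary': 4000}]
print(as_dict)     # {'Name': {0: 'Ram', 1: 'Mike'}, 'salary': {0: 3000, 1: 4000}}
```

The 'list' orientation matches the column-name-to-values dictionary this article is after, while 'records' is usually more convenient when each row should be handled as one unit.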
Example: converting a data frame with two columns to a dictionary. Create a data frame with two columns named 'Location' and 'House_price', then convert it with to_dict().
Python3
# Import the necessary library
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName('DF_to_dict').getOrCreate()

# Create the data for the data frame
data = [('Hyderabad', 120000),
        ('Delhi', 124000),
        ('Mumbai', 344000),
        ('Guntur', 454000),
        ('Bandra', 111200)]

# Column names of the data frame
columns = ["Location", 'House_price']

# Create the Spark data frame
df = spark.createDataFrame(data=data, schema=columns)

# Print the data frame
print('Dataframe : ')
df.show()

# Convert the PySpark DataFrame to a pandas DataFrame
pandas_df = df.toPandas()

# Convert the pandas DataFrame into a dictionary
result = pandas_df.to_dict(orient='list')

# Print the dictionary
print('Dictionary :')
print(result)
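Output :
Dataframe : 
+---------+-----------+
| Location|House_price|
+---------+-----------+
|Hyderabad|     120000|
|    Delhi|     124000|
|   Mumbai|     344000|
|   Guntur|     454000|
|   Bandra|     111200|
+---------+-----------+

Dictionary :
{'Location': ['Hyderabad', 'Delhi', 'Mumbai', 'Guntur', 'Bandra'], 'House_price': [120000, 124000, 344000, 454000, 111200]}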
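With exactly two columns, it can sometimes be more useful to map each value of the first column to the matching value of the second. This goes slightly beyond the column-wise dictionary built in this example, so treat it as an optional sketch. It starts from a column-wise dictionary written out by hand, and the name price_by_location is only illustrative.

```python
# Column-wise dictionary for the two-column example,
# written out by hand so the sketch runs without Spark
column_dict = {
    'Location': ['Hyderabad', 'Delhi', 'Mumbai', 'Guntur', 'Bandra'],
    'House_price': [120000, 124000, 344000, 454000, 111200],
}

# Pair each location with its price and build a key -> value mapping
price_by_location = dict(zip(column_dict['Location'], column_dict['House_price']))

print(price_by_location['Delhi'])  # 124000
```

This only works cleanly when the values in the key column are unique; with repeated keys, later rows overwrite earlier ones.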