In this article, we discuss how to create a PySpark DataFrame from a nested dictionary.
We will use the createDataFrame() method from PySpark to build the DataFrame. For this, we iterate over the nested dictionary with its items() method, which yields each outer key together with its inner dictionary, and construct a Row from each pair:
[Row(**{'': k, **v}) for k, v in data.items()]
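Before involving Spark, it helps to see what the dictionary-unpacking expression `{'': k, **v}` produces on its own: the outer key lands under an empty-string column name, and the inner dictionary's key-value pairs are flattened alongside it. The sketch below uses plain dictionaries instead of Row objects so it runs without a Spark session; the sample values are illustrative.

```python
# Plain-Python illustration of the merge used in the Row comprehension:
# {'': k, **v} stores the outer key under the '' key and unpacks the
# inner dictionary's fields next to it.
data = {
    'student_1': {'student id': 7058, 'country': 'India'},
    'student_2': {'student id': 7059, 'country': 'Srilanka'},
}

rows = [{'': k, **v} for k, v in data.items()]
print(rows)
# [{'': 'student_1', 'student id': 7058, 'country': 'India'},
#  {'': 'student_2', 'student id': 7059, 'country': 'Srilanka'}]
```

Wrapping each merged dictionary as `Row(**d)` gives exactly the row data passed to createDataFrame() in the examples below.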
Example 1: Python program to create student records from a nested dictionary whose inner dictionaries hold address fields (country, state, district)
Python3
# importing module
import pyspark

# importing sparksession and Row from pyspark.sql module
from pyspark.sql import SparkSession
from pyspark.sql import Row

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# creating nested dictionary
data = {
    'student_1': {
        'student id': 7058,
        'country': 'India',
        'state': 'AP',
        'district': 'Guntur'
    },
    'student_2': {
        'student id': 7059,
        'country': 'Srilanka',
        'state': 'X',
        'district': 'Y'
    }
}

# taking row data: the outer key goes under an empty-string column name
rowdata = [Row(**{'': k, **v}) for k, v in data.items()]

# creating the pyspark dataframe
final = spark.createDataFrame(rowdata).select(
    'student id', 'country', 'state', 'district')

# display pyspark dataframe
final.show()
Output:
+----------+--------+-----+--------+
|student id| country|state|district|
+----------+--------+-----+--------+
|      7058|   India|   AP|  Guntur|
|      7059|Srilanka|    X|       Y|
+----------+--------+-----+--------+
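Note that the outer keys (student_1, student_2) travel along under the empty-string column name, which is why the select() above drops them. If you want to keep them, you can merge them in under a readable key instead; the sketch below does this in plain Python (the column name `name` is our choice, not part of the original example).

```python
data = {
    'student_1': {'student id': 7058, 'country': 'India'},
    'student_2': {'student id': 7059, 'country': 'Srilanka'},
}

# Use a real column name for the outer key instead of ''.
rowdata = [{'name': k, **v} for k, v in data.items()]
print(rowdata[0]['name'])  # student_1
```

Each of these dictionaries can then be wrapped as Row(**d) and passed to createDataFrame() exactly as before, giving a selectable `name` column.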
Example 2: Python program to create a DataFrame from a nested dictionary whose inner dictionaries hold 3 keys (3 columns)
Python3
# importing module
import pyspark

# importing sparksession and Row from pyspark.sql module
from pyspark.sql import SparkSession
from pyspark.sql import Row

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# creating nested dictionary
data = {
    'student_1': {
        'student id': 7058,
        'country': 'India',
        'state': 'AP'
    },
    'student_2': {
        'student id': 7059,
        'country': 'Srilanka',
        'state': 'X'
    }
}

# taking row data: the outer key goes under an empty-string column name
rowdata = [Row(**{'': k, **v}) for k, v in data.items()]

# creating the pyspark dataframe
final = spark.createDataFrame(rowdata).select(
    'student id', 'country', 'state')

# display pyspark dataframe
final.show()
Output:
+----------+--------+-----+
|student id| country|state|
+----------+--------+-----+
|      7058|   India|   AP|
|      7059|Srilanka|    X|
+----------+--------+-----+