In this article, we are going to see how to convert a PySpark data frame to a JSON array in Python.
In Apache Spark, a data frame is a distributed collection of data organized into named columns. It is similar to a spreadsheet or a SQL table, with rows and columns. You can use a data frame to store and manipulate tabular data in a distributed environment. DataFrames are designed to be expressive, efficient, and flexible, and they are a core abstraction of Spark's SQL module.
What is a JSON array?
A JSON (JavaScript Object Notation) array is a data structure that consists of an ordered list of values. It is often used to transmit data between a server and a web application, or between two different applications. JSON arrays are written in a syntax similar to that of JavaScript arrays, with square brackets containing a list of values separated by commas.
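For example, a JSON array holding two objects looks like this:

[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]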
Methods to convert a DataFrame to a JSON array in Pyspark:
- Using the toJSON() method
- Using the toPandas() method
- Using the write.json() method
Method 1: Using the toJSON() method
The toJSON() method in PySpark converts each row of a Spark data frame into a JSON string and returns the result as an RDD of strings. Calling collect() on that RDD gathers the JSON strings to the driver as a Python list, which serves as the JSON array.
Stepwise implementation:
Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.
from pyspark.sql import SparkSession
Step 2: Create a spark session using the getOrCreate() function.
spark = SparkSession.builder.appName("MyApp").getOrCreate()
Step 3: Create a data frame with sample data.
df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20), (3, "Charlie", 30)], ["id", "name", "age"])
Step 4: Print the data frame.
df.show()
Step 5: Use the toJSON() method to convert the data frame to a JSON array.
json_array = df.toJSON().collect()
Step 6: Finally, print the JSON array.
print("JSON array:",json_array)
Example:
In this example, we have created a data frame with three columns, id, name, and age, and converted it to a JSON array using the toJSON() method. In the output, the data frame as well as the JSON array are printed.
Python3
# Importing SparkSession
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Create a DataFrame
df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20), (3, "Charlie", 30)],
                           ["id", "name", "age"])
print("Dataframe: ")
df.show()

# Convert the DataFrame to a JSON array
json_array = df.toJSON().collect()

# Print the JSON array
print("JSON array:", json_array)
Output:
Dataframe: 
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 10|
|  2|    Bob| 20|
|  3|Charlie| 30|
+---+-------+---+

JSON array: ['{"id":1,"name":"Alice","age":10}', '{"id":2,"name":"Bob","age":20}', '{"id":3,"name":"Charlie","age":30}']
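Note that the elements of json_array are JSON strings, not Python dictionaries. If you need dictionaries, you can parse each element with Python's standard json module; a minimal sketch, assuming the json_array produced above:

import json

# Parse each JSON string in the list into a Python dictionary
records = [json.loads(row) for row in json_array]
print(records[0]["name"])  # Alice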
Method 2: Using the toPandas() method
In this method, we will convert the Spark data frame to a pandas data frame with three columns id, name, and age, and then convert it to a JSON string using the to_json() method. Note that toPandas() collects the entire dataset to the driver, so this approach is only suitable for data frames that fit in memory.
Stepwise implementation:
Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.
from pyspark.sql import SparkSession
Step 2: Create a spark session using the getOrCreate() function.
spark = SparkSession.builder.appName("MyApp").getOrCreate()
Step 3: Create a data frame with sample data.
df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20), (3, "Charlie", 30)], ["id", "name", "age"])
Step 4: Print the data frame.
df.show()
Step 5: Convert the data frame to a pandas data frame.
pandas_df = df.toPandas()
Step 6: Convert the pandas data frame to a JSON array.
json_data = pandas_df.to_json(orient='records')
Step 7: Finally, print the JSON array.
print("JSON array:",json_data)
Example:
Python3
# Importing SparkSession
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Create a DataFrame
df = spark.createDataFrame([(1, "Joseph", 10), (2, "Jack", 20), (3, "Elon", 30)],
                           ["id", "name", "age"])
print("Dataframe: ")
df.show()

# Convert the Spark DataFrame to a pandas DataFrame
pandas_df = df.toPandas()

# Convert the pandas DataFrame to a JSON string
json_data = pandas_df.to_json(orient='records')

# Print the JSON array
print("JSON array:", json_data)
Output:
Dataframe: 
+---+------+---+
| id|  name|age|
+---+------+---+
|  1|Joseph| 10|
|  2|  Jack| 20|
|  3|  Elon| 30|
+---+------+---+

JSON array: [{"id":1,"name":"Joseph","age":10},{"id":2,"name":"Jack","age":20},{"id":3,"name":"Elon","age":30}]
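Unlike toJSON().collect(), which returns a list of strings, to_json(orient='records') returns the entire array as a single string. If you need a list of Python dictionaries instead, you can parse it with the standard json module; a minimal sketch, assuming the json_data string produced above:

import json

# Parse the JSON array string into a list of Python dictionaries
records = json.loads(json_data)
print(records[1]["name"])  # Jack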
Method 3: Using the write.json() method
In this method, we will use write.json() to save the data frame as JSON. Note that this creates a directory called data.json containing a set of part files with the JSON data. If we want the JSON data in a single file, we can call coalesce(1) on the data frame to reduce it to one partition before calling write.json(), so that the output directory contains a single part file.
Stepwise implementation:
Step 1: First of all, import the required class, i.e., SparkSession. The SparkSession class is used to create the session.
from pyspark.sql import SparkSession
Step 2: Create a spark session using the getOrCreate() function.
spark = SparkSession.builder.appName("MyApp").getOrCreate()
Step 3: Create a data frame with sample data.
df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20), (3, "Charlie", 30)], ["id", "name", "age"])
Step 4: Use write.json() to write the data frame to a directory of JSON files.
df.write.json('data.json')
Step 5: Finally, write the data as a single JSON file by reducing the data frame to one partition with coalesce(1).
df.coalesce(1).write.json('data_merged.json')
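Keep in mind that Spark raises an error if the target path already exists, so rerunning the script will fail unless the old output is removed. One option is to opt into overwriting via the writer's mode() setting; a minimal sketch:

# Overwrite any existing output at the target path
df.coalesce(1).write.mode("overwrite").json('data_merged.json')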
Example:
In this example, we created a data frame with three columns id, name, and age. The code below first writes a directory containing multiple JSON files, and then writes the same data again as a single part file inside a directory named data_merged.json.
Python3
# Importing SparkSession
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Create a DataFrame
df = spark.createDataFrame([(1, "Donald", 51), (2, "Riya", 23), (3, "Vani", 22)],
                           ["id", "name", "age"])

# Write the DataFrame to a directory of JSON files
df.write.json('data.json')

# Write the same data as a single part file
df.coalesce(1).write.json('data_merged.json')

# Printing the data frame
df.show()
Output: Spark writes the data in JSON Lines format, i.e., one JSON object per line rather than a bracketed array. The part file inside data_merged.json contains:
{"id":1,"name":"Donald","age":51}
{"id":2,"name":"Riya","age":23}
{"id":3,"name":"Vani","age":22}