
DataFrame to JSON Array in Spark in Python

In this article, we are going to see how to convert a data frame to a JSON array using PySpark in Python.

In Apache Spark, a data frame is a distributed collection of data organized into named columns. It is similar to a spreadsheet or a SQL table, with rows and columns. You can use a data frame to store and manipulate tabular data in a distributed environment. DataFrames are designed to be expressive, efficient, and flexible, and they are a key component of Spark SQL, Spark's module for structured data processing.

What is a JSON array?

A JSON (JavaScript Object Notation) array is a data structure that consists of an ordered list of values. It is often used to transmit data between a server and a web application, or between two different applications. JSON arrays are written in a syntax similar to that of JavaScript arrays, with square brackets containing a list of values separated by commas.
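As a quick illustration, here is a small JSON array parsed in plain Python (a minimal sketch using the standard json module; the payload string is made up for this example):

import json

# A JSON array: square brackets around a comma-separated list of values
payload = '[{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]'
records = json.loads(payload)  # parses into a Python list of dictionaries
print(records[0]["name"])      # Alice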

Methods to convert a DataFrame to a JSON array in PySpark:

  • Using the toJSON() method
  • Using the toPandas() method
  • Using the write.json() method

Method 1: Using the toJSON() method

The toJSON() method in PySpark converts a Spark data frame into an RDD of JSON strings, with one JSON document per row. It accepts a single optional argument, use_unicode, which controls the string encoding of the result. Collecting the resulting RDD with collect() yields a Python list of JSON strings, i.e., a JSON array.
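For example, a quick check (assuming a data frame df like the one created in the steps below) shows that toJSON() returns an RDD rather than a single string:

json_rdd = df.toJSON()   # RDD of JSON strings, one per row
print(json_rdd.first())  # e.g. '{"id":1,"name":"Alice","age":10}'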

Stepwise implementation:

Step 1: First of all, import the required library, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Create a Spark session using the getOrCreate() function.

spark = SparkSession.builder.appName("MyApp").getOrCreate()

Step 3: Create a data frame with sample data.

df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20), (3, "Charlie", 30)],
                           ["id", "name", "age"])

Step 4: Print the data frame.

df.show()

Step 5: Use the toJSON() method followed by collect() to convert the data frame to a JSON array.

json_array = df.toJSON().collect()

Step 6: Finally, print the JSON array.

print("JSON array:", json_array)

Example:

In this example, we have created a data frame with three columns, id, name, and age, and converted it to a JSON array using the toJSON() method. In the output, the data frame as well as the JSON array are printed.

Python3

# Importing SparkSession
from pyspark.sql import SparkSession
  
# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
  
# Create a DataFrame
df = spark.createDataFrame([(1, "Alice", 10),
                            (2, "Bob", 20),
                            (3, "Charlie", 30)],
                           ["id", "name", "age"])
  
print("Dataframe: ")
df.show()
  
# Convert the DataFrame to a JSON array
json_array = df.toJSON().collect()
  
# Print the JSON array
print("JSON array:",json_array)


Output:

Dataframe: 
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 10|
|  2|    Bob| 20|
|  3|Charlie| 30|
+---+-------+---+

JSON array:
[
  '{"id":1,"name":"Alice","age":10}',
  '{"id":2,"name":"Bob","age":20}',
  '{"id":3,"name":"Charlie","age":30}'
]
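Since collect() returns a list of JSON strings, you can parse them into Python dictionaries with the standard json module if you need structured access (a small follow-up sketch reusing the json_array variable from above):

import json

# Each element of json_array is a JSON string for one row
rows = [json.loads(s) for s in json_array]
print(rows[0]["name"])  # Alice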

Method 2: Using the toPandas() method

In this method, we will convert the Spark data frame (with three columns: id, name, and age) to a pandas data frame and then convert that to a JSON string using the pandas to_json() method.

Stepwise implementation:

Step 1: First of all, import the required library, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Create a Spark session using the getOrCreate() function.

spark = SparkSession.builder.appName("MyApp").getOrCreate()

Step 3: Create a data frame with sample data.

df = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20), (3, "Charlie", 30)],
                           ["id", "name", "age"])

Step 4: Print the data frame.

df.show()

Step 5: Convert the data frame to a pandas data frame.

pandas_df = df.toPandas()

Step 6: Convert the pandas data frame to a JSON array string using the to_json() method with orient='records'.

json_data = pandas_df.to_json(orient='records')

Step 7: Finally, print the JSON array.

print("JSON array:", json_data)

Example:

Python3

# Importing SparkSession
from pyspark.sql import SparkSession
  
# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
  
# Create a DataFrame
df = spark.createDataFrame([(1, "Joseph", 10),
                             (2, "Jack", 20),
                            (3, "Elon", 30)],
                             ["id", "name", "age"])
  
print("Dataframe: ")
df.show()
  
# Convert dataframe to a Pandas dataframe
pandas_df = df.toPandas()
  
# Convert Pandas dataframe to a JSON string
json_data = pandas_df.to_json(orient='records')
  
# Print the JSON array
print("JSON array:", json_data)


Output:

Dataframe: 
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1| Joseph| 10|
|  2|   Jack| 20|
|  3|   Elon| 30|
+---+-------+---+

JSON array: [{"id":1,"name":"Joseph","age":10},{"id":2,"name":"Jack","age":20},{"id":3,"name":"Elon","age":30}]
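As a side note, if you want Python dictionaries instead of a JSON string, pandas also provides the to_dict() method (an optional alternative, reusing the pandas_df from the example above):

records = pandas_df.to_dict(orient='records')
# [{'id': 1, 'name': 'Joseph', 'age': 10}, {'id': 2, 'name': 'Jack', 'age': 20}, ...]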

Method 3: Using the write.json() method

In this method, we will use write.json() to write the JSON data to disk. This creates a directory called data.json that contains a set of part files with the JSON data, one JSON object per line (JSON Lines format) rather than a bracketed array. If we want the JSON data in a single file, we can first reduce the data frame to a single partition with the coalesce() method and then use write.json().
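For orientation, the on-disk layout typically looks like the sketch below (the exact part-file names vary from run to run and are shown here only as placeholders):

# data.json/
#   _SUCCESS
#   part-00000-<uuid>-c000.json
#   part-00001-<uuid>-c000.json
#   ...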

Stepwise implementation:

Step 1: First of all, import the required library, i.e., SparkSession. The SparkSession class is used to create the session.

from pyspark.sql import SparkSession

Step 2: Create a Spark session using the getOrCreate() function.

spark = SparkSession.builder.appName("MyApp").getOrCreate()

Step 3: Create a data frame with sample data.

df = spark.createDataFrame([(1, "Alice", 10),
                            (2, "Bob", 20),
                            (3, "Charlie", 30)], 
                            ["id", "name", "age"])

Step 4: Use write.json() to write the data frame. This creates the data.json directory of part files.

df.write.json('data.json')

Step 5: Finally, write a single-partition copy of the data frame so that all rows land in one part file.

df.coalesce(1).write.json('data_merged.json')
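Note that write.json() raises an error if the target path already exists, so re-running the script fails unless you pass a save mode (a standard DataFrameWriter option):

df.coalesce(1).write.mode('overwrite').json('data_merged.json')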

Example:

In this example, we created a data frame with three columns: id, name, and age. The code below writes the data frame to a directory of part files and then writes a single-partition copy so that all rows end up in one file. The output is finally stored under the directory named data_merged.json.

Python3

from pyspark.sql import SparkSession
  
# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()
  
# Create a DataFrame
df = spark.createDataFrame([(1, "Donald", 51),
                            (2, "Riya", 23), 
                            (3, "Vani", 22)], 
                           ["id", "name", "age"])
  
# Write dataframe to a JSON file
df.write.json('data.json')
  
# Merge JSON files into a single file
df.coalesce(1).write.json('data_merged.json')
  
# Printing data frame
df.show()


Output:

[{"id":1,"name":"Donald","age":51},
{"id":2,"name":"Riya","age":23},
{"id":3,"name":"Vani","age":22}]
