How to create an empty PySpark DataFrame?

28 July 2024

In this article, we are going to see how to create an empty PySpark DataFrame. An empty PySpark DataFrame is a DataFrame that contains no rows; its schema may be empty or may define columns.

Creating an empty RDD without schema

We’ll first create an empty DataFrame from an empty RDD, using an empty schema.

The emptyRDD() method creates an RDD without any data.
The createDataFrame() method creates a PySpark DataFrame from the specified data and schema.

Code:

Python3

# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import *
 
# Create a spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()
 
# Create an empty RDD
emp_RDD = spark.sparkContext.emptyRDD()
 
# Create empty schema
columns = StructType([])
 
# Create an empty DataFrame from the empty RDD with an empty schema
data = spark.createDataFrame(data = emp_RDD,
                             schema = columns)
 
# Print the dataframe
print('Dataframe :')
data.show()
 
# Print the schema
print('Schema :')
data.printSchema()

Output:

Dataframe :
++
||
++
++

Schema :
root

Creating an empty RDD with schema

It is possible that we will not get a file for processing. However, we must still manually create a DataFrame with the appropriate schema.

Specify the schema of the DataFrame as a StructType named columns, with the string fields 'Name', 'Age' and 'Gender'.
Create an empty DataFrame from an empty RDD with the expected schema.

Code:

Python3

# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import *
 
# Create a spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()
 
# Create an empty RDD
emp_RDD = spark.sparkContext.emptyRDD()
 
# Create an expected schema
columns = StructType([StructField('Name', StringType(), True),
                      StructField('Age', StringType(), True),
                      StructField('Gender', StringType(), True)])
 
# Create an empty DataFrame from the empty RDD with the expected schema
df = spark.createDataFrame(data = emp_RDD,
                           schema = columns)
 
# Print the dataframe
print('Dataframe :')
df.show()
 
# Print the schema
print('Schema :')
df.printSchema()

Output:

Dataframe :
+----+---+------+
|Name|Age|Gender|
+----+---+------+
+----+---+------+

Schema :
root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Gender: string (nullable = true)

Creating an empty dataframe without schema

Create an empty schema named columns.
Pass an empty list ([]) as data and columns as schema to the createDataFrame() method.

Code:

Python3

# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import *
 
# Create a spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()
 
# Create an empty schema
columns = StructType([])
 
# Create an empty dataframe with empty schema
df = spark.createDataFrame(data = [],
                           schema = columns)
 
# Print the dataframe
print('Dataframe :')
df.show()
 
# Print the schema
print('Schema :')
df.printSchema()

Output:

Dataframe :
++
||
++
++

Schema :
root

Creating an empty dataframe with schema

Specify the schema of the DataFrame as a StructType named columns, with the string fields 'Name', 'Age' and 'Gender'.
Pass an empty list ([]) as data and columns as schema to the createDataFrame() method.

Code:

Python3

# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import *
 
# Create a spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()
 
# Create an expected schema
columns = StructType([StructField('Name', StringType(), True),
                      StructField('Age', StringType(), True),
                      StructField('Gender', StringType(), True)])
 
# Create an empty DataFrame with the expected schema
df = spark.createDataFrame(data = [],
                           schema = columns)
 
# Print the dataframe
print('Dataframe :')
df.show()
 
# Print the schema
print('Schema :')
df.printSchema()

Output:

Dataframe :
+----+---+------+
|Name|Age|Gender|
+----+---+------+
+----+---+------+

Schema :
root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Gender: string (nullable = true)
