How to create an empty PySpark DataFrame?

28 July 2024

In this article, we will see how to create an empty PySpark DataFrame. An empty PySpark DataFrame is a DataFrame that contains no rows; it may or may not have a schema that defines its columns.

Creating an empty DataFrame from an empty RDD without a schema

We first create an empty RDD and then convert it to a DataFrame by passing an empty schema.

The emptyRDD() method creates an RDD that contains no data.
The createDataFrame() method creates a PySpark DataFrame from the given data and schema.

Code:

Python3

# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
 
# Create a spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()
 
# Create an empty RDD
emp_RDD = spark.sparkContext.emptyRDD()
 
# Create empty schema
columns = StructType([])
 
# Create an empty DataFrame from the empty RDD and the empty schema
data = spark.createDataFrame(data=emp_RDD,
                             schema=columns)
 
# Print the dataframe
print('Dataframe :')
data.show()
 
# Print the schema
print('Schema :')
data.printSchema()

Output:

Dataframe :
++
||
++
++

Schema :
root

Creating an empty DataFrame from an empty RDD with a schema

Sometimes the input file we expect to process is missing or empty. Downstream code may still need a DataFrame with the expected columns, so we create one manually with the appropriate schema.

Specify the schema of the DataFrame with the columns 'Name', 'Age' and 'Gender'.
Create a DataFrame from an empty RDD with the expected schema.

Code:

Python3

# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
 
# Create a spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()
 
# Create an empty RDD
emp_RDD = spark.sparkContext.emptyRDD()
 
# Create an expected schema
columns = StructType([StructField('Name',
                                  StringType(), True),
                    StructField('Age',
                                StringType(), True),
                    StructField('Gender',
                                StringType(), True)])
 
# Create an empty DataFrame from the empty RDD and the expected schema
df = spark.createDataFrame(data=emp_RDD,
                           schema=columns)
 
# Print the dataframe
print('Dataframe :')
df.show()
 
# Print the schema
print('Schema :')
df.printSchema()

Output:

Dataframe :
+----+---+------+
|Name|Age|Gender|
+----+---+------+
+----+---+------+

Schema :
root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Gender: string (nullable = true)

Creating an empty DataFrame without a schema

Create an empty schema named columns.
Pass an empty list ([]) as data and columns as schema to the createDataFrame() method.

Code:

Python3

# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType
 
# Create a spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()
 
# Create an empty schema
columns = StructType([])
 
# Create an empty dataframe with empty schema
df = spark.createDataFrame(data = [],
                           schema = columns)
 
# Print the dataframe
print('Dataframe :')
df.show()
 
# Print the schema
print('Schema :')
df.printSchema()

Output:

Dataframe :
++
||
++
++

Schema :
root

Creating an empty DataFrame with a schema

Specify the schema of the DataFrame with the columns 'Name', 'Age' and 'Gender'.
Pass an empty list ([]) as data and columns as schema to the createDataFrame() method.

Code:

Python3

# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
 
# Create a spark session
spark = SparkSession.builder.appName('Empty_Dataframe').getOrCreate()
 
# Create an expected schema
columns = StructType([StructField('Name',
                                  StringType(), True),
                    StructField('Age',
                                StringType(), True),
                    StructField('Gender',
                                StringType(), True)])
 
# Create a dataframe with expected schema
df = spark.createDataFrame(data = [],
                           schema = columns)
 
# Print the dataframe
print('Dataframe :')
df.show()
 
# Print the schema
print('Schema :')
df.printSchema()

Output:

Dataframe :
+----+---+------+
|Name|Age|Gender|
+----+---+------+
+----+---+------+

Schema :
root
 |-- Name: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Gender: string (nullable = true)
