In this article, we are going to see how to append data to an empty DataFrame in PySpark in the Python programming language.
Method 1: Make an empty DataFrame and make a union with a non-empty DataFrame with the same schema
The union() function is the key to this operation. It combines two DataFrames that have the same column schema.
Syntax : FirstDataFrame.union(SecondDataFrame)
Returns : DataFrame with rows of both DataFrames.
Example:
In this example, we create a DataFrame with a particular schema and data, create an empty DataFrame with the same schema, and perform a union of the two DataFrames using the union() function.
Python
# Importing PySpark and the SparkSession,
# DataType functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Creating a spark session
spark_session = SparkSession.builder.appName(
    'Spark_Session').getOrCreate()

# Creating an empty RDD to make a
# DataFrame with no data
emp_RDD = spark_session.sparkContext.emptyRDD()

# Defining the schema of the DataFrame
columns1 = StructType([StructField('Name', StringType(), False),
                       StructField('Salary', IntegerType(), False)])

# Creating an empty DataFrame
first_df = spark_session.createDataFrame(data=emp_RDD,
                                         schema=columns1)

# Printing the DataFrame with no data
first_df.show()

# Hardcoded data for the second DataFrame
rows = [['Ajay', 56000], ['Srikanth', 89078],
        ['Reddy', 76890], ['Gursaidutt', 98023]]
columns = ['Name', 'Salary']

# Creating the DataFrame
second_df = spark_session.createDataFrame(rows, columns)

# Printing the non-empty DataFrame
second_df.show()

# Storing the union of first_df and
# second_df in first_df
first_df = first_df.union(second_df)

# Our first DataFrame that was empty,
# now has data
first_df.show()
Output :
+----+------+
|Name|Salary|
+----+------+
+----+------+

+----------+------+
|      Name|Salary|
+----------+------+
|      Ajay| 56000|
|  Srikanth| 89078|
|     Reddy| 76890|
|Gursaidutt| 98023|
+----------+------+

+----------+------+
|      Name|Salary|
+----------+------+
|      Ajay| 56000|
|  Srikanth| 89078|
|     Reddy| 76890|
|Gursaidutt| 98023|
+----------+------+
Method 2: Add a single row to an empty DataFrame by converting the row into a DataFrame
We can use createDataFrame() to convert a single row, given as a Python list, into a DataFrame. The details of createDataFrame() are :
Syntax : CurrentSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
Parameters :
- data : RDD or iterable: The data from which the DataFrame is created, such as a list of rows.
- schema : str/list, optional: A string, a list of column names, or a StructType describing the schema.
- samplingRatio : float, optional: The ratio of rows sampled when inferring the schema.
- verifySchema : bool, optional: Verify the data types of every row against the specified schema. The value is True by default.
Example:
In this example, we create an empty DataFrame with a particular schema using createDataFrame(), create a second DataFrame holding a single row, combine the two DataFrames using the union() function, store the result back in the originally empty DataFrame, and use show() to see the changes.
Python
# Importing PySpark and the SparkSession,
# DataType functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Creating a spark session
spark_session = SparkSession.builder.appName(
    'Spark_Session').getOrCreate()

# Creating an empty RDD to make a DataFrame
# with no data
emp_RDD = spark_session.sparkContext.emptyRDD()

# Defining the schema of the DataFrame
columns = StructType([StructField('Stadium', StringType(), False),
                      StructField('Capacity', IntegerType(), False)])

# Creating an empty DataFrame
df = spark_session.createDataFrame(data=emp_RDD,
                                   schema=columns)

# Printing the DataFrame with no data
df.show()

# Hardcoded row for the second DataFrame
added_row = [['Motera Stadium', 132000]]

# Creating the DataFrame
added_df = spark_session.createDataFrame(added_row, columns)

# Storing the union of df and added_df in df
df = df.union(added_df)

# Our first DataFrame that was empty,
# now has data
df.show()
Output :
+-------+--------+
|Stadium|Capacity|
+-------+--------+
+-------+--------+

+--------------+--------+
|       Stadium|Capacity|
+--------------+--------+
|Motera Stadium|  132000|
+--------------+--------+
Method 3: Convert the empty DataFrame into a Pandas DataFrame and use the append() function
We will use toPandas() to convert PySpark DataFrame to Pandas DataFrame. Its syntax is :
Syntax : PySparkDataFrame.toPandas()
Returns : Corresponding Pandas DataFrame
We will then use the Pandas append() function. Its syntax is :
Syntax : PandasDataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)
Parameters :
- other : Pandas DataFrame, pandas Series, etc.: The data that has to be appended.
- ignore_index : bool: If True, the indexes of the inputs are ignored and the resulting DataFrame gets a fresh index with no relation to the older ones.
- sort : bool: Sort the columns if the column alignment of other and PandasDataFrame differs.
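Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer pandas versions pd.concat() is the replacement. A minimal sketch of the same appending step using plain pandas (the data here is illustrative):

```python
import pandas as pd

# An empty DataFrame with just a schema, and the row to add
empty_df = pd.DataFrame(columns=['Stadium', 'Capacity'])
added_df = pd.DataFrame([['Motera Stadium', 132000]],
                        columns=['Stadium', 'Capacity'])

# pd.concat() replaces append(); ignore_index=True gives the
# result a fresh 0..n-1 index, just like append(ignore_index=True)
result = pd.concat([empty_df, added_df], ignore_index=True)
print(result)
```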
Example:
Here we create an empty DataFrame to which data is to be added. We then convert the data to be added into a Spark DataFrame using createDataFrame(), convert both DataFrames to Pandas DataFrames using toPandas(), and use the append() function to add the non-empty DataFrame to the empty one, ignoring the indexes since we are building a new DataFrame. Finally, we convert the resulting Pandas DataFrame back to a Spark DataFrame using createDataFrame().
Python
# Importing PySpark and the SparkSession,
# DataType functionality
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Creating a spark session
spark_session = SparkSession.builder.appName(
    'Spark_Session').getOrCreate()

# Creating an empty RDD to make a DataFrame
# with no data
emp_RDD = spark_session.sparkContext.emptyRDD()

# Defining the schema of the DataFrame
columns = StructType([StructField('Stadium', StringType(), False),
                      StructField('Capacity', IntegerType(), False)])

# Creating an empty DataFrame
df = spark_session.createDataFrame(data=emp_RDD,
                                   schema=columns)

# Printing the DataFrame with no data
df.show()

# Hardcoded row for the second DataFrame
added_row = [['Motera Stadium', 132000]]

# Creating the DataFrame whose data
# needs to be added
added_df = spark_session.createDataFrame(added_row, columns)

# Converting our PySpark DataFrames to
# Pandas DataFrames
pandas_added = added_df.toPandas()
df = df.toPandas()

# Using append() function to add the data
df = df.append(pandas_added, ignore_index=True)

# Reconverting our DataFrame back
# to a PySpark DataFrame
df = spark_session.createDataFrame(df)

# Printing resultant DataFrame
df.show()
Output :
+-------+--------+
|Stadium|Capacity|
+-------+--------+
+-------+--------+

+--------------+--------+
|       Stadium|Capacity|
+--------------+--------+
|Motera Stadium|  132000|
+--------------+--------+