Thursday, December 26, 2024

Adding two columns to existing PySpark DataFrame using withColumn

In this article, we will see how to add two columns to an existing PySpark DataFrame using withColumn().

withColumn() can change the values of an existing column, convert its datatype, or create a new column.

Syntax: df.withColumn(colName, col)

Returns: A new DataFrame in which the column has been added, or in which the existing column with the same name has been replaced.

Example 1: Creating a DataFrame and then adding two columns.

Here we create a DataFrame from a list of rows.

Python3




# Create a spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkExamples').getOrCreate()
  
# Create a spark dataframe
columns = ["Name", "Course_Name",
           "Months",
           "Course_Fees", "Discount",
           "Start_Date", "Payment_Done"]
data = [
    ("Amit Pathak", "Python", 3, 10000, 1000,
     "02-07-2021", True),
    ("Shikhar Mishra", "Soft skills", 2,
     8000, 800, "07-10-2021", False),
    ("Shivani Suvarna", "Accounting", 6,
     15000, 1500, "20-08-2021", True),
    ("Pooja Jain", "Data Science", 12,
     60000, 900, "02-12-2021", False),
]
  
df = spark.createDataFrame(data).toDF(*columns)
  
# View the dataframe
df.show()


Output:

Now add the columns:

Here, we create two new columns based on the existing columns.

Python3




new_df = (
    df.withColumn('After_discount', df.Course_Fees - df.Discount)
      .withColumn('Before_discount', df.Course_Fees)
)
new_df.show()


Output:

Example 2: Creating a DataFrame from a CSV file and then adding the columns.

Here we use the Cricket_data_set_odi.csv file as the dataset and create a DataFrame from it.

Creating Dataframe for demonstration:

Python3




# importing SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# create the DataFrame; header=True takes column names from the first row.
# Without a schema or the inferSchema option, every column is read as a string.
df = spark.read.option("header", True).csv("Cricket_data_set_odi.csv")
  
# Display Schema
df.printSchema()
  
# Show Dataframe
df.show()


Output:

Then, add the columns to the existing DataFrame:

Python3




new_df = (
    df.withColumn('Hundred_run', df.Hundreds * 100)
      .withColumn('Avg_run', df.Runs / df.Matches)
)
  
new_df.show()


Output:
