Adding two columns to existing PySpark DataFrame using withColumn

In this article, we will see how to add two columns to an existing PySpark DataFrame using withColumn().

withColumn() can change the values of an existing column, convert its data type, or create a new column.

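Syntax: df.withColumn(colName, col)

Parameters: colName is the name of the new (or replaced) column, and col is a Column expression that produces its values.

Returns: A new DataFrame with the column added, or with the existing column of the same name replaced. The original DataFrame is not modified.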

Example 1: Creating a DataFrame and then adding two columns.

Here we create a DataFrame from a list of tuples.

Python3

# Create a spark session 
from pyspark.sql import SparkSession 
spark = SparkSession.builder.appName('SparkExamples').getOrCreate() 
  
# Create a spark dataframe 
columns = ["Name", "Course_Name", 
           "Months", 
           "Course_Fees", "Discount", 
           "Start_Date", "Payment_Done"] 
data = [ 
    ("Amit Pathak", "Python", 3, 10000, 1000, 
     "02-07-2021", True), 
    ("Shikhar Mishra", "Soft skills", 2, 
     8000, 800, "07-10-2021", False), 
    ("Shivani Suvarna", "Accounting", 6, 
     15000, 1500, "20-08-2021", True), 
    ("Pooja Jain", "Data Science", 12, 
     60000, 900, "02-12-2021", False), 
] 
  
df = spark.createDataFrame(data).toDF(*columns) 
  
# View the dataframe 
df.show() 


Now add the columns:

Here, we create two new columns based on the existing columns: After_discount is Course_Fees minus Discount, and Before_discount is a copy of Course_Fees.

Python3

new_df = (
    df.withColumn('After_discount', df.Course_Fees - df.Discount)
      .withColumn('Before_discount', df.Course_Fees)
)
new_df.show()


Example 2: Creating a DataFrame from a CSV file and then adding the columns.

Here we use the Cricket_data_set_odi.csv file as the dataset and create a DataFrame from it.

Creating the DataFrame for demonstration:

Python3

# import SparkSession from the pyspark.sql module
from pyspark.sql import SparkSession

# create a SparkSession and give the app a name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# create the DataFrame; inferSchema lets Spark detect numeric columns
# instead of reading every column as a string
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("Cricket_data_set_odi.csv"))

# Display Schema
df.printSchema()

# Show Dataframe
df.show()


Then, add the columns to the existing DataFrame. Hundred_run is Hundreds multiplied by 100, and Avg_run is Runs divided by Matches:

Python3

new_df = (
    df.withColumn('Hundred_run', df.Hundreds * 100)
      .withColumn('Avg_run', df.Runs / df.Matches)
)

new_df.show()

