
How to Add Multiple Columns in PySpark DataFrames?

In this article, we will see different ways of adding multiple columns to PySpark DataFrames.

Let’s create a sample dataframe for demonstration:

Dataset Used: Cricket_data_set_odi

Python3
# importing the PySpark module
import pyspark

# importing SparkSession from the
# pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# create a DataFrame from the CSV file,
# treating the first row as the header
df = spark.read.option(
    "header", True).csv("Cricket_data_set_odi.csv")

# display the schema
df.printSchema()

# show the DataFrame
df.show()


Output:
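Note that with only the header option, every column is read as a string. Since the methods below do arithmetic on columns such as Runs, Matches and Wickets, it helps to let Spark infer numeric types when reading the file; a minimal sketch, assuming the same CSV file:

Python3

# re-read the CSV, asking Spark to infer column types so that
# numeric columns (e.g. Runs, Matches, Wickets) are not strings
df = spark.read.option("header", True) \
    .option("inferSchema", True) \
    .csv("Cricket_data_set_odi.csv")

# the schema should now show numeric types for those columns
df.printSchema()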

Method 1: Using withColumn()

withColumn() is used to add a new column or update an existing column on a DataFrame.

Syntax: df.withColumn(colName, col)

Returns: A new DataFrame by adding a column or replacing the existing column that has the same name.

Code:

Python3
# add two new columns by chaining withColumn() calls:
# Avg_runs is derived from existing columns, wkt+10 adds a constant
df.withColumn('Avg_runs', df.Runs / df.Matches) \
  .withColumn('wkt+10', df.Wickets + 10) \
  .show()


Output:
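On PySpark 3.3 and later, DataFrame.withColumns() accepts a dictionary mapping column names to expressions, so several columns can be added in a single call instead of chaining; a minimal sketch (older versions must chain withColumn() as shown above):

Python3

# add both columns in one call (requires PySpark 3.3+)
df.withColumns({
    'Avg_runs': df.Runs / df.Matches,
    'wkt+10': df.Wickets + 10
}).show()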

Method 2: Using select()

You can also add multiple columns using select().

Syntax: df.select(*cols)

Code:

Python3
# using select() to add multiple columns:
# keep all existing columns ('*') and append the new ones
df.select('*', (df.Runs / df.Matches).alias('Avg_runs'),
          (df.Wickets + 10).alias('wkt+10')).show()


Output:
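The same result can also be written with SQL expression strings through selectExpr(); a minimal sketch, assuming the same column names (the backticks quote wkt+10 because the name contains a + character):

Python3

# equivalent of the select() above using SQL expression strings
df.selectExpr('*',
              'Runs / Matches AS Avg_runs',
              'Wickets + 10 AS `wkt+10`').show()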

Method 3: Adding Multiple Constant Columns to a DataFrame Using withColumn() and select()

Let’s create new columns with constant values using the lit() SQL function in the code below. The lit() function in PySpark is used to add a new column to a DataFrame by assigning it a constant or literal value.

Python3
# importing lit() to create literal (constant) columns
from pyspark.sql.functions import lit

# add a constant column with select() and
# another constant column with withColumn()
df.select('*', lit("Cricket").alias("Sport")) \
    .withColumn("Fitness", lit("Good")) \
    .show()


Output:
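These approaches can be combined: a single select() can append both computed columns and lit() constants at the same time; a minimal sketch using the same DataFrame:

Python3

from pyspark.sql.functions import lit

# add one computed column and two constant columns in one select()
df.select('*',
          (df.Runs / df.Matches).alias('Avg_runs'),
          lit("Cricket").alias("Sport"),
          lit("Good").alias("Fitness")).show()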
