
PySpark – Adding a Column from a list of values using a UDF

A PySpark data frame is similar to a relational table in Spark SQL and can be created using various functions in SparkSession. You will often receive data as a Python list but need it as a column of an existing data frame. In that case, you can assign continuously increasing IDs to the data frame and then map the list values onto those IDs. Read on for the details.


Example 1:

In this example, we have created a data frame with three columns 'Roll_Number', 'Fees', and 'Fine' as follows:

 

Once created, we assigned continuously increasing IDs to the data frame using the monotonically_increasing_id function, wrapped in row_number so the IDs are consecutive. We also defined a list of values, student_names, which needs to be added as a column to the data frame. Then, with a UDF applied to the increasing IDs, we assigned the list values as a new column, and finally displayed the data frame after dropping the ID column.

Python3




# PySpark - Adding a Column from a
# list of values using a UDF
  
# Import the libraries SparkSession,
# functions, IntegerType,
# StringType, row_number,
# monotonically_increasing_id and Window
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Create a data frame with three columns 'Roll_Number', 'Fees' and 'Fine'
df = spark_session.createDataFrame(
    [(1, 10000, 400), (2, 14000, 500), (3, 12000, 800)],
    ['Roll_Number', 'Fees', 'Fine'])
  
# Define the list of elements to be added as a column
student_names = ['Aman', 'Ishita', 'Vinayak']
  
# Create a user-defined function that looks up a student name by
# row index (row numbers start at 1, list indices at 0)
labels_udf = F.udf(lambda indx: student_names[indx - 1], StringType())
  
# Create a column with continuously increasing IDs
df = df.withColumn("num_id", row_number().over(
    Window.orderBy(monotonically_increasing_id())))
  
# Create a new column by calling the user-defined function
new_df = df.withColumn('Names', labels_udf('num_id'))
  
# Drop the ID column and display the data frame
new_df.drop('num_id').show()


Output:

 

Example 2:

In this example, we have used a 5*5 data set (link) as follows:

 

Then, we assigned continuously increasing IDs to the data frame using the monotonically_increasing_id function. We also defined a list of values, fine_data, which needs to be added as a column to the data frame. Then, with a UDF applied to the increasing IDs, we assigned the list values as a new column, and finally displayed the data frame after dropping the ID column.

Python3




# PySpark - Adding a Column from a list of values using a UDF
  
# Import the libraries SparkSession, functions, IntegerType,
# StringType, row_number, monotonically_increasing_id and Window
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Read the CSV file
df = spark_session.read.csv(
    '/content/student_data.csv', sep=',', inferSchema=True, header=True)
  
# Define the list of elements to be added as a column
fine_data = [200, 300, 400, 0, 500]
  
# Create a user-defined function that looks up a fine amount by
# row index (row numbers start at 1, list indices at 0)
labels_udf = F.udf(lambda indx: fine_data[indx - 1], IntegerType())
  
# Create a column with continuously increasing IDs
df = df.withColumn("num_id", row_number().over(
    Window.orderBy(monotonically_increasing_id())))
  
# Create a new column by calling the user-defined function
new_df = df.withColumn('fine', labels_udf('num_id'))
  
# Drop the ID column and display the data frame
new_df.drop('num_id').show()


Output:

 
