
PySpark – Adding a Column from a list of values using a UDF

A PySpark data frame is similar to a relational table in Spark SQL and can be created using various functions in SparkSession. You will often receive data as a Python list but need it as a column of an existing data frame. In that case, you can assign continuously increasing IDs to the data frame and then map the list values onto those IDs. Read on for the details.


Example 1:

In this example, we have created a data frame with three columns 'Roll_Number', 'Fees', and 'Fine' as follows:

 

Once created, we assigned continuously increasing IDs to the data frame using the monotonically_increasing_id function, wrapped in row_number so the IDs are consecutive. We also defined a list of values, student_names, which needs to be added as a column to the data frame. Then, with a UDF applied to the increasing IDs, we assigned the list values as a new column, and finally displayed the data frame after dropping the ID column.

Python3




# PySpark - Adding a Column from a
# list of values using a UDF
  
# Import the libraries SparkSession,
# functions, IntegerType,
# StringType, row_number,
# monotonically_increasing_id and Window
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Create a data frame with three columns 'Roll_Number', 'Fees' and 'Fine'
df = spark_session.createDataFrame(
    [(1, 10000, 400), (2, 14000, 500), (3, 12000, 800)],
    ['Roll_Number', 'Fees', 'Fine'])
  
# Define the list of elements to be added as a column
student_names = ['Aman', 'Ishita', 'Vinayak']
  
# Create a user-defined function that looks up a student name by
# row index (row numbers start at 1, list indices at 0)
labels_udf = F.udf(lambda indx: student_names[indx - 1], StringType())
  
# Create a column with continuously increasing IDs
df = df.withColumn("num_id", row_number().over(
    Window.orderBy(monotonically_increasing_id())))
  
# Create a new column by calling the user-defined function
new_df = df.withColumn('Names', labels_udf('num_id'))
  
# Drop the ID column and display the data frame
new_df.drop('num_id').show()


Output:

 

Example 2:

In this example, we have used a 5*5 data set (link) as follows:

 

Then, we assigned continuously increasing IDs to the data frame using the monotonically_increasing_id function. We also defined a list of values, fine_data, which needs to be added as a column to the data frame. Then, with a UDF applied to the increasing IDs, we assigned the list values as a new column, and finally displayed the data frame after dropping the ID column.

Python3




# PySpark - Adding a Column from a list of values using a UDF
  
# Import the libraries SparkSession, functions, IntegerType,
# StringType, row_number, monotonically_increasing_id and Window
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql.functions import row_number, monotonically_increasing_id
from pyspark.sql.window import Window
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Read the CSV file
df = spark_session.read.csv(
    '/content/student_data.csv', sep=',', inferSchema=True, header=True)
  
# Define the list of elements to be added as a column
fine_data = [200, 300, 400, 0, 500]
  
# Create a user-defined function that looks up a fine amount by
# row index (row numbers start at 1, list indices at 0)
labels_udf = F.udf(lambda indx: fine_data[indx - 1], IntegerType())
  
# Create a column with continuously increasing IDs
df = df.withColumn("num_id", row_number().over(
    Window.orderBy(monotonically_increasing_id())))
  
# Create a new column by calling the user-defined function
new_df = df.withColumn('fine', labels_udf('num_id'))
  
# Drop the ID column and display the data frame
new_df.drop('num_id').show()


Output:

 
