In this article, we are going to see how to add a column with a literal (constant) value to a PySpark DataFrame.

Creating a DataFrame for demonstration:
Python3
# import SparkSession from pyspark
from pyspark.sql import SparkSession

# build the SparkSession with the app name "lit_value"
spark = SparkSession.builder.appName("lit_value").getOrCreate()

# create the Spark dataframe with columns A, B
data = spark.createDataFrame([('x', 5), ('Y', 3), ('Z', 5)], ['A', 'B'])

# show the schema and the table
data.printSchema()
data.show()
Output:
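Running the code above prints output along these lines (schema first, then the table):

root
 |-- A: string (nullable = true)
 |-- B: long (nullable = true)

+---+---+
|  A|  B|
+---+---+
|  x|  5|
|  Y|  3|
|  Z|  5|
+---+---+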
Method 1: Using the lit() function

Here we add the constant column 'literal_values_1' by using the select() method. The lit() function inserts the same constant value into every row.

Call select() on the DataFrame, passing '*' (or a list of column names) as the first argument to keep the existing columns, and as the second argument pass lit() with the constant value, aliased to the new column name.
Python3
# import the lit() function from pyspark.sql.functions
from pyspark.sql.functions import lit

# select all columns from the data table and add a new column
# 'literal_values_1' holding the constant "1"
# (note: lit("1") creates a string column; use lit(1) for an integer)
df2 = data.select('*', lit("1").alias("literal_values_1"))

# show the schema and the updated table
df2.printSchema()
df2.show()
Output:
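Since lit("1") is passed a string, the new column has string type; the output should look roughly like this:

root
 |-- A: string (nullable = true)
 |-- B: long (nullable = true)
 |-- literal_values_1: string (nullable = false)

+---+---+----------------+
|  A|  B|literal_values_1|
+---+---+----------------+
|  x|  5|               1|
|  Y|  3|               1|
|  Z|  5|               1|
+---+---+----------------+

The same column can also be added with withColumn(), e.g. data.withColumn('literal_values_1', lit(1)), which yields an integer column.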
Method 2: Using a SQL query

In this method, we first create a temporary view of the table with createOrReplaceTempView(). The lifetime of this view is tied to the lifetime of the SparkSession; createOrReplaceTempView() creates the view if it does not exist and replaces it if it does.

After creating the view, select from it with a SQL query that adds the literal as a new column. The type of the new column follows the type of the SQL literal (here the literal 2 is an integer, so the column has integer type).
Python3
# create a temporary view of the table named "temp"
df2.createOrReplaceTempView("temp")

# select all columns and rows from the temp view and add a new
# column 'literal_values_2' with the constant value 2
df2 = spark.sql("select *, 2 as literal_values_2 from temp")

# show the schema and the updated table
df2.printSchema()
df2.show()
Output:
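The SQL literal 2 produces an integer column, so the output should look roughly like this:

root
 |-- A: string (nullable = true)
 |-- B: long (nullable = true)
 |-- literal_values_1: string (nullable = false)
 |-- literal_values_2: integer (nullable = false)

+---+---+----------------+----------------+
|  A|  B|literal_values_1|literal_values_2|
+---+---+----------------+----------------+
|  x|  5|               1|               2|
|  Y|  3|               1|               2|
|  Z|  5|               1|               2|
+---+---+----------------+----------------+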
Method 3: Using a UDF (User-Defined Function)

A UDF lets us define our own function as per our requirements, which is why it is called a user-defined function. Here we declare the return datatype of the UDF and define a function that returns the constant value, which is then used to populate a new column.
Python3
# import udf from pyspark
from pyspark.sql.functions import udf

# declare the return type of the UDF as integer
@udf("int")
def lit_col():
    # return the literal value 3 for every row
    return 3

# create a new column 'literal_values_3' with the value 3
df2 = df2.withColumn('literal_values_3', lit_col())

# show the schema and the updated table
df2.printSchema()
df2.show()
Output:
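Because the UDF is declared with an "int" return type (and UDF results are nullable by default, since a UDF may return None), the output should look roughly like this:

root
 |-- A: string (nullable = true)
 |-- B: long (nullable = true)
 |-- literal_values_1: string (nullable = false)
 |-- literal_values_2: integer (nullable = false)
 |-- literal_values_3: integer (nullable = true)

+---+---+----------------+----------------+----------------+
|  A|  B|literal_values_1|literal_values_2|literal_values_3|
+---+---+----------------+----------------+----------------+
|  x|  5|               1|               2|               3|
|  Y|  3|               1|               2|               3|
|  Z|  5|               1|               2|               3|
+---+---+----------------+----------------+----------------+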