In this article, we are going to learn how to create multiple lags on a data frame using Pyspark in Python.
What is lag in Pyspark?
The lag function lets a query look back across the rows of a table: for the current row, it returns the value from a previous row within the window. For example, if a column holds the values 100, 200, 300, a lag of one row yields null, 100, 200. Have you ever needed to create multiple lags in Pyspark but didn't know how to achieve this for the Pyspark data frame? Don't worry, just read the article further to learn how to create multiple lags in Pyspark.
Syntax: data_frame = data_frame.withColumn(column_name, func.lag(data_frame[previous_column_name], lag_by_times).over(Windowspec))
Here,
- previous_column_name: It represents the name of the existing column on which the lag is applied.
- column_name: It represents the name of the new column that stores the lagged values.
- lag_by_times: It represents the number of rows by which to lag the column. By default, its value is 1.
Note: You can repeat this syntax as many times as the number of lags you want in the Pyspark program.
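As a minimal sketch of this syntax, a single lag of one row could look like the following. Here 'sales' and 'month' are placeholder column names, and data_frame is assumed to be an already-loaded Pyspark data frame:

from pyspark.sql.window import Window
import pyspark.sql.functions as func

# Placeholder ordering column; any column of the data frame can be used
Windowspec = Window.orderBy("month")

# Lag the placeholder 'sales' column by 1 row into a new column
data_frame = data_frame.withColumn("Previous Sales",
                                   func.lag(data_frame["sales"], 1).over(Windowspec))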
How to create multiple lags in Pyspark?
Step 1: First of all, import the required libraries, i.e. SparkSession, Window, and functions. The SparkSession library is used to create the session.
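These imports, as used in the examples below, look like this:

# Import the SparkSession, Window and functions libraries
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as func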
Step 2: Now, create a spark session using the getOrCreate function.
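For example:

# Create a spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()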
Step 3: Read the CSV file and display it to see if it is correctly uploaded.
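Using the CSV path from the examples below:

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data1.csv',
                                    sep=',', inferSchema=True, header=True)

# Display the data frame to verify that it was read correctly
data_frame.show()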
Step 4: Order the data by a column of your choice using the Window function.
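In the examples below, the rows are ordered by the 'class' column:

# Order the data by the 'class' column in the Window function
Windowspec = Window.orderBy("class")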
Step 5: Further, create a loop that runs 'n' times, once for each lag you want, putting the result in a new column each time. In this step, we will define the previous column name for each loop iteration, i.e., the column on which the lag will be applied.
Step 6: Moreover, create the new column name for storing the lagged data for each loop iteration.
Step 7: Calculate the lag of the previous column by 1 on each iteration and store it in the corresponding new column, as shown in the sketch below.
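A minimal sketch of steps 5 to 7, reusing the 'Updated Fine' column names from Example 2 below:

# Apply the first lag outside the loop to create 'Updated Fine 1'
data_frame = data_frame.withColumn('Updated Fine 1',
                                   func.lag(data_frame['fine']).over(Windowspec))

# Loop to create one more lagged column per iteration
for i in range(1, 4):
    # Previous column name on which the lag is applied in this iteration
    previous_column_name = 'Updated Fine ' + str(i)
    # New column name that stores the lagged data for this iteration
    column_name = 'Updated Fine ' + str(i + 1)
    # Lag the previous column by 1 row and store it in the new column
    data_frame = data_frame.withColumn(column_name,
                                       func.lag(data_frame[previous_column_name]).over(Windowspec))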
Step 8: Finally, display the lagged data frame outside the loop.
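As in the examples below:

# Display the lagged data frame after the loop has finished
data_frame.show()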
Example 1:
In this example, we have used the data set (link), i.e., a dataset of 5×5, on which we applied lag two times. The first lag shifted the 'fine' column by one row and created the new column 'Updated Fine 1', while the second lag shifted 'Updated Fine 1' by two rows and created the new column 'Updated Fine 2'.
Python3
# Python program to create multiple lags in Pyspark

# Import the SparkSession, Window and functions libraries
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as func

# Create a spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data1.csv',
                                    sep=',', inferSchema=True, header=True)

# Order the data by the 'class' column in the Window function
Windowspec = Window.orderBy("class")

# Calculate the lag of fine by 1 for each student and
# put it in the new column 'Updated Fine 1'
updated_fine_1 = data_frame.withColumn('Updated Fine 1',
                                       func.lag(data_frame['fine']).over(Windowspec))

# Calculate the lag of 'Updated Fine 1' by 2 for each student and
# put it in the new column 'Updated Fine 2'
updated_fine_2 = updated_fine_1.withColumn('Updated Fine 2',
                                           func.lag(updated_fine_1['Updated Fine 1'], 2).over(Windowspec))

# Display the lagged updated fine
updated_fine_2.show()
Output:
Example 2:
In this example, we have used the data set (link), i.e., a dataset of 5×5, on which we applied lag four times: once directly and three more times through a for loop. Each lag shifts the previous column by one row and stores the result in a new column, 'Updated Fine 1', 'Updated Fine 2', and so on up to 'Updated Fine 4'.
Python3
# Python program to create multiple
# lags in Pyspark using loops

# Import the SparkSession, Window and functions libraries
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as func

# Create a spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/student_data1.csv',
                                    sep=',', inferSchema=True, header=True)

# Order the data by the 'class' column in the Window function
Windowspec = Window.orderBy("class")

# Calculate the lag of fine by 1 for each student and put it
# in the new column 'Updated Fine 1'
data_frame = data_frame.withColumn('Updated Fine 1',
                                   func.lag(data_frame['fine']).over(Windowspec))

# Loop 3 times to get a new lag each time and put it in a new column
for i in range(1, 4):

    # Get the previous column name for this loop iteration,
    # on which the lag will be applied
    previous_column_name = 'Updated Fine ' + str(i)

    # Create the new column name for this loop iteration
    column_name = 'Updated Fine ' + str(i + 1)

    # Calculate the lag of the previous column by 1 for each student
    # and put it in the new column
    data_frame = data_frame.withColumn(column_name,
                                       func.lag(data_frame[previous_column_name]).over(Windowspec))

# Display the lagged updated fine
data_frame.show()
Output: