In this article, we will see how to get a substring from a PySpark DataFrame column and how to store that substring in a newly created column.
We can get the substring of a column using the substring() and substr() functions.
Syntax:
- substring(str, pos, len)
- df.col_name.substr(start, length)
Parameters:
- str – the string or the name of the column from which the substring is taken.
- start and pos – the starting position of the substring; positions are 1-based, so 1 refers to the first character.
- length and len – the length of the substring, counted from the starting position.
A minimal sketch contrasting the two forms is shown below.
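This is only a rough sketch, not part of the examples that follow: it assumes a local SparkSession and a hypothetical one-column DataFrame named demo, just to show that both calls use 1-based positions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

spark = SparkSession.builder.master("local").appName("substring-demo").getOrCreate()
demo = spark.createDataFrame([("20200127",)], ["raw"])

# substring(str, pos, len): positions start at 1, so (1, 4) takes the first four characters
demo.select(substring("raw", 1, 4).alias("year")).show()

# Column.substr(start, length): same idea, called on the column object itself
demo.select(demo.raw.substr(5, 2).alias("month")).show()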
Let's create a DataFrame.
Python3
# importing necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring

# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Substring.com") \
        .getOrCreate()
    return spk

# function to create a DataFrame from data and schema
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

if __name__ == "__main__":
    input_data = [("India", +91, 2701, 2020),
                  ("United States of America", +1, 1301, 2020),
                  ("Israel", +972, 3102, 2020),
                  ("Dubai", +971, 2901, 2020),
                  ("Russia", 7, 3101, 2020)]

    # calling function to create SparkSession
    spark = create_session()

    schema = ["Country", "Country Code", "Data", "Year"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schema)

    df.show()
Output:
Example 1: Using substring() to get the substring and creating a new column with the withColumn() function.
Python
if __name__ == "__main__":
    # creating Month column and getting the
    # substring from the Data column;
    # creating Date column and getting another
    # substring from the Data column
    df = df.withColumn("Month", substring("Data", 1, 2)) \
           .withColumn("Date", substring("Data", 3, 4))

    # dropping the Data column from the Dataframe
    df = df.drop("Data")

    # printing Dataframe schema to get the column names
    df.printSchema()

    # visualizing the dataframe
    df.show(truncate=False)
Output:
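A small follow-up, not part of the original example: substring() always returns a string column, so Month and Date come out as strings even though Data was numeric. If integer columns are preferred, a cast can be chained on. A minimal sketch, assuming the df as built in the setup code (i.e. before the Data column is dropped):

from pyspark.sql.functions import substring

# same extraction as Example 1, but cast the results back to integers
df_int = df.withColumn("Month", substring("Data", 1, 2).cast("int")) \
           .withColumn("Date", substring("Data", 3, 4).cast("int"))
df_int.printSchema()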
Example 2: Creating a New_Country column from the substring obtained with the substr() function.
Python
if __name__ == "__main__":
    # creating the new column New_Country and
    # storing the substring using substr()
    df = df.withColumn("New_Country", df.Country.substr(0, 12))

    # printing Dataframe schema to get the column names
    df.printSchema()

    # visualizing the dataframe
    df.show(truncate=False)
Output:
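A variation worth noting, hedged as an assumption rather than part of the original example: Column.substr() also accepts Column arguments, so the start position and length can be computed per row. A rough sketch, assuming the same df with a Country column:

from pyspark.sql.functions import length, lit

# drop the first character of Country: start at position 2,
# take the remaining length of the string
df_dyn = df.withColumn("Country_Tail",
                       df.Country.substr(lit(2), length(df.Country) - lit(1)))
df_dyn.show(truncate=False)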
Example 3: Using substring() with the select() function.
Python
if __name__ == "__main__":
    input_data = [("India", +91, "AidanButler"),
                  ("United States of America", +1, "ConerFlores"),
                  ("Israel", +972, "RosseBryant"),
                  ("Dubai", +971, "JuliaSimmon"),
                  ("Russia", 7, "AliceBailey")]

    # calling function to create SparkSession
    spark = create_session()

    schema = ["Country", "Country Code", "Name"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schema)

    # selecting the column using the select() function
    # and getting the substring using substring()
    df2 = df.select('Name',
                    substring('Name', 1, 5).alias('First Name'),
                    substring('Name', 6, 6).alias('Last Name'))

    # printing Dataframe schema to get the column names
    df2.printSchema()

    # visualizing the dataframe
    df2.show(truncate=False)
Output:
Example 4: Using substring() with the selectExpr() function.
Python
if __name__ == "__main__":
    input_data = [("India", +91, "AidanButler"),
                  ("United States of America", +1, "ConerFlores"),
                  ("Israel", +972, "RosseBryant"),
                  ("Dubai", +971, "JuliaSimmon"),
                  ("Russia", 7, "AliceBailey")]

    # calling function to create SparkSession
    spark = create_session()

    schema = ["Country", "Country Code", "Name"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schema)

    # selecting the column using the selectExpr() function
    # and getting the substring using substring()
    df2 = df.selectExpr('Name',
                        'substring(Name, 1, 5) as First_Name',
                        'substring(Name, 6, 6) as Last_Name')

    # printing Dataframe schema to get the column names
    df2.printSchema()

    # visualizing the dataframe
    df2.show(truncate=False)
Output:
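As a closing note, and only as a sketch under the assumption that the df from Example 4 is still in scope: the same SQL-style substring expressions used with selectExpr() can also be paired with withColumn() through expr(), which keeps all existing columns while adding the derived ones.

from pyspark.sql.functions import expr

# same SQL substring expressions, but added as new columns with withColumn()
df3 = df.withColumn("First_Name", expr("substring(Name, 1, 5)")) \
        .withColumn("Last_Name", expr("substring(Name, 6, 6)"))
df3.show(truncate=False)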