How to Get substring from a column in PySpark Dataframe ?

27 July 2024

1

In this article, we are going to see how to get the substring from the PySpark Dataframe column and how to create the new column and put the substring in that newly created column.

We can get the substring of the column using substring() and substr() function.

Syntax: substring(str,pos,len)

df.col_name.substr(start, length)

Parameter:

str – It can be string or name of the column from which we are getting the substring.

start and pos – Through this parameter we can give the starting position from where substring is start.

length and len – It is the length of the substring from the starting position.

Let’s create a dataframe.

Python3

# importing necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, substring
 
 
# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Substring.com") \
        .getOrCreate()
    return spk
 
def create_df(spark, data, schema):
 
    df1 = spark.createDataFrame(data, schema)
    return df1
 
if __name__ == "__main__":
 
    input_data = [("India", +91, 2701, 2020),
                  ("United States of America", +1, 1301, 2020),
                  ("Israel", +972, 3102, 2020),
                  ("Dubai", +971, 2901, 2020),
                  ("Russia", 7, 3101, 2020)]
 
    # calling function to create SparkSession
    spark = create_session()
 
    schema = ["Country", "Country Code",
              "Data", "Year"]
 
    # calling function to create dataframe
    df = create_df(spark, input_data, schema)
    df.show()

Output:

Example 1: Using substring() getting the substring and creating new column using withColumn() function.

Python

if __name__ == "__main__":
   
    # creating Month column and get the
    # substring from the Data column
    # creating Date column and get the
    # substring from the Data column
    df = df.withColumn(
      "Month", substring("Data", 1, 2)).withColumn(
      "Date", substring("Data", 3, 4))
 
    # dropping the Data column from the
    # Dataframe
    df = df.drop("Data")
 
    # printing Dataframe schema to get the
    # column names
    df.printSchema()
 
    # visualizing the dataframe
    df.show(truncate=False)

Output:

Example 2: Creating New_Country column by getting the substring using substr() function.

Python

if __name__ == "__main__":
 
    # Creating the new column New_Country
    # and store the substring using substr()
    df = df.withColumn("New_Country", df.Country.substr(0, 12))
 
    # printing Dataframe schema to get the
    # column names
    df.printSchema()
 
    # visualizing the dataframe
    df.show(truncate=False)

Output:

Example 3: Using substring() with select() function.

Python

if __name__ == "__main__":
 
    input_data = [("India", +91, "AidanButler"),
                  ("United States of America", +1, "ConerFlores"),
                  ("Israel", +972, "RosseBryant"),
                  ("Dubai", +971, "JuliaSimmon"),
                  ("Russia", 7, "AliceBailey")]
 
    # calling function to create SparkSession
    spark = create_session()
 
    schema = ["Country", "Country Code", "Name"]
 
    # calling function to create dataframe
    df = create_df(spark, input_data, schema)
 
    # Selecting the column using select()
    # function and getting substring
    # using substring()
    df2 = df.select('Name', substring('Name',
                                      1, 5).alias('First Name'),
                    substring('Name', 6, 6).alias('Last Name'))
 
    # printing Dataframe schema to get the column names
    df2.printSchema()
 
    # visualizing the dataframe
    df2.show(truncate=False)

Output:

Example 4: Using substring() with selectExpr() function.

Python

if __name__ == "__main__":
 
    input_data = [("India", +91, "AidanButler"),
                  ("United States of America", +1, "ConerFlores"),
                  ("Israel", +972, "RosseBryant"),
                  ("Dubai", +971, "JuliaSimmon"),
                  ("Russia", 7, "AliceBailey")]
 
    # calling function to create SparkSession
    spark = create_session()
 
    schema = ["Country", "Country Code", "Name"]
 
    # calling function to create dataframe
    df = create_df(spark, input_data, schema)
 
    # Selecting the column using selectExpr()
    # function and getting substring using substring()
    df2 = df.selectExpr('Name', 'substring(Name, 1,5) as First_Name',
                        'substring(Name, 6,6) as Last_Name')
 
    # printing Dataframe schema to get the column names
    df2.printSchema()
 
    # visualizing the dataframe
    df2.show(truncate=False)

Output:

How to Get substring from a column in PySpark Dataframe ?

Python3

Python

Python

Python

Python

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

Sticky Password vs. LastPass 2024: Which Is Better? by Katarina Glamoslija

Galaxy S25 on-device AI capability expands, reducing reliance on the cloud

OnePlus 13R launches with a huge battery upgrade, starting in China

This is my surprise phone of the year [Video]

Recent Comments

EDITOR PICKS

Sticky Password vs. LastPass 2024: Which Is Better? by Katarina Glamoslija

Galaxy S25 on-device AI capability expands, reducing reliance on the cloud

OnePlus 13R launches with a huge battery upgrade, starting in China

POPULAR POSTS

Sticky Password vs. LastPass 2024: Which Is Better? by Katarina Glamoslija

Galaxy S25 on-device AI capability expands, reducing reliance on the cloud

OnePlus 13R launches with a huge battery upgrade, starting in China

POPULAR CATEGORY

ABOUT US

FOLLOW US