pyspark.sql.functions provides a split() function, which is used to split a DataFrame string column into multiple columns.
Syntax: pyspark.sql.functions.split(str, pattern, limit=-1)
Parameters:
- str: the Column (or column name as a str) to split.
- pattern: a str parameter containing the expression to split on. This should be a Java regular expression.
- limit: an optional int parameter that, when specified, controls the number of times the pattern is applied (see the sketch after this list).
- limit > 0: the resulting array's length will not be more than the specified limit.
- limit <= 0: the pattern is applied as many times as possible, and the resulting array can be of any length.
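To make the limit behavior concrete, here is a minimal sketch (the session and app name are created here just for the demo; the article builds its own session below) contrasting a positive limit with the default of -1:

Python3

# a self-contained demo of the limit parameter of split()
from pyspark.sql import SparkSession
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("split-limit-demo").getOrCreate()

demo = spark.createDataFrame([('1991-04-01',)], ['DOB'])

# limit=2: the pattern is applied at most once,
# so the array has at most 2 elements: ['1991', '04-01']
demo.select(split(demo['DOB'], '-', 2).alias('parts')).show(truncate=False)

# limit=-1 (the default): the pattern is applied as many times
# as possible, giving ['1991', '04', '01']
demo.select(split(demo['DOB'], '-', -1).alias('parts')).show(truncate=False)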
First, let's create a DataFrame.
Python3
# installing pyspark
!pip install pyspark

# importing pyspark
import pyspark

# importing SparkSession
from pyspark.sql import SparkSession

# importing all from pyspark.sql.functions
from pyspark.sql.functions import *

# creating SparkSession object
spark = SparkSession.builder.appName("sparkdf").getOrCreate()

# creating the row data for the dataframe
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
        ('Milan', 'Sharma', 'M', '2000-05-19'),
        ('Rohit', 'Verma', 'M', '1978-09-05'),
        ('Maria', 'Anne', 'F', '1967-12-01'),
        ('Jay', 'Mehta', 'M', '1980-02-17')]

# giving the column names for the dataframe
columns = ['First Name', 'Last Name', 'Gender', 'DOB']

# creating the dataframe df
df = spark.createDataFrame(data, columns)

# printing dataframe schema
df.printSchema()

# show dataframe
df.show()
Output:
Example 1: Split column using withColumn()
In this example, we created a simple DataFrame whose 'DOB' column contains dates of birth as yyyy-mm-dd strings. Using split() together with withColumn(), the column is split into Year, Month, and Day columns.
Python3
# split() function defining parameters
split_cols = pyspark.sql.functions.split(df['DOB'], '-')

# now applying split() using withColumn()
df1 = df.withColumn('Year', split_cols.getItem(0)) \
        .withColumn('Month', split_cols.getItem(1)) \
        .withColumn('Day', split_cols.getItem(2))

# show df1
df1.show()
Output:
Alternatively, we can write it like this; it gives the same output:
Python3
# defining split() along with withColumn()
df2 = df.withColumn('Year', split(df['DOB'], '-').getItem(0)) \
        .withColumn('Month', split(df['DOB'], '-').getItem(1)) \
        .withColumn('Day', split(df['DOB'], '-').getItem(2))

# show df2
df2.show()
Output:
In the above example we used two parameters of split(): 'str', the column to split (here df['DOB']), and 'pattern', the expression describing where to split the data (here the literal '-').
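Since the pattern is a Java regular expression, it is not limited to a single literal character. As a sketch (the mixed-delimiter dates below are made up for illustration, reusing the spark session created above), a character class such as '[-/]' splits dates written with either delimiter:

Python3

# hypothetical data with inconsistent delimiters, for illustration only
mixed = spark.createDataFrame([('1991-04-01',), ('1991/04/01',)], ['DOB'])

# '[-/]' is a Java regex character class matching '-' or '/'
parts = split(mixed['DOB'], '[-/]')

mixed.withColumn('Year', parts.getItem(0)) \
     .withColumn('Month', parts.getItem(1)) \
     .withColumn('Day', parts.getItem(2)) \
     .show()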
Example 2: Split column using select()
In this example we will use the same data, rebuild the DataFrame df, and split its 'DOB' column using select():
Python3
# creating the row data for the dataframe
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
        ('Milan', 'Sharma', 'M', '2000-05-19'),
        ('Rohit', 'Verma', 'M', '1978-09-05'),
        ('Maria', 'Anne', 'F', '1967-12-01'),
        ('Jay', 'Mehta', 'M', '1980-02-17')]

# giving the column names for the dataframe
columns = ['First Name', 'Last Name', 'Gender', 'DOB']

# creating the dataframe df
df = spark.createDataFrame(data, columns)

# printing dataframe schema
df.printSchema()

# show dataframe
df.show()

# defining split()
split_cols = pyspark.sql.functions.split(df['DOB'], '-')

# applying split() using select(); note that 'Gender' is left out
df3 = df.select('First Name', 'Last Name', 'DOB',
                split_cols.getItem(0).alias('year'),
                split_cols.getItem(1).alias('month'),
                split_cols.getItem(2).alias('day'))

# show df3
df3.show()
Output:
In the above example, we did not include the 'Gender' column in select(), so it is not visible in the resulting df3.
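Note that the pieces produced by split() are still strings. If numeric values are needed, one possible approach (a sketch, not part of the original example) is to cast each piece as it is selected:

Python3

# same split as above, but casting each string piece to an integer
split_cols = pyspark.sql.functions.split(df['DOB'], '-')

df4 = df.select('First Name', 'Last Name', 'DOB',
                split_cols.getItem(0).cast('int').alias('year'),
                split_cols.getItem(1).cast('int').alias('month'),
                split_cols.getItem(2).cast('int').alias('day'))

# year, month, and day are now integer columns
df4.printSchema()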
Example 3: Splitting another string column
Python3
# creating the row data for the dataframe
data = [('Jaya', 'Sinha'),
        ('Milan', 'Soni'),
        ('Rohit', 'Verma'),
        ('Maria', 'Anne'),
        ('Jay', 'Mehta')]

# giving the column names for the dataframe
columns = ['First Name', 'Last Name']

# creating the dataframe df
df = spark.createDataFrame(data, columns)

# printing dataframe schema
df.printSchema()

# show dataframe
df.show()

# defining split() with an empty pattern to get one character per piece
split_cols = pyspark.sql.functions.split(df['Last Name'], '')

# applying split() using .withColumn()
df = df.withColumn('1', split_cols.getItem(0)) \
       .withColumn('2', split_cols.getItem(1)) \
       .withColumn('3', split_cols.getItem(2)) \
       .withColumn('4', split_cols.getItem(3)) \
       .withColumn('5', split_cols.getItem(4))

# show df
df.show()
Output:
In the above example, we took only two columns, First Name and Last Name, and split the Last Name values into single characters residing in multiple columns.
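A caveat worth knowing: getItem(i) returns null rather than raising an error when i is past the end of a row's array, so splitting strings of varying length never fails; the extra columns are simply left null. A small sketch using size() (also from pyspark.sql.functions) makes the per-row array length visible:

Python3

# size() reports how many pieces split() produced for each row
from pyspark.sql.functions import split, size

df.select('Last Name',
          size(split(df['Last Name'], '')).alias('num_pieces')).show()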