pyspark.sql.functions provides a split() function, which is used to split a DataFrame string column into multiple columns.
Syntax: pyspark.sql.functions.split(str, pattern, limit=-1)
Parameters:
- str: the Column (or column name as a str) to split.
- pattern: a str parameter containing the expression to split on. This should be a Java regular expression.
- limit: an optional int parameter that, when specified, controls the number of times the pattern is applied (see the sketch after this list).
- limit > 0: the resulting array's length will not be more than the specified limit.
- limit <= 0: the pattern is applied as many times as possible, and the resulting array can be of any length.
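To make the limit behavior concrete, here is a minimal sketch (the session and app name are created here just for the demo; the article builds its own session below) contrasting a positive limit with the default of -1:

Python3

# a self-contained demo of the limit parameter of split()
from pyspark.sql import SparkSession
from pyspark.sql.functions import split

spark = SparkSession.builder.appName("split-limit-demo").getOrCreate()

demo = spark.createDataFrame([('1991-04-01',)], ['DOB'])

# limit=2: the pattern is applied at most once,
# so the array has at most 2 elements: ['1991', '04-01']
demo.select(split(demo['DOB'], '-', 2).alias('parts')).show(truncate=False)

# limit=-1 (the default): the pattern is applied as many times
# as possible, giving ['1991', '04', '01']
demo.select(split(demo['DOB'], '-', -1).alias('parts')).show(truncate=False)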
First, let's create a DataFrame.
Python3
# installing pyspark
!pip install pyspark

# importing pyspark
import pyspark

# importing SparkSession
from pyspark.sql import SparkSession

# importing all from pyspark.sql.functions
from pyspark.sql.functions import *

# creating SparkSession object
spark = SparkSession.builder.appName("sparkdf").getOrCreate()

# creating the row data for the dataframe
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
        ('Milan', 'Sharma', 'M', '2000-05-19'),
        ('Rohit', 'Verma', 'M', '1978-09-05'),
        ('Maria', 'Anne', 'F', '1967-12-01'),
        ('Jay', 'Mehta', 'M', '1980-02-17')]

# giving the column names for the dataframe
columns = ['First Name', 'Last Name', 'Gender', 'DOB']

# creating the dataframe df
df = spark.createDataFrame(data, columns)

# printing dataframe schema
df.printSchema()

# show dataframe
df.show()
Output:
Example 1: Split column using withColumn()
In this example, we created a simple DataFrame whose 'DOB' column contains dates of birth as yyyy-mm-dd strings. Using split() together with withColumn(), the column is split into Year, Month, and Day columns.
Python3
# split() function defining parameters
split_cols = pyspark.sql.functions.split(df['DOB'], '-')

# now applying split() using withColumn()
df1 = df.withColumn('Year', split_cols.getItem(0)) \
        .withColumn('Month', split_cols.getItem(1)) \
        .withColumn('Day', split_cols.getItem(2))

# show df1
df1.show()
Output:
Alternatively, we can write it like this; it gives the same output:
Python3
# defining split() along with withColumn()
df2 = df.withColumn('Year', split(df['DOB'], '-').getItem(0)) \
        .withColumn('Month', split(df['DOB'], '-').getItem(1)) \
        .withColumn('Day', split(df['DOB'], '-').getItem(2))

# show df2
df2.show()
Output:
In the above example we used two parameters of split(): 'str', the column to split (here df['DOB']), and 'pattern', the expression describing where to split the data (here the literal '-').
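Since the pattern is a Java regular expression, it is not limited to a single literal character. As a sketch (the mixed-delimiter dates below are made up for illustration, reusing the spark session created above), a character class such as '[-/]' splits dates written with either delimiter:

Python3

# hypothetical data with inconsistent delimiters, for illustration only
mixed = spark.createDataFrame([('1991-04-01',), ('1991/04/01',)], ['DOB'])

# '[-/]' is a Java regex character class matching '-' or '/'
parts = split(mixed['DOB'], '[-/]')

mixed.withColumn('Year', parts.getItem(0)) \
     .withColumn('Month', parts.getItem(1)) \
     .withColumn('Day', parts.getItem(2)) \
     .show()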
Example 2: Split column using select()
In this example we will use the same data, rebuild the DataFrame df, and split its 'DOB' column using select():
Python3
# creating the row data for the dataframe
data = [('Jaya', 'Sinha', 'F', '1991-04-01'),
        ('Milan', 'Sharma', 'M', '2000-05-19'),
        ('Rohit', 'Verma', 'M', '1978-09-05'),
        ('Maria', 'Anne', 'F', '1967-12-01'),
        ('Jay', 'Mehta', 'M', '1980-02-17')]

# giving the column names for the dataframe
columns = ['First Name', 'Last Name', 'Gender', 'DOB']

# creating the dataframe df
df = spark.createDataFrame(data, columns)

# printing dataframe schema
df.printSchema()

# show dataframe
df.show()

# defining split()
split_cols = pyspark.sql.functions.split(df['DOB'], '-')

# applying split() using select(); note that 'Gender' is left out
df3 = df.select('First Name', 'Last Name', 'DOB',
                split_cols.getItem(0).alias('year'),
                split_cols.getItem(1).alias('month'),
                split_cols.getItem(2).alias('day'))

# show df3
df3.show()
Output:
In the above example, we did not include the 'Gender' column in select(), so it is not visible in the resulting df3.
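Note that the pieces produced by split() are still strings. If numeric values are needed, one possible approach (a sketch, not part of the original example) is to cast each piece as it is selected:

Python3

# same split as above, but casting each string piece to an integer
split_cols = pyspark.sql.functions.split(df['DOB'], '-')

df4 = df.select('First Name', 'Last Name', 'DOB',
                split_cols.getItem(0).cast('int').alias('year'),
                split_cols.getItem(1).cast('int').alias('month'),
                split_cols.getItem(2).cast('int').alias('day'))

# year, month, and day are now integer columns
df4.printSchema()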
Example 3: Splitting another string column
Python3
# creating the row data for the dataframe
data = [('Jaya', 'Sinha'),
        ('Milan', 'Soni'),
        ('Rohit', 'Verma'),
        ('Maria', 'Anne'),
        ('Jay', 'Mehta')]

# giving the column names for the dataframe
columns = ['First Name', 'Last Name']

# creating the dataframe df
df = spark.createDataFrame(data, columns)

# printing dataframe schema
df.printSchema()

# show dataframe
df.show()

# defining split() with an empty pattern to get one character per piece
split_cols = pyspark.sql.functions.split(df['Last Name'], '')

# applying split() using .withColumn()
df = df.withColumn('1', split_cols.getItem(0)) \
       .withColumn('2', split_cols.getItem(1)) \
       .withColumn('3', split_cols.getItem(2)) \
       .withColumn('4', split_cols.getItem(3)) \
       .withColumn('5', split_cols.getItem(4))

# show df
df.show()
Output:
In the above example, we took only two columns, First Name and Last Name, and split the Last Name values into single characters residing in multiple columns.
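A caveat worth knowing: getItem(i) returns null rather than raising an error when i is past the end of a row's array, so splitting strings of varying length never fails; the extra columns are simply left null. A small sketch using size() (also from pyspark.sql.functions) makes the per-row array length visible:

Python3

# size() reports how many pieces split() produced for each row
from pyspark.sql.functions import split, size

df.select('Last Name',
          size(split(df['Last Name'], '')).alias('num_pieces')).show()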