Have you ever been stuck with the data of several columns packed into a single column and wondered how to split it apart? This can be achieved in PySpark in several ways, and in this article we will discuss a few of them.
Modules Required:
Pyspark: PySpark is the Python API for Apache Spark, an open-source, distributed computing framework and set of libraries for real-time, large-scale data processing. This module can be installed through the following command in Python:
pip install pyspark
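As a quick check that the installation worked (a small addition on our part, not a required step), you can import the module and print its version:
# Optional sanity check: confirm PySpark is importable and see its version
import pyspark
print(pyspark.__version__)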
Methods to split a list into multiple columns in Pyspark:
- Using expr in a list comprehension
- Splitting data frame row-wise and appending in columns
- Splitting data frame columnwise
Method 1: Using expr in a list comprehension
Step 1: First of all, import the required libraries, i.e. SparkSession, expr and the types module. The SparkSession library is used to create the session, expr is an SQL function used to execute SQL-like expressions, and pyspark.sql.types provides the data types needed to define the schema.
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.types import *
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, define the schema for creating the data frame with an array-typed column.
mySchema = StructType([StructField("Heading", StringType(), True),
                       StructField("Column", ArrayType(IntegerType(), True))])
Step 4: Later on, create the data frame that needs to be split into multiple columns.
data_frame = spark_session.createDataFrame([['column_heading1', [column1_data]], ['column_heading2', [column2_data]]], schema= mySchema)
Step 5: Finally, split the list into columns using the expr() function in a list comprehension.
data_frame.select([expr('Column[' + str(x) + ']') for x in range(0, number_of_columns)]).show()
Example:
In this example, we have defined the schema for the data frame, created the data frame from a list of data in that schema, and finally split the array column into separate columns using the expr function in a list comprehension.
Python3
# Python program to split list to multiple columns
# in Pyspark by using expr in a list comprehension

# Import the SparkSession, expr and types libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.types import *

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Define schema to create DataFrame with an array typed column
mySchema = StructType([StructField("Heading", StringType(), True),
                       StructField("Column", ArrayType(IntegerType(), True))])

# Create the dataframe that needs to be split in multiple columns
data_frame = spark_session.createDataFrame(
    [['A', [1, 2, 3]],
     ['B', [4, 5, 6]],
     ['C', [7, 8, 9]]],
    schema=mySchema)

# Split list into columns using 'expr()' in a list comprehension
data_frame.select([expr('Column[' + str(x) + ']')
                   for x in range(0, 3)]).show()
Output:
+---------+---------+---------+
|Column[0]|Column[1]|Column[2]|
+---------+---------+---------+
|        1|        2|        3|
|        4|        5|        6|
|        7|        8|        9|
+---------+---------+---------+
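By default, the split columns keep the names Column[0], Column[1], and so on. If you prefer readable headings, one small variation (our own addition, not part of the original example) is to attach an alias() inside the same list comprehension:
# Variation: rename each split column with alias() while splitting
data_frame.select([expr('Column[' + str(x) + ']').alias('Value ' + str(x + 1))
                   for x in range(0, 3)]).show()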
Method 2: Splitting data frame row-wise and appending in columns
Step 1: First of all, import the required libraries, i.e. SparkSession, Row and col. The SparkSession library is used to create the session, Row is used to represent a row of the data frame, and col is used to refer to a column of the data frame.
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, declare an array that you need to split into multiple columns.
arr=[[row1_data],[row2_data],[row3_data]]
Step 4: Later on, create the data frame, with one Row object per sub-list of the array.
data_frame = spark_session.createDataFrame([Row(index=1, finalArray=arr[0]),
                                            Row(index=2, finalArray=arr[1]),
                                            Row(index=3, finalArray=arr[2])])
Step 5: Finally, append the columns to the data frame.
data_frame.select([(col("finalArray")[x]).alias("Column "+str(x+1)) for x in range(0, 3)]).show()
Example:
In this example, we have declared the list, created a data frame from it row-wise, and then placed the split data into separate columns for display.
Python3
# Python program to split list to multiple columns in Pyspark by
# splitting data frame row-wise and appending in columns

# Import the SparkSession, Row and col libraries
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Declare the list
arr = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Creating one row in the dataframe per sub-list
data_frame = spark_session.createDataFrame([Row(index=1, finalArray=arr[0]),
                                            Row(index=2, finalArray=arr[1]),
                                            Row(index=3, finalArray=arr[2])])

# Appending new columns to the dataframe
data_frame.select([(col("finalArray")[x]).alias("Column " + str(x + 1))
                   for x in range(0, 3)]).show()
Output:
+--------+--------+--------+
|Column 1|Column 2|Column 3|
+--------+--------+--------+
|       1|       2|       3|
|       4|       5|       6|
|       7|       8|       9|
+--------+--------+--------+
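The comprehension above hard-codes three columns. If the sub-lists can have different lengths, one possible variation (our own assumption, not part of the original example) is to compute the longest array first and let Spark fill the missing positions with null:
from pyspark.sql.functions import col, size, max as max_

# Find the length of the longest array so the comprehension covers every element
n = data_frame.select(max_(size("finalArray"))).collect()[0][0]

# Shorter arrays simply produce null in the extra columns
data_frame.select([(col("finalArray")[x]).alias("Column " + str(x + 1))
                   for x in range(n)]).show()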
Method 3: Splitting data frame columnwise
Step 1: First of all, import the required libraries, i.e. SparkSession. The SparkSession library is used to create the session.
from pyspark.sql import SparkSession
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, get the Spark context from the session.
sc=spark_session.sparkContext
Step 4: Later on, create the data frame that needs to be split into multiple columns.
data_frame = spark_session.createDataFrame(sc.parallelize([['column_heading1', [column1_data]], ['column_heading2', [column2_data]]]), ["key", "value"])
Step 5: Finally, split the data frame column-wise.
data_frame.select("key", data_frame.value[0], data_frame.value[1], data_frame.value[2]).show()
Example:
In this example, we have created an RDD from the list using the Spark context, built a data frame from it, and then split the array column into multiple columns for display.
Python3
# Python program to split list to multiple columns
# in Pyspark by splitting data frame columnwise

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Get the spark context from the session
sc = spark_session.sparkContext

# Create the dataframe using the list that needs to be split in multiple columns
data_frame = spark_session.createDataFrame(
    sc.parallelize([['Column 1', [1, 2, 3]],
                    ['Column 2', [4, 5, 6]],
                    ['Column 3', [7, 8, 9]]]),
    ["key", "value"])

# Splitting dataframe columnwise
data_frame.select("key", data_frame.value[0],
                  data_frame.value[1],
                  data_frame.value[2]).show()
Output:
+--------+--------+--------+--------+
|     key|value[0]|value[1]|value[2]|
+--------+--------+--------+--------+
|Column 1|       1|       2|       3|
|Column 2|       4|       5|       6|
|Column 3|       7|       8|       9|
+--------+--------+--------+--------+
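All three methods index the array positions by hand. As a closing sketch (our own addition, not part of the original article), the same result can be reached without hard-coded indices by exploding each array element together with its position and pivoting the positions back into columns, assuming the arrays are short enough for a pivot:
from pyspark.sql.functions import posexplode, first

# posexplode() yields one row per element with its position ('pos') and
# value ('col'); pivoting on 'pos' turns those rows back into columns
data_frame.select("key", posexplode("value")) \
    .groupBy("key") \
    .pivot("pos") \
    .agg(first("col")) \
    .show()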