Sunday, September 22, 2024
Google search engine
HomeLanguagesCreate MapType Column from Existing Columns in PySpark

Create MapType Column from Existing Columns in PySpark

An RDD transformation that applies the transformation function to every element of the data frame is known as a map in Pyspark. There occurs various situations when you have numerous columns and you need to convert them to map-type columns. It can be done easily by using the create_map function with the map key column name and column name as arguments. Continue reading the article further to know about it in detail.

Syntax: df.withColumn(“map_column_name”,create_map( lit(“mapkey_1”),col(“column_1”), lit(“mapkey_2”),col(“column_2”) )).drop( “column_1”, “column_2” ).show(truncate=False)

Here,

  • column_1, column_2, column_3: These are the column names which needs to be converted to map.
  • mapkey_1, mapkey_2, mapkey_3: These are the names of the map keys to be given to data on creation of map.
  • map_column_name: It is the name given to the column in which map is stored.

Example 1:

In this example, we have used a data set (link), which is basically a 5×5 data frame as follows:

 

Then, we converted the columns ‘name,’ ‘class’ and ‘fees’ to map using the create_map function and stored them in the column ‘student_details‘ dropping the existing ‘name,’ ‘class’ and ‘fees’ columns.

Python3




# PySpark - Create MapType Column from existing columns
  
# Import the libraries SparkSession, col, lit, create_map
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, create_map
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Read the CSV file
data_frame = csv_file = spark_session.read.csv(
    '/content/class_data.csv', sep=',', inferSchema=True, header=True)
  
# Convert name, class and fees columns to map
data_frame = data_frame.withColumn("student_details",
                                   create_map(lit("student_name"), 
                                              col("name"), 
                                              lit( "student_class"),
                                              col("class"),
                                              lit("student_fees"), 
                                              col("fees"))).drop("name",
                                                                 "class",
                                                                 "fees")
  
# Display the data frame
data_frame.show(truncate=False)


Output:

 

Example 2:

In this example, we have created a data frame with columns emp_id, name, superior_emp_id, year_joined, emp_dept_id, gender, and salary as follows: 

 

Then, we converted the columns name, superior_emp_id, year_joined, emp_dept_id, gender, and salary to map using the create_map function and stored in the column ‘employee_details‘ dropping the existing name, superior_emp_id, year_joined, emp_dept_id, gender, and salary columns.

Python3




#PySpark - Create MapType Column from existing columns
  
# Import the libraries SparkSession, col, lit, create_map
from pyspark.sql import SparkSession
from pyspark.sql.functions import col,lit,create_map
  
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
  
# Define the data set
emp = [(1,"Smith",-1,"2018","10","M",3000),
       (2,"Rose",1,"2010","20","M",4000), 
       (3,"Williams",1,"2010","10","M",1000),
       (4,"Jones",2,"2005","10","F",2000), 
       (5,"Brown",2,"2010","40","F",4000), 
       (6,"Brown",2,"2010","50","M",2000) ]
  
# Define the schema of the data set
empColumns = ["emp_id","name","superior_emp_id",
              "year_joined", "emp_dept_id",
              "gender","salary"]
  
# Create the data frame through data set and schema
empDF = spark_session.createDataFrame(data=emp, 
                                      schema = empColumns)
  
# Convert name, superior_emp_id, year_joined, emp_dept_id, gender, and salary columns to maptype column
empDF = empDF.withColumn("employee_details",
                         create_map(lit("name"),
                                    col("name"), 
                                    lit("superior_emp_id"),
                                    col("superior_emp_id"), 
                                    lit("year_joined"),
                                    col("year_joined"), 
                                    lit("emp_dept_id"),
                                    col("emp_dept_id"), 
                                    lit("gender"),
                                    col("gender"), 
                                    lit("salary"),
                                    col("salary"))).drop("name",
                                                         "superior_emp_id",
                                                         "year_joined"
                                                         "emp_dept_id",
                                                         "gender"
                                                         "salary")
  
# Display the data frame
empDF.show(truncate=False)


Output:

 

Dominic Rubhabha-Wardslaus
Dominic Rubhabha-Wardslaushttp://wardslaus.com
infosec,malicious & dos attacks generator, boot rom exploit philanthropist , wild hacker , game developer,
RELATED ARTICLES

Most Popular

Recent Comments