In this article, we are going to add suffixes and prefixes to all columns using PySpark in Python.
PySpark is the Python API for Apache Spark, an open-source, distributed computing framework for real-time, large-scale data processing. While working in PySpark, have you ever needed to add suffixes or prefixes, or both, to all the columns in a PySpark data frame? Don’t know how to fulfill the requirement? Read the article further to know the various methods to add suffixes and prefixes to all columns in PySpark.
Methods to add suffixes and prefixes to all columns in PySpark
- Using loops
- Using the reduce function
- Using the alias function
- Using the add_prefix function
- Using the add_suffix function
- Using the toDF function
Method 1: Using loops
A process that can be used to repeat a certain part of code is known as looping. In this method, we will see how we can add suffixes or prefixes, or both, using loops on all the columns of the data frame created by the user or read through the CSV file. What we will do is run a loop that renames all the columns one by one.
Steps to add Suffixes and Prefixes using loops:
Step 1: First of all, import the required library, i.e., SparkSession. The SparkSession library is used to create the session.
from pyspark.sql import SparkSession
Step 2: Create a spark session using the getOrCreate() function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Read the CSV file for which you want to rename the column names with prefixes or suffixes, or create the data frame using the createDataFrame function.
data_frame = spark_session.createDataFrame([(row_1_data), (row_2_data), (row_3_data)], ['column_name_1', 'column_name_2', 'column_name_3'])
or
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Later on, obtain all the column names in a list using the columns attribute of the data frame.
total_columns=data_frame.columns
Step 5: Further, run a loop to rename all the columns of the data frame with prefixes, suffixes, or both.
# Loop to add the prefix 'prefix_' to every column name
for i in range(len(total_columns)):
    data_frame = data_frame.withColumnRenamed(total_columns[i],
                                              'prefix_' + total_columns[i])

# Alternatively, loop to add the suffix '_suffix' to every column name
for i in range(len(total_columns)):
    data_frame = data_frame.withColumnRenamed(total_columns[i],
                                              total_columns[i] + '_suffix')
Step 6: Finally, display the updated data frame.
data_frame.show()
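Note that the two loops in Step 5 are alternatives, not a sequence: once the prefix loop has run, the original names stored in total_columns no longer exist, so running the suffix loop afterwards would silently rename nothing (withColumnRenamed is a no-op for a missing column). If you want both a prefix and a suffix, a single loop such as the following sketch does the job:

# Attach both a prefix and a suffix in a single pass
for name in data_frame.columns:
    data_frame = data_frame.withColumnRenamed(name, 'prefix_' + name + '_suffix')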
Example:
In this example, we have used the data frame (link), collected all of its column names in a list, and then run a loop to rename every column with the prefix ‘prefix_‘ before displaying the data frame.
Python3
# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/class_data.csv',
                                    sep=',', inferSchema=True,
                                    header=True)

# Get all the columns of the data frame in a list
total_columns = data_frame.columns

# Run a loop to rename all the columns with the prefix 'prefix_'
for i in range(len(total_columns)):
    data_frame = data_frame.withColumnRenamed(total_columns[i],
                                              'prefix_' + total_columns[i])

# Display the data frame
data_frame.show()
Output:
Method 2: Using reduce function
Python’s functools.reduce() function applies a function of two arguments cumulatively to the items of a sequence, reducing the sequence to a single value; here the accumulated value is the data frame itself. In this method, we will see how we can add suffixes or prefixes, or both, using the reduce function on all the columns of the data frame created by the user or read through the CSV file. What we will do is fold over the column indices, renaming one column per step with the withColumnRenamed() function.
Steps to add Suffixes and Prefixes using reduce function:
Step 1: First of all, import the required libraries, i.e., SparkSession and functools. The SparkSession library is used to create the session, while functools is a standard Python module of higher-order functions, i.e., functions that act on or return other functions.
from pyspark.sql import SparkSession
import functools
Step 2: Now, create a spark session using the getOrCreate() function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file for which you want to rename the column names with prefixes or suffixes, or create the data frame using the createDataFrame function.
data_frame = spark_session.createDataFrame([(row_1_data), (row_2_data), (row_3_data)], ['column_name_1', 'column_name_2', 'column_name_3'])
or
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Further, rename all the columns of the data frame with prefixes, suffixes, or both using reduce function.
data_frame = functools.reduce(
    lambda data_frame, idx: data_frame.withColumnRenamed(
        list(data_frame.schema.names)[idx],
        list(data_frame.schema.names)[idx] + '_suffix'),
    range(len(list(data_frame.schema.names))),
    data_frame)
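An equivalent sketch folds over the column names themselves instead of their indices, which reads a little more simply and behaves the same under the same setup:

# Equivalent: reduce over the column names directly
data_frame = functools.reduce(
    lambda df, col_name: df.withColumnRenamed(col_name, col_name + '_suffix'),
    data_frame.columns,
    data_frame)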
Step 5: Finally, display the updated data frame.
data_frame.show()
Example:
In this example, we have used the data frame (link) for which we have renamed all the column names with the suffix ‘_suffix‘ using the reduce function and displayed the data frame.
Python3
# Python program to add a suffix
# to all columns in PySpark using the reduce function

# Import the SparkSession and functools libraries
from pyspark.sql import SparkSession
import functools

# Create a spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/class_data.csv',
                                    sep=',', inferSchema=True,
                                    header=True)

# Rename all the columns of the data frame with the suffix '_suffix'
# using the reduce function
data_frame = functools.reduce(
    lambda data_frame, idx: data_frame.withColumnRenamed(
        list(data_frame.schema.names)[idx],
        list(data_frame.schema.names)[idx] + '_suffix'),
    range(len(list(data_frame.schema.names))),
    data_frame)

# Display the data frame
data_frame.show()
Output:
Method 3: Using the alias function
The alias function is used to give a column or table in PySpark an alternative name that is often shorter and more readable. In this method, we will see how we can add suffixes or prefixes, or both, using the alias on all the columns of the data frame created by the user or read through the CSV file. What we will do is take the names of all the columns, alias each one with the suffix or prefix attached, and then select the columns under their new names.
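For instance, a minimal sketch of what alias does on a single column (the column and data frame names here are purely illustrative):

from pyspark.sql.functions import col

# Illustrative: select the 'name' column under the new name 'student_name'
data_frame.select(col('name').alias('student_name')).show()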
Steps to add Suffixes and Prefixes using an alias:
Step 1: First of all, import the required libraries, i.e., SparkSession and col. The SparkSession library is used to create the session, while col is used to return a column based on the given column name.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file for which you want to rename the column names with prefixes or suffixes, or create the data frame using the createDataFrame function.
data_frame = spark_session.createDataFrame([(row_1_data), (row_2_data), (row_3_data)], ['column_name_1', 'column_name_2', 'column_name_3'])
or
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Further, rename all the columns of the data frame with prefixes, suffixes, or both using an alias.
updated_columns = [col(col_name).alias("prefix_" + col_name + "_suffix") for col_name in data_frame.columns]
Step 5: Finally, display the updated data frame by selecting the columns under their new names.
data_frame.select(*updated_columns).show()
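As an aside, if you are on PySpark 3.4 or newer (an assumption; older releases lack this method), the built-in withColumnsRenamed function renames every column in one call from a plain mapping:

# Assumes PySpark 3.4+: rename all columns in a single call
mapping = {c: 'prefix_' + c + '_suffix' for c in data_frame.columns}
data_frame.withColumnsRenamed(mapping).show()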
Example:
In this example, we have used the data frame (link) for which we have renamed all the column names with the prefix ‘prefix_‘ and the suffix ‘_suffix‘ using the alias. Finally, we have set the renamed column names to the data frame and displayed the data frame.
Python3
# Python program to add a suffix and a prefix
# to all columns in PySpark using alias

# Import the SparkSession and col libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Create a spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
data_frame = spark_session.read.csv('/content/class_data.csv',
                                    sep=',', inferSchema=True,
                                    header=True)

# Rename all the columns of the data frame with the prefix 'prefix_'
# and the suffix '_suffix' using alias
updated_columns = [col(col_name).alias("prefix_" + col_name + "_suffix")
                   for col_name in data_frame.columns]

# Display the updated data frame with the new column names
data_frame.select(*updated_columns).show()
Output:
Method 4: Using the add_prefix function
The add_prefix() function adds a prefix to the name of every column of the data frame. In this method, we will see how we can add prefixes using add_prefix on all the columns of a pandas-on-Spark data frame created by the user or read through the CSV file. What we will do is call the add_prefix() function on the data frame with the prefix as an argument.
Steps to add Prefixes using the add_prefix function:
Step 1: First of all, import the required library, i.e., pyspark.pandas, which provides a pandas-like DataFrame API while holding a PySpark DataFrame internally.
from pyspark import pandas
Step 2: Now, create the data frame using the DataFrame function with the columns.
pyspark_pandas=pandas.DataFrame({'column_name_1':[column_1_data], 'column_name_2':[column_2_data], 'column_name_3':[column_3_data]})
Step 3: Finally, add the prefix using the add_prefix function and print the data frame.
print(pyspark_pandas.add_prefix('prefix_'))
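add_prefix also has a counterpart, add_suffix (covered in the next method), and the two can be chained when both a prefix and a suffix are needed, as in this sketch:

# Chain add_prefix and add_suffix to attach both at once
print(pyspark_pandas.add_prefix('prefix_').add_suffix('_suffix'))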
Example:
In this example, we have created the data frame with columns ‘name‘, ‘age‘, and ‘fees‘ as follows:
Then, we applied the add_prefix function to add the prefix ‘Student_‘ on the data frame and displayed it.
Python3
# Python program to add a prefix to all columns
# in PySpark using the add_prefix function

# Import the pandas library from PySpark
from pyspark import pandas

# Create a pandas-on-Spark data frame with columns 'name', 'age' and 'fees'
pyspark_pandas = pandas.DataFrame(
    {'name': ['Arun', 'Aniket', 'Ishita', 'Vinayak'],
     'age': [17, 18, 19, 21],
     'fees': [12000, 10000, 14000, 16000]})

# Add the prefix 'Student_' using add_prefix and display the data frame
print(pyspark_pandas.add_prefix('Student_'))
Output:
Method 5: Using the add_suffix function
The add_suffix() function adds a suffix to the name of every column of the data frame. In this method, we will see how we can add suffixes using add_suffix on all the columns of a pandas-on-Spark data frame created by the user or read through the CSV file. What we will do is call the add_suffix() function on the data frame with the suffix as an argument.
Steps to add Suffixes using the add_suffix function:
Step 1: First of all, import the required library, i.e., pyspark.pandas, which provides a pandas-like DataFrame API while holding a PySpark DataFrame internally.
from pyspark import pandas
Step 2: Now, create the data frame using the DataFrame function with the columns.
pyspark_pandas=pandas.DataFrame({'column_name_1':[column_1_data], 'column_name_2':[column_2_data], 'column_name_3':[column_3_data]})
Step 3: Finally, add the suffix using the add_suffix function and print the data frame.
print(pyspark_pandas.add_suffix('_suffix'))
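If your data already lives in a regular PySpark data frame, a sketch like the following converts it to the pandas API, renames the columns, and converts back (this assumes PySpark 3.2+, where pandas_api is available):

# Assumes PySpark 3.2+ and an existing pyspark.sql.DataFrame named data_frame
psdf = data_frame.pandas_api()        # convert to pandas-on-Spark
renamed = psdf.add_suffix('_suffix')  # rename every column
data_frame = renamed.to_spark()       # convert back to a PySpark data frame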
Example:
In this example, we have created the data frame with columns ‘name‘, ‘age‘, and ‘fees‘ as follows:
Then, we applied the add_suffix function to add the suffix ‘_suffix‘ on the data frame and displayed it.
Python3
# Python program to add a suffix to all columns
# in PySpark using the add_suffix function

# Import the pandas library from PySpark
from pyspark import pandas

# Create a pandas-on-Spark data frame with columns 'name', 'age' and 'fees'
pyspark_pandas = pandas.DataFrame(
    {'name': ['Arun', 'Aniket', 'Ishita', 'Vinayak'],
     'age': [17, 18, 19, 21],
     'fees': [12000, 10000, 14000, 16000]})

# Add the suffix '_suffix' using add_suffix and display the data frame
print(pyspark_pandas.add_suffix('_suffix'))
Output:
Method 6: Using the toDF function
The toDF() function returns a new data frame with the specified column names. In this method, we will see how we can add suffixes or prefixes, or both, using the toDF function on all the columns of the data frame created by the user or read through the CSV file. What we will do is define a list of new column names and apply it to the existing data frame with toDF().
Steps to add Suffixes and Prefixes using the toDF function:
Step 1: First of all, import the required libraries, i.e., SparkSession. The SparkSession library is used to create the session.
from pyspark.sql import SparkSession
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, either create the data frame or read the CSV file for which you want to rename the column names with prefixes or suffixes.
data_frame = spark_session.createDataFrame([(row_1_data), (row_2_data), (row_3_data)], ['column_name_1', 'column_name_2', 'column_name_3'])
or
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Further, define the new column names which you want to give to all the columns.
columns=['new_column_name_1','new_column_name_2','new_column_name_3']
Step 5: Finally, use the toDF function to assign the new names to the data frame and display it.
data_frame.toDF(*columns).show()
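Rather than hard-coding the list, the new names can also be derived from the existing ones, as in this sketch:

# Derive the new names from the existing columns instead of hard-coding them
new_columns = ['prefix_' + c + '_suffix' for c in data_frame.columns]
data_frame.toDF(*new_columns).show()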
Example:
In this example, we have created the data frame with three columns ‘Roll_Number‘, ‘Fees‘, and ‘Fine‘ as follows:
Then, we defined a list with new column names and allocated those names to the columns of the data frame using the toDF function.
Python3
# Python program to add a suffix and a prefix
# to all columns in PySpark using the toDF function

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using the getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Create a data frame with three columns 'Roll_Number', 'Fees' and 'Fine'
data_frame = spark_session.createDataFrame(
    [(1, 10000, 400), (2, 14000, 500), (3, 12000, 800)],
    ['Roll_Number', 'Fees', 'Fine'])

# Define the new column names to be given to the data frame
columns = ['Student_Roll_Number', 'Student_Fees', 'Student_Fine']

# Allocate the new column names to the data frame and display it
data_frame.toDF(*columns).show()
Output: