In this article, we will discuss how to drop columns in a PySpark DataFrame.
In PySpark, the drop() function removes columns from a DataFrame, while the related na.drop() function removes rows that contain NULL values.
Syntax: dataframe_name.na.drop(how="any", thresh=threshold_value, subset=["column_name_1", "column_name_2"])
- how – Takes either 'any' or 'all'. With 'any', a row is dropped if it contains a NULL in any column; with 'all', a row is dropped only if every column is NULL. The default is 'any'.
- thresh – Takes an integer and drops rows that have fewer than that many non-null values. The default is None; when set, it overrides how.
- subset – Restricts the NULL check to the listed columns. The default is None, which checks every column.
Python code to create an employee DataFrame with three columns:
Python3
# importing module
import pyspark

# importing SparkSession from pyspark.sql module
from pyspark.sql import SparkSession

# creating a SparkSession and giving it an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data with 7 row values
data = [["1", "sravan", "company 1"],
        ["3", "bobby", "company 3"],
        ["2", "ojaswi", "company 2"],
        ["1", "sravan", "company 1"],
        ["3", "bobby", "company 3"],
        ["4", "rohith", "company 2"],
        ["5", "gnanesh", "company 1"]]

# specify column names
columns = ['Employee ID', 'Employee NAME', 'Company Name']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output:
+-----------+-------------+------------+
|Employee ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|          1|       sravan|   company 1|
|          3|        bobby|   company 3|
|          2|       ojaswi|   company 2|
|          1|       sravan|   company 1|
|          3|        bobby|   company 3|
|          4|       rohith|   company 2|
|          5|      gnanesh|   company 1|
+-----------+-------------+------------+
Example 1: Delete a single column.
Here we are going to delete a single column from the dataframe.
Syntax: dataframe.drop('column name')
Code:
Python3
# delete single column
dataframe = dataframe.drop('Employee ID')
dataframe.show()
Output:
+-------------+------------+
|Employee NAME|Company Name|
+-------------+------------+
|       sravan|   company 1|
|        bobby|   company 3|
|       ojaswi|   company 2|
|       sravan|   company 1|
|        bobby|   company 3|
|       rohith|   company 2|
|      gnanesh|   company 1|
+-------------+------------+
Example 2: Delete multiple columns.
Here we will delete multiple columns from the dataframe.
Syntax: dataframe.drop(*('column 1', 'column 2', 'column n'))
Code:
Python3
# delete two columns
dataframe = dataframe.drop(*('Employee NAME', 'Employee ID'))
dataframe.show()
Output:
+------------+
|Company Name|
+------------+
|   company 1|
|   company 3|
|   company 2|
|   company 1|
|   company 3|
|   company 2|
|   company 1|
+------------+
Example 3: Delete all columns.
Here we will delete all the columns from the dataframe. To do this, we put the column names in a list and unpack it into drop().
Python3
# list of all column names (named column_names to avoid shadowing the built-in list)
column_names = ['Employee ID', 'Employee NAME', 'Company Name']

# delete all columns
dataframe = dataframe.drop(*column_names)
dataframe.show()
Output:
++
||
++
||
||
||
||
||
||
||
++