In this article, we will see how to remove white spaces from a DataFrame string column in PySpark. We will perform the same operation as SQL's trim() (which removes leading and trailing white spaces) in PySpark itself.
PySpark Trim String Column on DataFrame
Below are the ways in which we can trim a string column of a DataFrame in PySpark:
- Using withColumn with rtrim()
- Using withColumn with trim()
- Using select()
- Using SQL Expression
- Using PySpark trim(), rtrim(), ltrim()
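Before diving into the DataFrame examples, the semantics of the three functions can be sketched in plain Python (this is an analogy, not PySpark itself): Spark's trim(), ltrim(), and rtrim() strip space characters from both ends, the left end, and the right end of a string, much like Python's str.strip(" "), str.lstrip(" "), and str.rstrip(" ").

```python
# Plain-Python sketch of what Spark's trim/ltrim/rtrim do to a string.
# Spark's default trim removes space characters from the ends, so
# str.strip(" ") / lstrip(" ") / rstrip(" ") mirror that behavior.
samples = ["ABC ", " DEF", " GHI "]

for s in samples:
    print(repr(s.strip(" ")),    # trim:  both ends
          repr(s.lstrip(" ")),   # ltrim: left end only
          repr(s.rstrip(" ")))   # rtrim: right end only
```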
PySpark Trim using withColumn() with rtrim()
In this example, we are using withColumn() with rtrim() to remove trailing (right-hand) white spaces.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import rtrim, col

# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()

# Define your data
data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]

# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])

# Show the initial DataFrame
df.show()

# Using withColumn with rtrim() to remove trailing white spaces
df = df.withColumn("col2", rtrim(col("col2")))
df.show()

# Stop the Spark session
spark.stop()
+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+
+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
Note that rtrim() only removes trailing spaces: " DEF" keeps its leading space, which is hard to see above because show() right-aligns cell values.
PySpark Trim using withColumn with trim()
In this example, we are using withColumn() with trim() to remove both leading and trailing white spaces.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, col

# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()

# Define your data
data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]

# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])

# Show the initial DataFrame
df.show()

# Using withColumn with trim() to remove white spaces on both ends
df = df.withColumn("col2", trim(col("col2")))
df.show()

# Stop the Spark session
spark.stop()
+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+
+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
PySpark Trim using select() to Remove White Spaces
In this example, we are using select() with trim() to remove the white spaces while projecting the columns.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, col

# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()

# Define your data
data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]

# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])

# Show the initial DataFrame
df.show()

# Using select() to remove white spaces, keeping the column name via alias()
df = df.select(col("col1"), trim(col("col2")).alias("col2"))
df.show()

# Stop the Spark session
spark.stop()
+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+
+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
PySpark Trim using SQL Expression
In this example, we are using a SQL expression to remove the leading and trailing white spaces.
Python3
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()

# Define your data
data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]

# Create a DataFrame
df = spark.createDataFrame(data, ["col1", "col2"])

# Show the initial DataFrame
df.show()

# Register a temp view and use SQL's TRIM() to remove white spaces
df.createOrReplaceTempView("TAB")
spark.sql("SELECT col1, TRIM(col2) AS col2 FROM TAB").show()

# Stop the Spark session
spark.stop()
+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+
+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
Using PySpark trim(), rtrim(), ltrim()
In PySpark, we can easily remove leading and trailing white spaces by using the pyspark.sql.functions.trim() function. We use ltrim() to remove only left (leading) white spaces and rtrim() to remove only right (trailing) white spaces. This example combines all the approaches shown above.
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import trim, ltrim, rtrim

# A Spark session is required before createDataFrame() can be called
spark = SparkSession.builder.appName("WhiteSpaceRemoval").getOrCreate()

data = [(1, "ABC "), (2, " DEF"), (3, " GHI ")]
df = spark.createDataFrame(data=data, schema=["col1", "col2"])
df.show()

# using withColumn and trim()
df.withColumn("col2", trim("col2")).show()

# using ltrim()
df.withColumn("col2", ltrim("col2")).show()

# using rtrim()
df.withColumn("col2", rtrim("col2")).show()

# using select()
df.select("col1", trim("col2").alias("col2")).show()

# using SQL expression
df.createOrReplaceTempView("TAB")
spark.sql("SELECT col1, TRIM(col2) AS col2 FROM TAB").show()

# Stop the Spark session
spark.stop()
+----+-----+
|col1| col2|
+----+-----+
|   1| ABC |
|   2|  DEF|
|   3| GHI |
+----+-----+
+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
+----+----+
|col1|col2|
+----+----+
|   1|ABC |
|   2| DEF|
|   3|GHI |
+----+----+
+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
+----+----+
|col1|col2|
+----+----+
|   1| ABC|
|   2| DEF|
|   3| GHI|
+----+----+
The six tables correspond, in order, to the initial DataFrame and the trim(), ltrim(), rtrim(), select(), and SQL-expression calls. The ltrim() table shows the trailing spaces that remain; the rtrim() table looks trimmed only because show() right-aligns values.
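One caveat worth noting: trim() and its variants only touch the ends of a string; interior runs of spaces are left alone. To collapse those as well, PySpark provides pyspark.sql.functions.regexp_replace(). The regular expression it would apply can be sketched in plain Python (the helper name here is illustrative, not a PySpark API):

```python
import re

def collapse_whitespace(s: str) -> str:
    """Mimic what regexp_replace(trim(col2), r'\\s+', ' ') would do:
    strip the ends, then squeeze interior whitespace runs to one space."""
    return re.sub(r"\s+", " ", s.strip())

print(collapse_whitespace("  A   B  C  "))  # -> 'A B C'
```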