
Applying function to PySpark Dataframe Column

In this article, we will learn how to apply a function to a PySpark DataFrame column.

Apache Spark can be used from Python through the PySpark library. PySpark is an open-source Python library commonly used for data analytics and data science. Pandas is powerful for data analysis, but what makes PySpark more powerful is its capacity to handle big data.

Note: When installing PySpark, install Python instead of Scala; the rest of the installation steps are the same.

Install required modules

Run the below commands in the command prompt or terminal to install the PySpark and pandas modules:

pip install pyspark
pip install pandas

Applying a Function on a PySpark DataFrame Column

Here we will look at how to apply a function to a PySpark DataFrame column. For this purpose, we will use pandas_udf() from pyspark.sql.functions.

Syntax:

# defining the function
@pandas_udf('function_type')
def function_name(argument: argument_type) -> result_type:
    function_content

# applying the function
DataFrame.select(function_name(specific_DataFrame_column)).show()

Example 1: Adding 's' to every element in a column of the DataFrame

Here, we will apply a function that returns the same elements with an additional 's' appended to each. Let's look at the steps:

  1. Import PySpark module
  2. Import pandas_udf from pyspark.sql.functions.
  3. Initialize the SparkSession.
  4. Use the pandas_udf as the decorator.
  5. Define the function.
  6. Create a DataFrame.
  7. Use the .select() method on the DataFrame, passing the function_name with the specific column you want to apply the function to as its argument.

Python3

# importing SparkSession to initialize a session
from pyspark.sql import SparkSession

# importing pandas_udf
from pyspark.sql.functions import pandas_udf

# importing Row to create the DataFrame
from pyspark.sql import Row

# initializing the Spark session
spark = SparkSession.builder.getOrCreate()

# creating the DataFrame
df = spark.createDataFrame([
    Row(fruits='apple', quantity=1),
    Row(fruits='banana', quantity=2),
    Row(fruits='orange', quantity=4)
])

# printing our created DataFrame
df.show()


Output:

+------+--------+
|fruits|quantity|
+------+--------+
| apple|       1|
|banana|       2|
|orange|       4|
+------+--------+

Now, let's apply the function to the 'fruits' column of this DataFrame.

Python3

# importing pandas to use pd.Series in the type hints
import pandas as pd

# pandas UDF with the return type 'string'
@pandas_udf('string')
def adding_s(s: pd.Series) -> pd.Series:
    # concatenating 's' to each element of the Series
    return s + 's'

# applying the above function to the
# 'fruits' column of the 'df' DataFrame
df.select(adding_s('fruits')).show()


Output:

+----------------+
|adding_s(fruits)|
+----------------+
|          apples|
|         bananas|
|         oranges|
+----------------+
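Because the body of a pandas UDF is ordinary pandas code, the same transformation can be sanity-checked locally on a plain pandas Series, without a Spark session. A quick sketch, using sample data mirroring the 'fruits' column:

```python
import pandas as pd

# sample data mirroring the 'fruits' column
fruits = pd.Series(['apple', 'banana', 'orange'])

# same logic as the adding_s UDF body
print((fruits + 's').tolist())  # ['apples', 'bananas', 'oranges']
```

This is handy for debugging the UDF logic before running it on a cluster.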

Example 2: Capitalizing each element in the 'fruits' column

Here, we will capitalize each element in the 'fruits' column of the same DataFrame from the last example. Let's look at the steps to do that:

  • Use the pandas_udf as the decorator.
  • Define the function.
  • Use the .select() method on the DataFrame, passing the function_name with the specific column we want to apply the function to as its argument.

Python3

@pandas_udf('string')
def capitalize(s1: pd.Series) -> pd.Series:
    # s1 is a pandas Series, which has no capitalize()
    # method of its own; capitalize() is a string method,
    # so it is accessed through the .str accessor
    return s1.str.capitalize()

df.select(capitalize('fruits')).show()


Output:

+------------------+
|capitalize(fruits)|
+------------------+
|             Apple|
|            Banana|
|            Orange|
+------------------+
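The .str accessor works the same way outside Spark, so the capitalize logic can also be verified with pandas alone. A quick check, using sample data mirroring the 'fruits' column:

```python
import pandas as pd

# sample data mirroring the 'fruits' column
s1 = pd.Series(['apple', 'banana', 'orange'])

# same logic as the capitalize UDF body
print(s1.str.capitalize().tolist())  # ['Apple', 'Banana', 'Orange']
```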

Example 3: Square of each element in the 'quantity' column of the 'df' DataFrame

Here, we will create a function that returns the square of each number in the 'quantity' column. Let's look at the steps:

  • Import Iterator from typing.
  • Use pandas_udf() as Decorator.
  • Define the Function.
  • Use the .select() method on the DataFrame, passing the function_name with the specific column you want to apply the function to as its argument.

Python3

from typing import Iterator

@pandas_udf('long')
def square(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    # squaring each batch of values
    for x in iterator:
        yield x * x

df.select(square('quantity')).show()


Output:

+----------------+
|square(quantity)|
+----------------+
|               1|
|               4|
|              16|
+----------------+
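The Iterator[pd.Series] form receives the column as a stream of pandas Series batches and yields one transformed batch per input batch. A minimal pandas-only sketch of that batch logic (the batch sizes here are assumed for illustration; Spark chooses its own batch boundaries):

```python
import pandas as pd

# a generator mirroring the square UDF body:
# each incoming batch (a pandas Series) is squared and yielded
def square_batches(batches):
    for x in batches:
        yield x * x

# two hypothetical batches of the 'quantity' column
batches = [pd.Series([1, 2]), pd.Series([4])]
result = pd.concat(square_batches(batches), ignore_index=True)
print(result.tolist())  # [1, 4, 16]
```

This iterator variant is useful when the UDF has expensive per-batch setup (e.g., loading a model once and reusing it across batches).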

Example 4: Multiplying each element of the 'quantity' column by 10

We will follow the same steps as above, changing only the function slightly.

Python3

from typing import Iterator

@pandas_udf('long')
def multiply_by_10(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for x in iterator:
        # multiplying each element by 10
        yield x * 10

df.select(multiply_by_10('quantity')).show()


Output:

+------------------------+
|multiply_by_10(quantity)|
+------------------------+
|                      10|
|                      20|
|                      40|
+------------------------+
