A PySpark data frame is a distributed collection of data grouped into named columns.
There are many situations in which you need only particular rows of a data frame. In such cases, you can split the data frame according to a column value, using either the filter function or the where function. In this article, we will discuss both ways to split data frames by column value.
Ways to split a PySpark data frame by column value:
- Using filter function
- Using where function
Method 1: Using the filter function
The filter function returns the rows of a data frame that satisfy a given condition or SQL expression. In this method, we split the data frame by column value by applying filter twice, once with an equal-to condition and once with a not-equal-to condition, and then display both resulting data frames.
Syntax: data_frame.filter(condition)
Example:
In this example, we read a CSV file, student_data.csv, which holds a 5×5 data set.
Then, we split the data frame on the 'Age' column using the filter function, once where its value is 18 and once where it is not. Finally, we display both data frames.
Python3
# PySpark - Split dataframe by column value using filter function

# Import SparkSession
from pyspark.sql import SparkSession

# Create a Spark session using getOrCreate()
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
df = spark_session.read.csv('student_data.csv', sep=',', inferSchema=True, header=True)

# Split the data frame on age: rows where the value is 18, then rows where it is not
df.filter(df.age == 18).show(truncate=False)
df.filter(df.age != 18).show(truncate=False)
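Output: two tables are displayed, the first with the rows where age is 18 and the second with the rows where age is not 18.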
Method 2: Using the where function
The where function also returns the rows of a data frame that satisfy a given condition or SQL expression; in PySpark, where is an alias for filter. In this method, we split the data frame by column value by applying where twice, once with an equal-to condition and once with a not-equal-to condition, and then display both resulting data frames.
Syntax: data_frame.where(condition)
Example:
In this example, we read the same CSV file, student_data.csv, which holds a 5×5 data set.
Then, we split the data frame on the 'Age' column using the where function, once where its value is 18 and once where it is not. Finally, we display both data frames.
Python3
# PySpark - Split dataframe by column value using where function

# Import SparkSession
from pyspark.sql import SparkSession

# Create a Spark session using getOrCreate()
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
df = spark_session.read.csv('student_data.csv', sep=',', inferSchema=True, header=True)

# Split the data frame on age: rows where the value is 18, then rows where it is not
df.where(df.age == 18).show(truncate=False)
df.where(df.age != 18).show(truncate=False)
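Output: two tables are displayed, the first with the rows where age is 18 and the second with the rows where age is not 18.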