In this article, we are going to learn how to slice a PySpark DataFrame into two parts row-wise. Slicing a DataFrame means taking a subset that contains all rows from one index to another.
Method 1: Using limit() and subtract() functions
In this method, we first create a PySpark DataFrame with hard-coded data using createDataFrame(). We then use the limit() function to take a fixed number of rows from the top of the DataFrame and store the result in a new variable. The syntax of limit() is:
Syntax: DataFrame.limit(num)
Returns: A DataFrame containing num rows.
We then use the subtract() function to get the remaining rows from the original DataFrame. The syntax of subtract() is:
Syntax: DataFrame1.subtract(DataFrame2)
Returns: A new DataFrame containing rows in DataFrame1 but not in DataFrame2.
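Before the full program, here is a minimal sketch of how the two calls pair up (assuming a DataFrame df has already been created); note that subtract() compares row contents, so the order of the remaining rows is not guaranteed:

# First slice: the first 3 rows of the DataFrame
top = df.limit(3)

# Second slice: everything in df that is not in top
rest = df.subtract(top)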
Python
# Importing PySpark
import pyspark
from pyspark.sql import SparkSession

# Session creation
Spark_Session = SparkSession.builder.appName(
    'Spark Session').getOrCreate()

# Data filled in our DataFrame
rows = [['Lee Chong Wei', 69, 'Malaysia'],
        ['Lin Dan', 66, 'China'],
        ['Srikanth Kidambi', 9, 'India'],
        ['Kento Momota', 15, 'Japan']]

# Columns of our DataFrame
columns = ['Player', 'Titles', 'Country']

# DataFrame is created
df = Spark_Session.createDataFrame(rows, columns)

# Getting the slices
# The first slice has 3 rows
df1 = df.limit(3)

# Getting the second slice by removing df1 from df
df2 = df.subtract(df1)

# Printing the first slice
df1.show()

# Printing the second slice
df2.show()
Output:
Method 2: Using randomSplit() function
In this method, we first create a PySpark DataFrame using createDataFrame(). We then use the randomSplit() function to get two slices of the DataFrame, specifying the fraction of rows that each slice should receive. Note that the rows are assigned to the slices randomly.
Syntax: DataFrame.randomSplit(weights, seed)
Parameters:
- weights: list of double values according to which the DataFrame is split.
- seed: the seed for sampling. This parameter is optional.
Returns: A list of the split DataFrames.
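The main example below does not pass a seed, but one can be supplied when the split has to be reproducible. A minimal sketch (assuming a DataFrame df already exists):

# Weights act as relative proportions and are normalized if they
# do not sum to 1; the seed makes the random split repeatable
df1, df2 = df.randomSplit([0.5, 0.5], seed=42)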
Python
# Importing PySpark
import pyspark
from pyspark.sql import SparkSession

# Session creation
Spark_Session = SparkSession.builder.appName(
    'Spark Session').getOrCreate()

# Data filled in our DataFrame
rows = [['Lee Chong Wei', 69, 'Malaysia'],
        ['Lin Dan', 66, 'China'],
        ['Srikanth Kidambi', 9, 'India'],
        ['Kento Momota', 15, 'Japan']]

# Columns of our DataFrame
columns = ['Player', 'Titles', 'Country']

# DataFrame is created
df = Spark_Session.createDataFrame(rows, columns)

# The first slice gets roughly 20% of the rows and the second
# roughly 80%; the rows in each slice are selected randomly
df1, df2 = df.randomSplit([0.20, 0.80])

# Showing the first slice
df1.show()

# Showing the second slice
df2.show()
Output:
Method 3: Using collect() function
In this method, we first create a PySpark DataFrame using createDataFrame(). We then get a list of Row objects from the DataFrame using:
DataFrame.collect()
We then use Python list slicing to split this list into two lists of Row objects. Finally, we convert the two lists back to PySpark DataFrames using createDataFrame().
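Since collect() returns ordinary Row objects, standard Python indexing and slicing apply to the result. A minimal sketch (assuming df already exists and has a 'Player' column, as in the example below):

# collect() brings every row to the driver, so it suits small DataFrames
row_list = df.collect()

# Regular list slicing and access by field name
first_two = row_list[:2]
print(first_two[0]['Player'])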
Python
# Importing PySpark
import pyspark
from pyspark.sql import SparkSession

# Session creation
Spark_Session = SparkSession.builder.appName(
    'Spark Session').getOrCreate()

# Data filled in our DataFrame
rows = [['Lee Chong Wei', 69, 'Malaysia'],
        ['Lin Dan', 66, 'China'],
        ['Srikanth Kidambi', 9, 'India'],
        ['Kento Momota', 15, 'Japan']]

# Columns of our DataFrame
columns = ['Player', 'Titles', 'Country']

# DataFrame is created
df = Spark_Session.createDataFrame(rows, columns)

# Getting the list of Row objects
row_list = df.collect()

# Slicing the Python list
part1 = row_list[:1]
part2 = row_list[1:]

# Converting the slices back to PySpark DataFrames
slice1 = Spark_Session.createDataFrame(part1)
slice2 = Spark_Session.createDataFrame(part2)

# Printing the first slice
print('First DataFrame')
slice1.show()

# Printing the second slice
print('Second DataFrame')
slice2.show()
Output:
Method 4: Converting PySpark DataFrame to a Pandas DataFrame and using iloc[] for slicing
In this method, we first create a PySpark DataFrame using createDataFrame(). We then convert it into a pandas DataFrame using toPandas() and slice it using iloc[] with the syntax:
DataFrame.iloc[start_index:end_index]
The row at end_index is not included. Finally, we convert each pandas slice back to a PySpark DataFrame using createDataFrame().
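To make the end-exclusive behaviour concrete, here is a minimal sketch (assuming pandas_df is the converted pandas DataFrame from the example below):

# Returns the rows at positions 1 and 2; the row at position 3 is excluded
middle = pandas_df.iloc[1:3]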
Python
# Importing PySpark and pandas
import pyspark
from pyspark.sql import SparkSession
import pandas as pd

# Session creation
Spark_Session = SparkSession.builder.appName(
    'Spark Session').getOrCreate()

# Data filled in our DataFrame
rows = [['Lee Chong Wei', 69, 'Malaysia'],
        ['Lin Dan', 66, 'China'],
        ['Srikanth Kidambi', 9, 'India'],
        ['Kento Momota', 15, 'Japan']]

# Columns of our DataFrame
columns = ['Player', 'Titles', 'Country']

# DataFrame is created
df = Spark_Session.createDataFrame(rows, columns)

# Converting the PySpark DataFrame to a pandas DataFrame
pandas_df = df.toPandas()

# First DataFrame formed by slicing (rows at positions 0 and 1)
df1 = pandas_df.iloc[:2]

# Second DataFrame formed by slicing (rows from position 2 onwards)
df2 = pandas_df.iloc[2:]

# Converting the slices back to PySpark DataFrames
df1 = Spark_Session.createDataFrame(df1)
df2 = Spark_Session.createDataFrame(df2)

# Printing the first slice
print('First DataFrame')
df1.show()

# Printing the second slice
print('Second DataFrame')
df2.show()
Output: