In this article, we are going to learn how to randomly split a data frame using PySpark in Python.
PySpark is the tool created by the Apache Spark community for using Python with Spark. While working with a PySpark data frame, we sometimes need to split it randomly. In this article, we achieve this with PySpark's randomSplit() function. The function splits the data frame according to the given weights and, unless a seed is supplied, produces a different split every time it is run.
randomSplit() function:
Syntax: data_frame.randomSplit(weights, seed=None)
Parameters:
- weights: A list of double values giving the relative weights with which the data frame is split. If the weights do not sum to 1, Spark normalizes them.
- seed: The seed for sampling. With the same seed and weights, the data frame is always split into the same parts; change either and the split changes (see the sketch below).
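For instance, here is a minimal sketch of how the weights and seed interact. The data frame, weight values, and seed below are illustrative, not taken from the examples that follow:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A small data frame of 100 rows to split
df = spark.range(100)

# Weights need not sum to 1; Spark normalizes them, so
# [4.0, 6.0] targets roughly a 40/60 split
part_a, part_b = df.randomSplit([4.0, 6.0], seed=42)
print(part_a.count(), part_b.count())

# Re-splitting with the same weights and seed yields the
# same parts; omitting the seed gives a new split each run
part_c, part_d = df.randomSplit([4.0, 6.0], seed=42)
print(part_c.count(), part_d.count())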
Prerequisite
Note: In the article about installing PySpark, install Python instead of Scala; the rest of the steps are the same.
Modules Required:
PySpark: The Python API for Apache Spark, which lets Python programs work with Spark data frames much as they would with libraries such as Pandas. This module can be installed through the following command in Python:
pip install pyspark
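To confirm the installation worked (assuming pip installed into the Python environment you are running):

import pyspark
print(pyspark.__version__)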
Stepwise Implementation:
Step 1: First of all, import the required libraries, i.e. SparkSession. The SparkSession library is used to create the session.
from pyspark.sql import SparkSession
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, read the CSV file and display it to check that it loaded correctly.
data_frame = spark_session.read.csv('Path_to_csv_file',
                                    sep=',', inferSchema=True,
                                    header=True)
data_frame.show()
Step 4: Next, split the data frame randomly using the randomSplit() function, which takes weights and an optional seed as arguments. Store the resulting parts either in a list or in separate variables (both forms are shown below).
splits=data_frame.randomSplit(weights, seed=None)
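Alternatively, the two parts can be unpacked straight into separate variables (weights and seed here are the same placeholders as above):

split1, split2 = data_frame.randomSplit(weights, seed=None)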
Step 5: Finally, check the row counts of the resulting parts to see how the data frame was split.
splits[0].count()
splits[1].count()
Example 1:
In this example, we have split the data frame (link) with the randomSplit() function, passing only weights as an argument, and stored the result in a list. We split the data frame twice to see whether we get the same part sizes each time; since no seed is given, we got different values on each run.
Python3
# Python program to show random sampling of
# Pyspark data frame without seed as argument
# and storing the result in a list

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
# Here the CSV file is in the same folder
data_frame = spark_session.read.csv('california_housing_train.csv',
                                    sep=',', inferSchema=True,
                                    header=True)

# Display the csv file read
data_frame.show()

# Split the dataframe into 2 parts with only weights as
# argument, so the dataframe is split into different
# fractional parts on every run
splits = data_frame.randomSplit([1.0, 3.0])

# Checking the count of the 1st part of the split dataframe
splits[0].count()

# Checking the count of the 2nd part of the split dataframe
splits[1].count()

# Split the dataframe again with only weights as argument
# to check whether the split sizes change or stay the same
splits = data_frame.randomSplit([1.0, 3.0])

# Checking the count of the 1st part of the split dataframe
splits[0].count()

# Checking the count of the 2nd part of the split dataframe
splits[1].count()
Output:
4233
12767
4202
12798
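These counts line up with the weights: 4233 + 12767 = 17000 rows in total, and 4233 / 17000 ≈ 0.25, matching the normalized weight 1.0 / (1.0 + 3.0). Because no seed was given, the second split (4202 and 12798) lands on slightly different rows.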
Example 2:
In this example, we have split the data frame (link) with the randomSplit() function, passing both weights and a seed as arguments, and stored the result in a list. We split the data frame twice to see whether we get the same part sizes each time; with a fixed seed, we got the same values each time.
Python3
# Python program to show random sampling of
# Pyspark data frame with seed as argument
# and storing the result in a list

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
# Here the CSV file is saved in the same folder
data_frame = spark_session.read.csv('california_housing_train.csv',
                                    sep=',', inferSchema=True,
                                    header=True)

# Display the csv file read
data_frame.show()

# Split the dataframe into 2 parts with weights & seed as
# arguments, so the dataframe is always split into the
# same fractional parts
splits = data_frame.randomSplit([1.0, 3.0], 26)

# Checking the count of the 1st part of the split dataframe
splits[0].count()

# Checking the count of the 2nd part of the split dataframe
splits[1].count()

# Split the dataframe again with the same weights & seed
# to check whether the split sizes change or stay the same
splits = data_frame.randomSplit([1.0, 3.0], 26)

# Checking the count of the 1st part of the split dataframe
splits[0].count()

# Checking the count of the 2nd part of the split dataframe
splits[1].count()
Output:
4181
12819
4181
12819
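The identical counts on both runs confirm that the split is reproducible. Note that this reproducibility assumes the underlying data frame and its partitioning are unchanged between the two calls.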
Example 3:
In this example, we have split the data frame (link) with the randomSplit() function, passing only weights as an argument, and stored the two parts in separate variables. We split the data frame twice to see whether we get the same part sizes each time; without a seed, we got different values on each run.
Python3
# Python program to show random sampling of
# Pyspark data frame without seed as argument
# and storing the result in variables

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
# Here the CSV file is saved in the same folder
data_frame = spark_session.read.csv('california_housing_train.csv',
                                    sep=',', inferSchema=True,
                                    header=True)

# Display the csv file read
data_frame.show()

# Split the dataframe into 2 parts, split1 & split2,
# with only weights as argument, so the dataframe is
# split into different fractional parts on every run
split1, split2 = data_frame.randomSplit([1.0, 5.0])

# Checking the count of split1
split1.count()

# Checking the count of split2
split2.count()

# Split the dataframe again with only weights as argument
# to check whether the split sizes change or stay the same
split1, split2 = data_frame.randomSplit([1.0, 5.0])

# Checking the count of split1
split1.count()

# Checking the count of split2
split2.count()
Output:
2818
14182
2783
14217
Example 4:
In this example, we have split the data frame (link) with the randomSplit() function, passing both weights and a seed as arguments, and stored the two parts in separate variables. We split the data frame twice to see whether we get the same part sizes each time; with a fixed seed, we got the same values each time.
Python3
# Python program to show random sampling of
# Pyspark data frame with seed as argument
# and storing the result in variables

# Import the SparkSession library
from pyspark.sql import SparkSession

# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()

# Read the CSV file
# Here the CSV file is saved in the same folder
data_frame = spark_session.read.csv('california_housing_train.csv',
                                    sep=',', inferSchema=True,
                                    header=True)

# Display the csv file read
data_frame.show()

# Split the dataframe into 2 parts, split1 & split2,
# with weights & seed as arguments, so the dataframe is
# always split into the same fractional parts
split1, split2 = data_frame.randomSplit([1.0, 5.0], 24)

# Checking the count of split1
split1.count()

# Checking the count of split2
split2.count()

# Split the dataframe again with the same weights & seed
# to check whether the split sizes change or stay the same
split1, split2 = data_frame.randomSplit([1.0, 5.0], 24)

# Checking the count of split1
split1.count()

# Checking the count of split2
split2.count()
Output:
2776
14224
2776
14224