Conversion between PySpark and Pandas DataFrames
PySpark and Pandas are two open-source Python libraries used for data analysis and data handling. In this article, we will look at how to convert a PySpark DataFrame into a Pandas DataFrame and vice versa. Both conversions can be done easily in PySpark.
Converting Pandas DataFrame into a PySpark DataFrame
Here, we'll convert a Pandas DataFrame into a PySpark DataFrame. First, we'll import the PySpark and Pandas libraries and start a Spark session. Then we'll create a Pandas DataFrame and convert it to a PySpark DataFrame by passing it to the createDataFrame() method, storing the result in the same variable that previously held the Pandas DataFrame.
Example:
Python3
# importing the pandas and PySpark libraries
import pandas as pd
import pyspark

# initializing the PySpark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()

# creating a pandas DataFrame
df = pd.DataFrame({
    'Cardinal': [1, 2, 3],
    'Ordinal': ['First', 'Second', 'Third']
})

# converting the pandas DataFrame into a PySpark DataFrame
df = spark.createDataFrame(df)

# printing the first two rows
df.show(2)
Output:
If you would like to use the Pandas DataFrame later, store the converted PySpark DataFrame in a different variable instead.
Converting PySpark DataFrame into a Pandas DataFrame
Now we'll convert a PySpark DataFrame into a Pandas DataFrame. The steps are the same, but this time we make use of the toPandas() method.
Syntax to use toPandas() method:
spark_DataFrame.toPandas()
Example:
Python3
# importing the PySpark library
import pyspark

# importing Row for creating the DataFrame
from pyspark.sql import Row

# initializing the PySpark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()

# creating a PySpark DataFrame
spark_df = spark.createDataFrame([
    Row(Cardinal=1, Ordinal='First'),
    Row(Cardinal=2, Ordinal='Second'),
    Row(Cardinal=3, Ordinal='Third')
])

# converting spark_df into a pandas DataFrame
pandas_df = spark_df.toPandas()
pandas_df.head()
Output:
Now we will check the time required to do the above conversion.
Python3
%%time
import numpy as np
import pandas as pd
import pyspark

# creating a session in PySpark
spark = pyspark.sql.SparkSession.builder.getOrCreate()

# creating a 10 x 10 PySpark DataFrame of random integers
spark_df = spark.createDataFrame(pd.DataFrame(
    np.reshape(np.random.randint(1, 101, size=100), newshape=(10, 10))))

spark_df.toPandas()
Output
3.17 s
Now let’s enable the PyArrow and see the time taken by the process.
Python3
%%time
import numpy as np
import pandas as pd
import pyspark

# creating a session in PySpark
spark = pyspark.sql.SparkSession.builder.getOrCreate()

# creating a 10 x 10 PySpark DataFrame of random integers
spark_df = spark.createDataFrame(pd.DataFrame(
    np.reshape(np.random.randint(1, 101, size=100), newshape=(10, 10))))

# enabling PyArrow
spark.conf.set('spark.sql.execution.arrow.enabled', 'true')

spark_df.toPandas()
Output
460 ms
Here we can see that the time required to convert a PySpark DataFrame to a Pandas DataFrame is reduced drastically when the PyArrow-optimized path is enabled.