Friday, September 5, 2025

Optimize Conversion between PySpark and Pandas DataFrames

PySpark and Pandas are two open-source libraries used for data analysis and data handling in Python. Pandas works on data held in the memory of a single machine, while PySpark is the Python API for Apache Spark and distributes the work across a cluster.

Conversion between PySpark and Pandas DataFrames

In this article, we look at how to convert a PySpark DataFrame into a Pandas DataFrame and vice versa. PySpark provides a built-in method for each direction, and the conversion can be sped up by enabling Apache Arrow.

Converting Pandas DataFrame into a PySpark DataFrame

Here, we'll convert a Pandas DataFrame into a PySpark DataFrame. First, we import the PySpark and Pandas libraries and start a Spark session. Then we create a Pandas DataFrame and pass it as the argument to the session's createDataFrame() method, which returns a PySpark DataFrame. In this example the result is stored in the same variable that held the Pandas DataFrame.

Example:

Python3




# importing pandas and PySpark libraries
import pandas as pd
import pyspark
  
# initializing the PySpark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()
  
# creating a pandas DataFrame
df = pd.DataFrame({
  'Cardinal':[1, 2, 3],
  'Ordinal':['First','Second','Third']
})
  
# converting the pandas DataFrame into a PySpark DataFrame
df = spark.createDataFrame(df)
  
# printing the first two rows
df.show(2)


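Output:

+--------+-------+
|Cardinal|Ordinal|
+--------+-------+
|       1|  First|
|       2| Second|
+--------+-------+
only showing top 2 rows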

If you want to keep using the pandas DataFrame afterwards, store the PySpark DataFrame in a different variable instead of overwriting df.
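Keeping both DataFrames around can be sketched as below. This is a minimal sketch, not the only way to do it: the helper name pandas_to_spark and the SCHEMA string are made up for illustration, and spark is assumed to be an already running SparkSession. Passing a schema is optional; without it Spark infers the column types from the pandas dtypes.

```python
import pandas as pd

# Same sample data as in the example above.
pandas_df = pd.DataFrame({
    "Cardinal": [1, 2, 3],
    "Ordinal": ["First", "Second", "Third"],
})

# Hypothetical DDL schema string; adjust the types to match your data.
SCHEMA = "Cardinal long, Ordinal string"


def pandas_to_spark(spark, pdf, schema=None):
    """Return a new PySpark DataFrame and leave the pandas DataFrame untouched.

    `spark` is assumed to be an active SparkSession. When `schema` is None,
    Spark infers the column types from the pandas dtypes.
    """
    return spark.createDataFrame(pdf, schema=schema)
```

With a running session, spark_df = pandas_to_spark(spark, pandas_df, SCHEMA) gives a PySpark DataFrame while pandas_df stays available under its own name.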

Converting PySpark DataFrame into a Pandas DataFrame

Now we will convert a PySpark DataFrame into a Pandas DataFrame. The setup is the same, but this time we call the toPandas() method on the PySpark DataFrame, which returns its contents as a Pandas DataFrame.

Syntax to use toPandas() method:

spark_DataFrame.toPandas()
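Because toPandas() collects every row into the memory of the driver program, it is usually worth capping the number of rows before converting a large DataFrame. A minimal sketch of that idea, assuming spark_df is any PySpark DataFrame; the helper name to_pandas_limited and the default of 1000 rows are illustrative choices, not part of PySpark:

```python
def to_pandas_limited(spark_df, max_rows=1000):
    """Convert at most `max_rows` rows of a PySpark DataFrame to pandas.

    limit() is applied on the Spark side, so only the capped number of
    rows is collected to the driver by toPandas().
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be a positive integer")
    return spark_df.limit(max_rows).toPandas()
```

For example, to_pandas_limited(spark_df, 2) would return a Pandas DataFrame with no more than two rows.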

Example:

Python3




# importing PySpark Library
import pyspark
  
# importing Row from pyspark.sql for creating the DataFrame
from pyspark.sql import Row
  
# initializing PySpark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()
  
# creating a PySpark DataFrame
spark_df = spark.createDataFrame([
  Row(Cardinal=1, Ordinal='First'),
  Row(Cardinal=2, Ordinal='Second'),
  Row(Cardinal=3, Ordinal='Third')
])
  
# converting spark_df into a pandas DataFrame
pandas_df = spark_df.toPandas()
  
pandas_df.head()


Output:

   Cardinal  Ordinal
0         1    First
1         2   Second
2         3    Third

Now we will measure the time required for the above conversion, using the %%time cell magic available in Jupyter/IPython notebooks.

Python3




%%time
import numpy as np
import pandas as pd
import pyspark

# creating session in PySpark
spark = pyspark.sql.SparkSession.builder.getOrCreate()

# creating a PySpark DataFrame from a 10x10 table of random integers
spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=(10, 10))))

# converting it to a pandas DataFrame with the default settings
spark_df.toPandas()


Output

3.17 s

Now let's enable Apache Arrow (used through the PyArrow package) and measure the same conversion again.

Python3




%%time
import numpy as np
import pandas as pd
import pyspark

# creating session in PySpark
spark = pyspark.sql.SparkSession.builder.getOrCreate()

# creating a PySpark DataFrame from a 10x10 table of random integers
spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=(10, 10))))

# enabling Arrow-based conversion; this is the setting name since Spark 3.0,
# the older 'spark.sql.execution.arrow.enabled' is deprecated
spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', 'true')
spark_df.toPandas()


Output

460 ms

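Here we can see that, in this run, enabling Arrow reduced the measured time from 3.17 s to 460 ms. Note that %%time measures the whole cell, including the creation of the DataFrame, so these figures are indicative rather than a precise benchmark of toPandas() alone.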
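The same comparison can be scripted so that only the toPandas() call is timed. The following is a minimal sketch, assuming Spark 3.0 or later (where the Arrow switch is named spark.sql.execution.arrow.pyspark.enabled) and an already running session; the helper names set_arrow, time_to_pandas and compare_arrow are made up for illustration.

```python
import time

# Setting name used since Spark 3.0 for Arrow-based pandas conversion.
ARROW_KEY = "spark.sql.execution.arrow.pyspark.enabled"


def set_arrow(spark, enabled):
    """Switch Arrow-based conversion on or off for the given session."""
    spark.conf.set(ARROW_KEY, "true" if enabled else "false")


def time_to_pandas(spark_df):
    """Run toPandas() once and return (pandas_df, elapsed_seconds)."""
    start = time.perf_counter()
    pandas_df = spark_df.toPandas()
    return pandas_df, time.perf_counter() - start


def compare_arrow(spark, spark_df):
    """Time toPandas() with Arrow off, then on; Arrow is left enabled."""
    timings = {}
    for label, enabled in (("arrow_off", False), ("arrow_on", True)):
        set_arrow(spark, enabled)
        _, timings[label] = time_to_pandas(spark_df)
    return timings
```

Calling compare_arrow(spark, spark_df) returns a dictionary with the two elapsed times in seconds. A single run on a small DataFrame is noisy, so repeating the measurement on a larger DataFrame gives a more trustworthy picture.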

Dominic