PySpark – Read CSV file into DataFrame

27 July 2024


In this article, we will see how to read CSV files into a DataFrame using PySpark and Python.

Files Used: authors.csv and book_author.csv

Read CSV File into DataFrame

Here we read a single CSV file into a PySpark DataFrame using spark.read.csv and then convert it to a pandas DataFrame using .toPandas().

Python3

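from pyspark.sql import SparkSession

# Start (or reuse) a Spark session for this application.
spark = SparkSession.builder.appName(
    'Read CSV File into DataFrame').getOrCreate()

# Read the CSV: comma-separated, first row as header, column types inferred.
authors = spark.read.csv('/content/authors.csv', sep=',',
                         inferSchema=True, header=True)

# Convert to a pandas DataFrame; this collects all rows on the driver.
df = authors.toPandas()
df.head()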


Here, we first passed the path to our CSV file, authors.csv. Second, we passed the delimiter used in the file, which is a comma ','. Next, we set the inferSchema option to True, so Spark goes through the CSV file and infers the data type of each column instead of reading every column as a string. We also set header to True so that the first row is used for the column names. Finally, we converted the PySpark DataFrame to a pandas DataFrame df using the toPandas() method.

Read Multiple CSV Files

To read multiple CSV files, we pass a Python list of file paths (as strings) to spark.read.csv.

Python3

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Read Multiple CSV Files').getOrCreate()

# List every file to read; all of them end up in one DataFrame.
path = ['/content/authors.csv',
        '/content/book_author.csv']

files = spark.read.csv(path, sep=',',
                       inferSchema=True, header=True)

# display() is provided by notebooks (Jupyter, Colab); use print() elsewhere.
df1 = files.toPandas()
display(df1.head())
display(df1.tail())


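Here, we read authors.csv and book_author.csv, both located in the /content directory, with a comma ',' as the delimiter and the first row as the header. The rows of both files are combined into a single DataFrame, so this approach is best suited to files that share the same columns.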

Read All CSV Files in Directory

To read all CSV files in a directory, we use the * wildcard in the path so that it matches every file in that directory ending in .csv.

Python3

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(
    'Read All CSV Files in Directory').getOrCreate()

# The * wildcard matches every .csv file in /content.
file2 = spark.read.csv('/content/*.csv', sep=',',
                       inferSchema=True, header=True)

# display() is provided by notebooks (Jupyter, Colab); use print() elsewhere.
df1 = file2.toPandas()
display(df1.head())
display(df1.tail())


This reads all the CSV files in the /content directory, using a comma ',' as the delimiter and the first row as the header.
