Friday, December 27, 2024

Convert PDF to CSV using Python

Python is a popular high-level, general-purpose programming language. Python 3, the current major version, is used in web development, machine learning applications, and many other areas of the software industry. It suits beginners as well as programmers with experience in other languages such as C++ and Java.

In this article, we will learn how to convert a PDF file to a CSV file using Python. We will discuss two methods for the conversion, each of which takes a PDF file as input.

There is also a tool called UPDF that can be used to convert a PDF file to a CSV file.

Method 1:

Here we will use the pdftables_api module to convert the PDF file. The module is a client for the PDFTables web service: it reads the tables in a PDF and converts them into another format such as CSV.

Installation:

Open Command Prompt and type "pip install git+https://github.com/pdftables/python-pdftables-api.git"
  • This installs the pdftables_api module.
  • After installation, you need an API key.
  • Go to PDFTables.com and sign up, then visit the API page to see your API key.

Approach:

  • Create a client with your API key.
  • To convert the PDF file into a CSV file, use the client's csv() method.

Syntax:

pdftables_api.Client('API KEY').csv(pdf_path, csv_path)

Below is the Implementation:

PDF File Used:

PDF FILE

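Python3

# Import the module
import pdftables_api

# Create a client with your API key
conversion = pdftables_api.Client('API KEY')

# Paths of the input PDF and the output CSV
# (for example, 'Hello.pdf' and 'Hello')
pdf_file_path = 'Hello.pdf'
output_file_path = 'Hello'

# Convert the PDF to CSV
conversion.csv(pdf_file_path, output_file_path)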


Output:

CSV FILE
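
The script above hard-codes the API key and both paths. The following is a minimal sketch of one way to avoid that, assuming the key is stored in an environment variable. The variable name PDFTABLES_API_KEY and the helper names csv_path_for and convert_with_pdftables are hypothetical and chosen only for this illustration; the only pdftables_api calls used are Client() and csv(), as shown in the syntax above.

```python
import os
from pathlib import Path


def csv_path_for(pdf_path):
    """Return a CSV path next to the PDF, e.g. Hello.pdf -> Hello.csv."""
    return str(Path(pdf_path).with_suffix(".csv"))


def convert_with_pdftables(pdf_path, api_key=None):
    """Convert one PDF to CSV through PDFTables and return the CSV path."""
    # PDFTABLES_API_KEY is a hypothetical variable name used for this sketch.
    api_key = api_key or os.environ.get("PDFTABLES_API_KEY")
    if not api_key:
        raise ValueError("No API key: set PDFTABLES_API_KEY or pass api_key")
    if not Path(pdf_path).is_file():
        raise FileNotFoundError(pdf_path)

    # Imported here so the helper above works without the package installed.
    import pdftables_api

    out_path = csv_path_for(pdf_path)
    pdftables_api.Client(api_key).csv(pdf_path, out_path)
    return out_path
```

With the environment variable set, calling convert_with_pdftables('Hello.pdf') would be expected to write Hello.csv next to the input file. Keeping the key out of the source file also makes the script safer to share.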

Method 2:

Here we will use the tabula-py module to convert the PDF file. tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF. With it, you can read tables from a PDF into pandas DataFrames, and you can also convert a PDF file into a CSV, TSV, or JSON file.

Installation:

pip install tabula-py

Before we start, we need to install Java and add the Java installation folder to the PATH variable.

  • Install Java.
  • Add the Java installation folder (for example, C:\Program Files (x86)\Java\jre1.8.0_251\bin) to the PATH environment variable.
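
Since tabula-py depends on Java being reachable through PATH, it can help to check that before running any conversion. The sketch below uses only the Python standard library; the function names find_java and java_version_text are made up for this example.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the java executable found on PATH, or None."""
    return shutil.which("java")


def java_version_text(java_path):
    """Run `java -version` and return whatever it prints."""
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True
    )
    # Java usually prints its version banner to stderr rather than stdout.
    return (result.stderr or result.stdout).strip()


java = find_java()
if java is None:
    print("Java was not found on PATH; check the installation folder setting.")
else:
    print(java_version_text(java))
```

If the script reports that Java was not found, reopen the Command Prompt after editing PATH, because a terminal that is already open keeps the old value.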

Approach:

  • Read the tables in the PDF file using the read_pdf() method, which returns a list of DataFrames.
  • Then write a table to a CSV file using the DataFrame's to_csv() method.

Syntax:

read_pdf(PDF File Path, pages = Page number(s) to read, **kwargs)

Below is the Implementation:

PDF File Used:

PDF FILE

Python3

# Import the module
import tabula

# Read the tables on page 1 of the PDF file;
# read_pdf() returns a list of DataFrames, so take the first one
df = tabula.read_pdf('PDF File Path', pages=1)[0]

# Write the table to a CSV file
df.to_csv('CSV File Path')


Output:

CSV FILE
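
The example above keeps only the first table on page 1. When a PDF holds several tables, one possible approach is to read every page and write each table to its own CSV file, as in the sketch below. The helper names save_tables and pdf_tables_to_csv and the numbered file naming are assumptions made for this illustration; read_pdf() with pages="all" and to_csv() are the only library calls it relies on.

```python
from pathlib import Path


def save_tables(tables, output_dir, stem="table"):
    """Write each DataFrame in `tables` to its own CSV and return the paths."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, table in enumerate(tables, start=1):
        path = out_dir / f"{stem}_{number}.csv"
        # index=False leaves the pandas row index out of the CSV file.
        table.to_csv(path, index=False)
        paths.append(str(path))
    return paths


def pdf_tables_to_csv(pdf_path, output_dir):
    """Read every table tabula-py finds in the PDF and save each as CSV."""
    # Imported here so save_tables can be used without tabula-py or Java.
    import tabula

    # pages="all" scans the whole document and returns a list of DataFrames.
    tables = tabula.read_pdf(pdf_path, pages="all", multiple_tables=True)
    return save_tables(tables, output_dir, stem=Path(pdf_path).stem)
```

For example, pdf_tables_to_csv('Hello.pdf', 'output') would produce output/Hello_1.csv, output/Hello_2.csv, and so on, one per detected table. If a single combined file is enough, tabula-py also provides convert_into(), which writes the CSV directly, for instance tabula.convert_into('Hello.pdf', 'Hello.csv', output_format='csv', pages='all').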
