Python is a popular high-level, general-purpose programming language. Python 3, the current major version, is used in web development, machine learning applications, and many other areas of the software industry. It is well suited to beginners and also to experienced programmers coming from languages such as C++ and Java.
In this article, we will learn how to convert a PDF file to a CSV file using Python. We will discuss various methods for the conversion. Each method takes a PDF file as input.
There is a tool called UPDF that can be used to convert a PDF file to CSV file.
Method 1:
Here we will use the pdftables_api module to convert the PDF file. The pdftables_api module reads the tables in a PDF and can also convert a PDF file into other formats.
Installation:
Open Command Prompt and type "pip install git+https://github.com/pdftables/python-pdftables-api.git"
- It will install the pdftables_api Module
- After Installation, you need an API KEY.
- Go to PDFTables.com and sign up, then visit the API page to see your API key.
Approach:
- Create a client with pdftables_api.Client(), passing your API key.
- Call the client's csv() method to convert the PDF file into a CSV file.
Syntax:
pdftables_api.Client('API KEY').csv(pdf_path, csv_path)
Below is the Implementation:
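# Import Module
import pdftables_api

# Paths of the input PDF and the output CSV
pdf_file_path = "Hello.pdf"
output_file_path = "Hello"

# Create the API client (replace 'API KEY' with your own key)
conversion = pdftables_api.Client('API KEY')

# PDF to CSV
conversion.csv(pdf_file_path, output_file_path)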
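The Method 1 call can also be wrapped in a small function that checks its inputs before contacting the service. The following is only a sketch under a few assumptions: the environment variable name PDFTABLES_API_KEY, the function name convert_pdf_to_csv, and the client_factory parameter are all hypothetical names introduced here for illustration. They are not part of the pdftables_api module. Only Client(api_key) and its csv(pdf_path, csv_path) method come from the syntax shown above.

```python
import os
from pathlib import Path


def convert_pdf_to_csv(pdf_path, csv_path, api_key=None, client_factory=None):
    """Convert one PDF file to CSV through the PDFTables API.

    api_key falls back to the PDFTABLES_API_KEY environment variable
    (a hypothetical name chosen for this sketch). client_factory is an
    illustration-only hook that lets the function be exercised without
    network access; when omitted, pdftables_api.Client is used.
    """
    pdf_path = Path(pdf_path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"Input PDF not found: {pdf_path}")

    api_key = api_key or os.environ.get("PDFTABLES_API_KEY")
    if not api_key:
        raise ValueError("No API key given and PDFTABLES_API_KEY is not set")

    if client_factory is None:
        # Imported lazily so the checks above run even if the module is missing.
        import pdftables_api
        client_factory = pdftables_api.Client

    client = client_factory(api_key)
    client.csv(str(pdf_path), str(csv_path))
    return Path(csv_path)


# Example call (assumes Hello.pdf exists and the API key is set):
# convert_pdf_to_csv("Hello.pdf", "Hello.csv")
```

Reading the key from the environment keeps it out of the script itself, which is convenient when the script is shared or committed to version control.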
Method 2:
Here we will use the tabula-py module to convert the PDF file. tabula-py is a simple Python wrapper of tabula-java, which can read tables in a PDF. You can read tables from a PDF and convert them into pandas DataFrames. tabula-py also lets you convert a PDF file into a CSV, TSV, or JSON file.
Installation:
pip install tabula-py
Before we start, we need to install Java and add the Java installation folder to the PATH environment variable.
- Install Java.
- Add the Java installation folder (for example, C:\Program Files (x86)\Java\jre1.8.0_251\bin) to the PATH environment variable.
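Before running tabula-py, it can help to confirm that the java executable is actually reachable through PATH. The snippet below is a minimal sketch using only the Python standard library. The helper names find_java and java_version_banner are made up for this example.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the java executable found on PATH, or None."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the first line printed by `java -version`, or None on failure."""
    try:
        result = subprocess.run(
            [java_path, "-version"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    # `java -version` usually writes its banner to stderr rather than stdout.
    output = (result.stderr or result.stdout).strip()
    return output.splitlines()[0] if output else None


java_path = find_java()
if java_path is None:
    print("java was not found on PATH; add the Java bin folder to PATH first")
else:
    print("Found java at", java_path)
    print(java_version_banner(java_path))
```

If the first message is printed, the PATH change has not taken effect yet. On Windows, a new Command Prompt window is usually needed after editing environment variables.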
Approach:
- Read the PDF file using the read_pdf() method, which returns a list of DataFrames, one for each table it finds.
- Write the DataFrame to a CSV file using its to_csv() method.
Syntax:
read_pdf(PDF File Path, pages = Page number(s) to read, **kwargs)
Below is the Implementation:
# Import Module
import tabula

# Read PDF File
# read_pdf() returns a list of DataFrames; [0] takes the first table
df = tabula.read_pdf("PDF File Path", pages=1)[0]

# Convert into CSV File
df.to_csv("CSV File Path")
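The Method 2 code keeps only the first table on page 1. When a PDF holds several tables, each one can be written to its own CSV file. The following is a rough sketch, not the only way to do it: the function names save_tables_as_csv and pdf_to_csv_files and the output naming scheme are assumptions made for this example. It relies on read_pdf() accepting pages="all" and returning a list of DataFrames.

```python
from pathlib import Path


def save_tables_as_csv(tables, out_dir, stem="table"):
    """Write each table to <out_dir>/<stem>_<n>.csv and return the paths.

    tables is any iterable of objects with a pandas-style
    to_csv(path, index=...) method, such as the DataFrames
    returned by tabula.read_pdf().
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, table in enumerate(tables, start=1):
        target = out_dir / f"{stem}_{number}.csv"
        # index=False leaves the DataFrame row index out of the CSV file.
        table.to_csv(target, index=False)
        written.append(target)
    return written


def pdf_to_csv_files(pdf_path, out_dir):
    """Read every table on every page of a PDF and save each as a CSV file."""
    # Imported lazily; requires tabula-py and a working Java installation.
    import tabula

    tables = tabula.read_pdf(str(pdf_path), pages="all")
    return save_tables_as_csv(tables, out_dir, stem=Path(pdf_path).stem)


# Example call (assumes Hello.pdf exists in the current directory):
# pdf_to_csv_files("Hello.pdf", "csv_output")
```

If a single combined CSV file is enough, tabula-py also provides tabula.convert_into(), which writes the output file directly without going through DataFrames.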