Web scraping is a useful skill for collecting data from websites. Scraping and parsing an HTML table can be tedious with a general-purpose parser such as Beautiful Soup. This article therefore describes a library, html-table-parser-python3, that extracts the HTML tables from a web page with very little code. With this method you do not need to inspect the page's elements; you only need to provide the URL of the page, and the tables are parsed in a few lines of code.
Installation
You can use pip to install this library:
pip install html-table-parser-python3
Getting Started
Step 1: Import the necessary libraries required for the task
# Library for opening URLs and creating requests
import urllib.request

# Pretty-print Python data structures
from pprint import pprint

# For parsing all the tables present on the website
from html_table_parser.parser import HTMLTableParser

# For converting the parsed data into a pandas DataFrame
import pandas as pd
Step 2: Defining a function to get the contents of the website
def url_get_contents(url):
    # Opens a website and reads its
    # binary contents (HTTP response body)

    # Make a request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)

    # Read the contents of the website
    return f.read()
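Some sites reject requests that carry urllib's default User-Agent, and not every page is encoded as UTF-8. The following is a minimal sketch of a more defensive variant of the fetch function, not part of the original tutorial. The names `build_request`, `fetch_html`, and `DEFAULT_USER_AGENT`, the header value, and the 10-second timeout are all illustrative assumptions.

```python
import urllib.request

# Hypothetical header value; any browser-like string may work.
DEFAULT_USER_AGENT = "Mozilla/5.0 (compatible; table-scraper-example)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Create a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url=url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and decode it with the charset the server declares."""
    req = build_request(url)
    # The context manager closes the connection once the body is read.
    with urllib.request.urlopen(req, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With this variant, `fetch_html(url)` returns a decoded string directly, so the separate `.decode('utf-8')` call is no longer needed.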
Now that the function is ready, specify the URL of the page from which the tables should be parsed.
Note: This example uses the moneycontrol.com website because its pages contain many tables, which makes it a good illustration. The exact page URL is given in the complete code below.
Step 3: Parsing tables
# Fetch the HTML contents of the page;
# replace 'Link' with the URL of the page
xhtml = url_get_contents('Link').decode('utf-8')
# Defining the HTMLTableParser object
p = HTMLTableParser()
# feeding the html contents in the
# HTMLTableParser object
p.feed(xhtml)
# p.tables holds every table found on the page;
# print the required one (here the second table, index 1)
pprint(p.tables[1])
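The index `1` above is specific to the example page, so on another page it may help to list every parsed table before picking one. The sketch below assumes `p.tables` is a list of tables, each a list of rows, each row a list of cell strings. The helper `summarize_tables` and the sample data are hypothetical and stand in for real parser output.

```python
def summarize_tables(tables):
    """Return (index, row_count, first_row) for every parsed table."""
    summary = []
    for index, table in enumerate(tables):
        # An empty table has no first row to show.
        first_row = table[0] if table else []
        summary.append((index, len(table), first_row))
    return summary


# Stand-in data shaped like p.tables; the values are made up.
sample_tables = [
    [["Menu", "Login"]],
    [["Company", "Price"], ["ABC Ltd", "101.5"], ["XYZ Ltd", "99.0"]],
]

for index, row_count, first_row in summarize_tables(sample_tables):
    print(index, row_count, first_row)
```

In a real run, calling `summarize_tables(p.tables)` after `p.feed(xhtml)` would show which index holds the table of interest.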
Each row of the table is stored as a list, so a parsed table is a list of rows. This structure can easily be converted into a pandas DataFrame and used for further analysis.
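When the first row of a parsed table holds the column names, it can be promoted to the DataFrame header. The following is a small sketch under that assumption. The `rows` list is made-up stand-in data for `p.tables[1]`, not output from the example page.

```python
import pandas as pd

# Stand-in for p.tables[1]; the values are made up for illustration.
rows = [
    ["Company", "Price"],
    ["ABC Ltd", "101.5"],
    ["XYZ Ltd", "99.0"],
]

# Use the first row as the header and the remaining rows as data.
df = pd.DataFrame(rows[1:], columns=rows[0])

# Parsed cells are strings, so convert numeric columns explicitly.
df["Price"] = pd.to_numeric(df["Price"])

print(df)
```

Converting the numeric columns matters because every parsed cell arrives as text, and operations such as sums or sorting would otherwise treat the values as strings.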
Complete Code:
# Library for opening URLs and creating requests
import urllib.request

# Pretty-print Python data structures
from pprint import pprint

# For parsing all the tables present on the website
from html_table_parser.parser import HTMLTableParser

# For converting the parsed data into a pandas DataFrame
import pandas as pd


def url_get_contents(url):
    # Opens a website and reads its
    # binary contents (HTTP response body)

    # Make a request to the website
    req = urllib.request.Request(url=url)
    f = urllib.request.urlopen(req)

    # Read the contents of the website
    return f.read()


# Fetch and decode the HTML contents of the URL
xhtml = url_get_contents('https://www.moneycontrol.com/india/stockpricequote/refineries/relianceindustries/RI').decode('utf-8')

# Create the HTMLTableParser object
p = HTMLTableParser()

# Feed the HTML contents into the HTMLTableParser object
p.feed(xhtml)

# Print the data of the required table
pprint(p.tables[1])

# Convert the parsed data to a DataFrame
print("\n\nPANDAS DATAFRAME\n")
print(pd.DataFrame(p.tables[1]))

