Prerequisites: Web scraping using Beautiful Soup, XML parsing
Scraping is an essential skill to learn; it lets us extract data from a website or a file so that it can be reused by the programmer in other ways. In this article, we will learn how to extract a table from a website and parse XML from a file.
Here, we will scrape data using the Beautiful Soup Python module.
Modules Required:
- bs4: Beautiful Soup is a Python library for pulling data out of HTML and XML files. It can be installed using the command below:
pip install bs4
- lxml: It is a Python library that allows us to handle XML and HTML files. It can be installed using the command below:
pip install lxml
- requests: The Requests library allows you to send HTTP/1.1 requests extremely easily. It can be installed using the command below:
pip install requests
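If you want to confirm that the installations succeeded, a minimal check like the sketch below (just importing each module and printing its version) is usually enough:
Python3
# quick optional check that all three modules installed correctly
import bs4
import lxml
import requests

print(bs4.__version__)
print(lxml.__version__)
print(requests.__version__)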
Step-by-step Approach to parse Tables:
Step 1: First, import the required modules and assign the URL of the page to scrape.
Python3
# import required modules
import bs4 as bs
import requests

# assign URL (replace the placeholder with the page that contains the table)
URL = "https://www.example.com/page-with-table"
Step 2: Create a BeautifulSoup object for parsing.
Python3
# parsing
url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")
Step 3: Then find the table and its rows.
Python3
# find the table with class 'numpy-table' and all of its rows
find_table = file.find('table', class_='numpy-table')
rows = find_table.find_all('tr')
Step 4: Now loop over the rows, find all the td tags in each one, and print the text of every table data cell.
Python3
# display table data
for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)
Below is the complete program based on the above approach:
Python3
# import required modules
import bs4 as bs
import requests

# assign URL (replace the placeholder with the page that contains the table)
URL = "https://www.example.com/page-with-table"

# parsing
url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")

# find the table and its rows
find_table = file.find('table', class_='numpy-table')
rows = find_table.find_all('tr')

# display table data
for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)
Output:
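In practice, the request can fail or the page may not contain a table with the expected class, in which case find() returns None and the later calls raise an AttributeError. The sketch below (using the same placeholder URL as above, which you should replace with a real page) adds these basic checks:
Python3
# a more defensive variant of the table scraper (URL is a placeholder)
import bs4 as bs
import requests

URL = "https://www.example.com/page-with-table"

url_link = requests.get(URL)
url_link.raise_for_status()  # stop early if the HTTP request failed

file = bs.BeautifulSoup(url_link.text, "lxml")
find_table = file.find('table', class_='numpy-table')

if find_table is None:
    print("No table with class 'numpy-table' found on the page")
else:
    for i in find_table.find_all('tr'):
        data = [j.text for j in i.find_all('td')]
        print(data)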
Step-by-step Approach to parse XML files:
Step 1: Before moving on, you can create your own XML file, or simply copy the code below and save it as test1.xml on your system.
<?xml version="1.0" ?>
<books>
  <book>
    <title>Introduction of GeeksforLazyroar V1</title>
    <author>Gfg</author>
    <price>6.99</price>
  </book>
  <book>
    <title>Introduction of GeeksforLazyroar V2</title>
    <author>Gfg</author>
    <price>8.99</price>
  </book>
  <book>
    <title>Introduction of GeeksforLazyroar V2</title>
    <author>Gfg</author>
    <price>9.35</price>
  </book>
</books>
Step 2: Create a Python file and import the required modules.
Python3
# import required modules
from bs4 import BeautifulSoup
Step 3: Read the content of the XML.
Python3
# reading content
file = open("test1.xml", "r")
contents = file.read()
Step 4: Parse the content of the XML.
Python3
# parsing
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')
Step 5: Display the content of the XML file.
Python3
# display content
for data in titles:
    print(data.get_text())
Below is the complete program based on the above approach:
Python3
# import required modules
from bs4 import BeautifulSoup

# reading content
file = open("test1.xml", "r")
contents = file.read()

# parsing
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')

# display content
for data in titles:
    print(data.get_text())
Output:
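Since test1.xml stores more than just titles, the same approach can pull every field of each book element. The short sketch below iterates over the book tags of the same file and reads their title, author, and price children:
Python3
# read every field of each book in test1.xml
from bs4 import BeautifulSoup

with open("test1.xml", "r") as file:
    contents = file.read()

soup = BeautifulSoup(contents, 'xml')

for book in soup.find_all('book'):
    # each child tag can be accessed by name and its text extracted
    print(book.title.get_text(), book.author.get_text(), book.price.get_text())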