
Parsing tables and XML with BeautifulSoup

Prerequisites: Web scraping using Beautiful Soup, XML Parsing

Web scraping is an essential skill: it lets a programmer extract data from a website or a file and reuse it in another form. In this article, we will learn how to extract a table from a website and parse XML from a file.
Here, we will scrape data using the Beautiful Soup Python module.

Modules Required:

  • bs4: Beautiful Soup is a Python library for pulling data out of HTML and XML files. It can be installed using the below command:
pip install bs4
  • lxml: It is a Python library that allows us to handle XML and HTML files. It can be installed using the below command:
pip install lxml
  • requests: Requests allows you to send HTTP/1.1 requests extremely easily. It can be installed using the below command:
pip install requests
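
Before moving on, a quick sanity check (a minimal sketch, assuming the three packages above are installed) confirms that the imports resolve:

```python
# verify that the scraping dependencies are importable
import bs4        # Beautiful Soup
import lxml       # parser backend used by the "lxml" and "xml" features
import requests   # HTTP client

print(bs4.__version__)
print(requests.__version__)
```

If any of these imports raises ModuleNotFoundError, re-run the corresponding pip command above.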

Step-by-step Approach to parse Tables:

Step 1: Firstly, we need to import modules and then assign the URL.

Python3




# import required modules
import bs4 as bs
import requests
 
# assign URL (placeholder; replace with the page containing the table)
URL = "https://www.example.com/table-page"


Step 2: Create a BeautifulSoup object for parsing.

Python3




# parsing
url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")


Step 3: Then find the table and its rows. 

Python3




# find the table and its rows
find_table = file.find('table', class_='numpy-table')
rows = find_table.find_all('tr')


Step 4: Now loop over the rows, find all the td (table data) tags in each row, and print their text.

Python3




# display tables
for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)


Below is the complete program based on the above approach:

Python3




# import required modules
import bs4 as bs
import requests
 
# assign URL (placeholder; replace with the page containing the table)
URL = "https://www.example.com/table-page"
 
# parsing
url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")
 
# find the table and its rows
find_table = file.find('table', class_='numpy-table')
rows = find_table.find_all('tr')
 
# display tables
for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)


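The same row-extraction loop can be tried without a network connection. The sketch below uses a made-up inline HTML table (standing in for a downloaded page) to demonstrate the identical find/find_all pattern:

```python
from bs4 import BeautifulSoup

# a small hypothetical HTML table, standing in for a downloaded page
html = """
<table class="numpy-table">
  <tr><th>Function</th><th>Description</th></tr>
  <tr><td>np.array</td><td>Create an array</td></tr>
  <tr><td>np.zeros</td><td>Array of zeros</td></tr>
</table>
"""

# html.parser is used here so the snippet runs without lxml installed
soup = BeautifulSoup(html, "html.parser")
table = soup.find('table', class_='numpy-table')

rows = []
for tr in table.find_all('tr'):
    cells = [td.text for td in tr.find_all('td')]
    rows.append(cells)

print(rows)  # the header row yields [] because it uses th, not td
```

Note that header rows built with th tags produce empty lists here; if you need headers too, search for both 'td' and 'th' in each row.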

Step-by-step Approach to parse XML files:

Step 1: Before moving on, you can create your own XML file, or simply copy the code below and save it as test1.xml on your system.

<?xml version="1.0" ?>
<books>
  <book>
    <title>Introduction of GeeksforLazyroar V1</title>
    <author>Gfg</author>
    <price>6.99</price>
  </book>
  <book>
    <title>Introduction of GeeksforLazyroar V2</title>
    <author>Gfg</author>
    <price>8.99</price>
  </book>
  <book>
    <title>Introduction of GeeksforLazyroar V2</title>
    <author>Gfg</author>
    <price>9.35</price>
  </book>
</books>
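
If you prefer to create the file programmatically, a short sketch (writing the same content shown above, abridged here to a single book entry for brevity) would be:

```python
# write a sample XML file like the one shown above to test1.xml
xml_content = """<?xml version="1.0" ?>
<books>
  <book>
    <title>Introduction of GeeksforLazyroar V1</title>
    <author>Gfg</author>
    <price>6.99</price>
  </book>
</books>
"""

with open("test1.xml", "w") as f:
    f.write(xml_content)
```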

Step 2: Create a Python file and import the required module.

Python3




# import required modules
from bs4 import BeautifulSoup


Step 3: Read the content of the XML.

Python3




# reading content
with open("test1.xml", "r") as file:
    contents = file.read()


Step 4: Parse the content of the XML.

Python3




# parsing
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')


Step 5: Display the content of the XML file.

Python3




# display content
for data in titles:
    print(data.get_text())


Below is the complete program based on the above approach:

Python3




# import required modules
from bs4 import BeautifulSoup
 
# reading content
with open("test1.xml", "r") as file:
    contents = file.read()
 
# parsing
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')
 
# display content
for data in titles:
    print(data.get_text())
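
Beyond titles, the other tags in the file can be pulled the same way. The sketch below parses the sample data from an inline string (so it runs without test1.xml on disk) and collects title/price pairs; it uses 'html.parser', which works here because the tags are lowercase, though the article's 'xml' parser (backed by lxml) is the more robust choice for XML:

```python
from bs4 import BeautifulSoup

# same sample data as test1.xml, inlined so the snippet is self-contained
xml_doc = """
<books>
  <book>
    <title>Introduction of GeeksforLazyroar V1</title>
    <author>Gfg</author>
    <price>6.99</price>
  </book>
  <book>
    <title>Introduction of GeeksforLazyroar V2</title>
    <author>Gfg</author>
    <price>8.99</price>
  </book>
</books>
"""

soup = BeautifulSoup(xml_doc, "html.parser")

# collect (title, price) for every book element
books = []
for book in soup.find_all('book'):
    title = book.find('title').get_text()
    price = float(book.find('price').get_text())
    books.append((title, price))

print(books)
```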


Output:

Introduction of GeeksforLazyroar V1
Introduction of GeeksforLazyroar V2
Introduction of GeeksforLazyroar V2
