
Parsing tables and XML with BeautifulSoup

Prerequisites: Web scraping using Beautiful Soup, XML Parsing

Web scraping is an essential skill: it lets a programmer extract data from a website or a file and reuse it in another form. In this article, we will learn how to extract a table from a website and parse XML from a file.
Here, we will scrape data using the Beautiful Soup Python module.

Modules Required:

  • bs4: Beautiful Soup is a Python library for pulling data out of HTML and XML files. It can be installed using the below command:
pip install bs4
  • lxml: It is a Python library that allows us to handle XML and HTML files. It can be installed using the below command:
pip install lxml
  • requests: Requests allows you to send HTTP/1.1 requests extremely easily. It can be installed using the below command:
pip install requests
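
Before moving on, a quick sanity check (a minimal sketch, assuming the three packages above are installed) confirms that the imports resolve:

```python
# verify that the scraping dependencies are importable
import bs4        # Beautiful Soup
import lxml       # parser backend used by the "lxml" and "xml" features
import requests   # HTTP client

print(bs4.__version__)
print(requests.__version__)
```

If any of these imports raises ModuleNotFoundError, re-run the corresponding pip command above.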

Step-by-step Approach to parse Tables:

Step 1: Firstly, we need to import modules and then assign the URL.

Python3




# import required modules
import bs4 as bs
import requests
 
# assign URL (placeholder; replace with the page containing the table)
URL = "https://www.example.com/table-page"


Step 2: Create a BeautifulSoup object for parsing.

Python3




# parsing
url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")


Step 3: Then find the table and its rows. 

Python3




# find the table and its rows
find_table = file.find('table', class_='numpy-table')
rows = find_table.find_all('tr')


Step 4: Now loop over the rows, find all the td (table data) tags in each row, and print their text.

Python3




# display tables
for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)


Below is the complete program based on the above approach:

Python3




# import required modules
import bs4 as bs
import requests
 
# assign URL (placeholder; replace with the page containing the table)
URL = "https://www.example.com/table-page"
 
# parsing
url_link = requests.get(URL)
file = bs.BeautifulSoup(url_link.text, "lxml")
 
# find the table and its rows
find_table = file.find('table', class_='numpy-table')
rows = find_table.find_all('tr')
 
# display tables
for i in rows:
    table_data = i.find_all('td')
    data = [j.text for j in table_data]
    print(data)


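The same row-extraction loop can be tried without a network connection. The sketch below uses a made-up inline HTML table (standing in for a downloaded page) to demonstrate the identical find/find_all pattern:

```python
from bs4 import BeautifulSoup

# a small hypothetical HTML table, standing in for a downloaded page
html = """
<table class="numpy-table">
  <tr><th>Function</th><th>Description</th></tr>
  <tr><td>np.array</td><td>Create an array</td></tr>
  <tr><td>np.zeros</td><td>Array of zeros</td></tr>
</table>
"""

# html.parser is used here so the snippet runs without lxml installed
soup = BeautifulSoup(html, "html.parser")
table = soup.find('table', class_='numpy-table')

rows = []
for tr in table.find_all('tr'):
    cells = [td.text for td in tr.find_all('td')]
    rows.append(cells)

print(rows)  # the header row yields [] because it uses th, not td
```

Note that header rows built with th tags produce empty lists here; if you need headers too, search for both 'td' and 'th' in each row.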

Step-by-step Approach to parse XML files:

Step 1: Before moving on, you can create your own XML file, or simply copy the code below and save it as test1.xml on your system.

<?xml version="1.0" ?>
<books>
  <book>
    <title>Introduction of GeeksforLazyroar V1</title>
    <author>Gfg</author>
    <price>6.99</price>
  </book>
  <book>
    <title>Introduction of GeeksforLazyroar V2</title>
    <author>Gfg</author>
    <price>8.99</price>
  </book>
  <book>
    <title>Introduction of GeeksforLazyroar V2</title>
    <author>Gfg</author>
    <price>9.35</price>
  </book>
</books>
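
If you prefer to create the file programmatically, a short sketch (writing the same content shown above, abridged here to a single book entry for brevity) would be:

```python
# write a sample XML file like the one shown above to test1.xml
xml_content = """<?xml version="1.0" ?>
<books>
  <book>
    <title>Introduction of GeeksforLazyroar V1</title>
    <author>Gfg</author>
    <price>6.99</price>
  </book>
</books>
"""

with open("test1.xml", "w") as f:
    f.write(xml_content)
```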

Step 2: Create a Python file and import the required module.

Python3




# import required modules
from bs4 import BeautifulSoup


Step 3: Read the content of the XML.

Python3




# reading content
with open("test1.xml", "r") as file:
    contents = file.read()


Step 4: Parse the content of the XML.

Python3




# parsing
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')


Step 5: Display the content of the XML file.

Python3




# display content
for data in titles:
    print(data.get_text())


Below is the complete program based on the above approach:

Python3




# import required modules
from bs4 import BeautifulSoup
 
# reading content
with open("test1.xml", "r") as file:
    contents = file.read()
 
# parsing
soup = BeautifulSoup(contents, 'xml')
titles = soup.find_all('title')
 
# display content
for data in titles:
    print(data.get_text())
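
Beyond titles, the other tags in the file can be pulled the same way. The sketch below parses the sample data from an inline string (so it runs without test1.xml on disk) and collects title/price pairs; it uses 'html.parser', which works here because the tags are lowercase, though the article's 'xml' parser (backed by lxml) is the more robust choice for XML:

```python
from bs4 import BeautifulSoup

# same sample data as test1.xml, inlined so the snippet is self-contained
xml_doc = """
<books>
  <book>
    <title>Introduction of GeeksforLazyroar V1</title>
    <author>Gfg</author>
    <price>6.99</price>
  </book>
  <book>
    <title>Introduction of GeeksforLazyroar V2</title>
    <author>Gfg</author>
    <price>8.99</price>
  </book>
</books>
"""

soup = BeautifulSoup(xml_doc, "html.parser")

# collect (title, price) for every book element
books = []
for book in soup.find_all('book'):
    title = book.find('title').get_text()
    price = float(book.find('price').get_text())
    books.append((title, price))

print(books)
```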


Output:

Introduction of GeeksforLazyroar V1
Introduction of GeeksforLazyroar V2
Introduction of GeeksforLazyroar V2
