How to parse local HTML file in Python?

28 July 2024

3

Parsing means dividing a file or input into pieces of information/data that can be stored for our personal use in the future. Sometimes, we need data from an existing file stored on our computers, parsing technique can be used in such cases. The parsing includes multiple techniques used to extract data from a file. The following includes Modifying the file, Removing something from the file, Printing data, using the recursive child generator method to traverse data from the file, finding the children of tags, web scraping from a link to extract useful information, etc.

Modifying the file

Using the prettify method to modify the HTML code from- https://festive-knuth-1279a2.netlify.app/, look better. Prettify makes the code look in the standard form like the one used in VS Code.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Importing the HTTP library
import requests as req
  
# Requesting for the website
Web = req.get('https://festive-knuth-1279a2.netlify.app/')
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(Web.text, 'lxml')
  
# Using the prettify method
print(S.prettify())

Output:

Removing a tag

A tag can be removed by using the decompose method and the select_one method with the CSS selectors to select and then remove the second element from the li tag and then using the prettify method to modify the HTML code from the index.html file.

Example:

File Used:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file
HTMLFile = open("index.html", "r")
  
# Reading the file
index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Using the select-one method to find the second element from the li tag
Tag = S.select_one('li:nth-of-type(2)')
  
# Using the decompose method
Tag.decompose()
  
# Using the prettify method to modify the code
print(S.body.prettify())

Output:

Finding tags

Tags can be found normally and printed normally using print().

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file
HTMLFile = open("index.html", "r")
  
# Reading the file
index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
Parse = BeautifulSoup(index, 'lxml')
  
# Printing html code of some tags
print(Parse.head)
print(Parse.h1)
print(Parse.h2)
print(Parse.h3)
print(Parse.li)

Output:

Traversing tags

The recursiveChildGenerator method is used to traverse tags, which recursively finds all the tags within tags from the file.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file
HTMLFile = open("index.html", "r")
  
# Reading the file
index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Using the recursiveChildGenerator method to traverse the html file
for TraverseTags in S.recursiveChildGenerator():
  # Traversing the names of the tags
    if TraverseTags.name:
      # Printing the names of the tags
        print(TraverseTags.name)

Output:

Parsing name and text attributes of tags

Using the name attribute of the tag to print its name and the text attribute to print its text along with the code of the tag- ul from the file.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file
HTMLFile = open("index.html", "r")
  
# Reading the file
index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Printing the Code, name, and text of a tag
print(f'HTML: {S.ul}, name: {S.ul.name}, text: {S.ul.text}')

Output:

Finding Children of a tag

The Children attribute is used to get the children of a tag. The Children attribute returns ‘tags with spaces’ between them, we’re adding a condition- e. name is not None to print only names of the tags from the file.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file
HTMLFile = open("index.html", "r")
  
# Reading the file
index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Providing the source
Attr = S.html
  
# Using the Children attribute to get the children of a tag
# Only contain tag names and not the spaces
Attr_Tag = [e.name for e in Attr.children if e.name is not None]
  
# Printing the children
print(Attr_Tag)

Output:

Finding Children at all levels of a tag:

The Descendants attribute is used to get all the descendants (Children at all levels) of a tag from the file.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file
HTMLFile = open("index.html", "r")
  
# Reading the file
index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Providing the source
Des = S.body
  
# Using the descendants attribute
Attr_Tag = [e.name for e in Des.descendants if e.name is not None]
  
# Printing the children
print(Attr_Tag)

Output:

Finding all elements of tags

Using find_all():

The find_all method is used to find all the elements (name and text) inside the p tag from the file.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file
HTMLFile = open("index.html", "r")
  
# Reading the file
index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Using the find_all method to find all elements of a tag
for tag in S.find_all('p'):
  
  # Printing the name, and text of p tag
    print(f'{tag.name}: {tag.text}')

Output:

CSS selectors to find elements:

Using the select method to use the CSS selectors to find the second element from the li tag from the file.

Example:

Python3

# Importing BeautifulSoup class from the bs4 module
from bs4 import BeautifulSoup
  
# Opening the html file
HTMLFile = open("index.html", "r")
  
# Reading the file
index = HTMLFile.read()
  
# Creating a BeautifulSoup object and specifying the parser
S = BeautifulSoup(index, 'lxml')
  
# Using the select method
# Prints the second element from the li tag
print(S.select('li:nth-of-type(2)'))

Output:

How to parse local HTML file in Python?

Modifying the file

Python3

Removing a tag

Python3

Finding tags

Python3

Traversing tags

Python3

Parsing name and text attributes of tags

Python3

Finding Children of a tag

Python3

Python3

Finding all elements of tags

Python3

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

Is Microsoft Teams Secure? Use Teams Safely in 2024 by Tyler Cross

Interview With Willem Dewulf – CEO of ProBackup by Shauli Zacks

Recent Comments

EDITOR PICKS

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

Is Microsoft Teams Secure? Use Teams Safely in 2024 by Tyler Cross

POPULAR POSTS

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

Is Microsoft Teams Secure? Use Teams Safely in 2024 by Tyler Cross

POPULAR CATEGORY

ABOUT US

FOLLOW US