Parsing and converting HTML documents to XML format using Python

By Nicole Veronica

27 July 2024

0

In this article, we are going to see how to parse and convert HTML documents to XML format using Python.

It can be done in these ways:

Using Ixml module.
Using Beautifulsoup module.

Method 1: Using the Python lxml library

In this approach, we will use Python’s lxml library to parse the HTML document and write it to an encoded string representation of the XML tree.The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It is unique in that as it combines the speed and XML feature completeness of these libraries with the simplicity of a native Python API, mostly compatible but superior to the well-known ElementTree API.

Installation:

pip install lxml

We need to provide the path to open the HTML document to read and parse it using the html.fromstring(str) function, returning a single element/document tree. This function parses a document from the given string. This always creates a correct HTML document, which means the parent node is <html>, and there is a body and possibly a head.

htmldoc = html.fromstring(inp.read())

And write the parsed HTML element/document tree to an encoded string representation of its XML tree using the etree.tostring() function.

out.write(etree.tostring(htmldoc))

Html file used: input.

Code:

Python3

# Import the required library 
from lxml import html, etree 
  
# Main Function 
if __name__ == '__main__': 
  
    # Provide the path of the html file 
    file = "input.html"
  
    # Open the html file and Parse it,  
    # returning a single element/document. 
    with open(file, 'r', encoding='utf-8') as inp: 
        htmldoc = html.fromstring(inp.read()) 
  
    # Open a output.xml file and write the  
    # element/document to an encoded string  
    # representation of its XML tree. 
    with open("output.xml", 'wb') as out: 
        out.write(etree.tostring(htmldoc))

Output:

Method 2: Using the BeautifulSoup

In this approach, we will use the BeautifulSoup module to parse the raw HTML document using html.parser and modify the parsed document and write it to an XML file. Provide the path to open the HTML file and read the HTML file and Parse it using BeautifulSoup’s html.parser, returning an object of the parsed document.

BeautifulSoup(inp, ‘html.parser’)

To remove the DocType HTML, we need to first get the string representation of the document using soup.prettify() and then split the document by lines using splitlines(), returning a list of lines.

soup.prettify().splitlines()

Code:

Python3

# Import the required library 
from bs4 import BeautifulSoup 
  
# Main Function 
if __name__ == '__main__': 
  
    # Provide the path of the html file 
    file = "input.html"
  
    # Open the html file and Parse it  
    # using Beautiful soup's html.parser. 
    with open(file, 'r', encoding='utf-8') as inp: 
        soup = BeautifulSoup(inp, 'html.parser') 
      
    # Split the document by lines and join the lines 
    # from index 1 to remove the doctype Html as it is  
    # present in index 0 from the parsed document. 
    lines = soup.prettify().splitlines() 
    content = "\n".join(lines[1:]) 
  
    # Open a output.xml file and write the modified content. 
    with open("output.xml", 'w', encoding='utf-8') as out: 
        out.write(content)

Output:

Parsing and converting HTML documents to XML format using Python

Method 1: Using the Python lxml library

Python3

Method 2: Using the BeautifulSoup

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

8 Best VPNs for Apple TV in 2024: Fast & Secure by Penka Hristovska

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

Is Microsoft Teams Secure? Use Teams Safely in 2024 by Tyler Cross

Recent Comments

EDITOR PICKS

8 Best VPNs for Apple TV in 2024: Fast & Secure by Penka Hristovska

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

POPULAR POSTS

8 Best VPNs for Apple TV in 2024: Fast & Secure by Penka Hristovska

Samsung offers free screen replacements for users still suffering green line issues

7 Best Free Antiviruses for Mac in 2024: Are They Any Good? by Katarina Glamoslija

POPULAR CATEGORY

ABOUT US

FOLLOW US