Extract JSON from HTML using BeautifulSoup in Python

27 July 2024


In this article, we will scrape data from an HTML page using BeautifulSoup in Python and save it as JSON.

Modules needed

bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. It is not part of the Python standard library. To install it, run the following command in the terminal.

pip install bs4

requests: Requests allows you to send HTTP/1.1 requests extremely easily. It is also not part of the Python standard library. To install it, run the following command in the terminal.

pip install requests

Approach:

Import all the required modules.
Pass the URL to the requests.get() function, which sends a GET request to that URL and returns a response object.

Syntax: requests.get(url, args)
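As a minimal sketch of this step, the helper below wraps requests.get(). The function name fetch_html and the 10-second timeout are illustrative assumptions, not part of the original article.

```python
import requests


def fetch_html(url, timeout=10):
    """Send a GET request and return the response body as text.

    fetch_html and the default timeout are hypothetical choices
    made for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses so that an
    # error page is not parsed as if it were the real content.
    response.raise_for_status()
    return response.text
```

Calling fetch_html("https://books.toscrape.com/catalogue/page-1.html") would return the HTML of the page used later in this article, provided the site is reachable.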

Now parse the HTML content using bs4.

Syntax: BeautifulSoup(page.text, 'html.parser')

Parameters:

page.text : It is the raw HTML content.

html.parser : Specifying the HTML parser we want to use.
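To illustrate the two parameters, here is a small sketch that parses an inline HTML string instead of a downloaded page. The markup is made up for the example and stands in for page.text.

```python
from bs4 import BeautifulSoup

# Assumed sample markup; in the real script this would be page.text.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p id='intro'>Hello</p></body></html>"
)

# "html.parser" selects the parser that ships with Python.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                      # text inside <title>
intro_text = soup.find("p", id="intro").get_text()  # text inside <p id="intro">

print(page_title, intro_text)
```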

Now extract the required data with the find() and find_all() functions.

Locate the list of books through the li, a, and p tags that carry a unique class or id. To discover these, open the webpage in a browser, right-click the relevant element, and inspect it.
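The lookup by tag and class can be sketched on a small inline snippet. The markup below is a hypothetical imitation of a product listing, written for this example and not copied from a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates a product listing.
html = """
<ol>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <a href="book-one/index.html"><img src="one.jpg" alt="Book One"></a>
    <p class="star-rating Three"></p>
    <p class="price_color">&pound;10.00</p>
  </li>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <a href="book-two/index.html"><img src="two.jpg" alt="Book Two"></a>
    <p class="star-rating Five"></p>
    <p class="price_color">&pound;12.50</p>
  </li>
</ol>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag; find() returns only the
# first match, or None when nothing matches.
items = soup.find_all(
    "li", attrs={"class": "col-xs-6 col-sm-4 col-md-3 col-lg-3"})

titles = [item.find("img")["alt"] for item in items]
links = [item.find("a")["href"] for item in items]
# Searching for a single class name matches any tag that has it among
# its classes; the class attribute comes back as a list of names.
ratings = [item.find("p", class_="star-rating")["class"][1] for item in items]
prices = [item.find("p", class_="price_color").get_text() for item in items]

print(titles, links, ratings, prices)
```

Because find() returns None when a tag is missing, real pages may need a check before indexing into the result.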

Create a JSON file and use the json.dump() method to serialize the Python objects as JSON.
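The following sketch shows json.dump() on a small hand-written list and reads the file back to confirm the round trip. The sample records and the file name books_demo.json are assumptions made for the example.

```python
import json
import os
import tempfile

# Made-up sample records with the same kind of keys the scraper produces.
books = [
    {"book no": "1", "title": "Book One", "price": "\u00a310.00"},
    {"book no": "2", "title": "Book Two", "price": "\u00a312.50"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "books_demo.json")

    # ensure_ascii=False writes the pound sign as-is, so the file
    # must be opened with an encoding that can represent it.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(books, f, indent=4, ensure_ascii=False)

    # Read the file back to check that nothing was lost.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == books)
```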

Below is the full implementation:

Python3

# Import the required modules
import json
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


# Function will return a list of dictionaries,
# each containing the information of one book.
def json_from_html_using_bs4(base_url):

    # requests.get(url) returns a response that is saved
    # in a response object called page.
    page = requests.get(base_url)

    # If the server does not declare a charset, requests
    # falls back to ISO-8859-1 and non-ASCII characters
    # such as the pound sign can come out garbled, so let
    # requests guess the encoding from the body instead.
    page.encoding = page.apparent_encoding

    # page.text gives us access to the web data in text
    # format, we pass it as an argument to BeautifulSoup
    # along with the html.parser which will create a
    # parsed tree in soup.
    soup = BeautifulSoup(page.text, "html.parser")

    # soup.find_all finds all the <li> tags having the
    # class "col-xs-6 col-sm-4 col-md-3 col-lg-3" and
    # stores them in books.
    books = soup.find_all(
        'li', attrs={'class':
                     'col-xs-6 col-sm-4 col-md-3 col-lg-3'})

    # Initialise the required variables
    star = ['One', 'Two', 'Three', 'Four', 'Five']
    res, book_no = [], 1

    # Iterate over books and check for the given tags
    # to get the information of each book.
    for book in books:

        # Title of book in <img> tag with "alt" key.
        title = book.find('img')['alt']

        # Link of book in <a> tag with "href" key. The href
        # is relative, so resolve it against base_url.
        link = urljoin(base_url, book.find('a')['href'])

        # Rating of book from <p> tag. stars stays None
        # if no star-rating class is found.
        stars = None
        for index in range(5):
            find_stars = book.find(
                'p', attrs={'class': 'star-rating ' + star[index]})

            # Check which star-rating class is not
            # returning None and then break the loop
            if find_stars is not None:
                stars = star[index] + " out of 5"
                break

        # Price of book from <p> tag in price_color class
        price = book.find('p', attrs={'class': 'price_color'}).text

        # Stock status of book from <p> tag in
        # instock availability class.
        instock = book.find(
            'p', attrs={'class': 'instock availability'}).text.strip()

        # Create a dictionary with the above book information
        data = {'book no': str(book_no), 'title': title,
                'rating': stars, 'price': price, 'link': link,
                'stock': instock}

        # Append the dictionary to the list
        res.append(data)
        book_no += 1
    return res


# Main Function
if __name__ == "__main__":

    # Enter the url of website
    base_url = "https://books.toscrape.com/catalogue/page-1.html"

    # Function will return a list of dictionaries
    res = json_from_html_using_bs4(base_url)

    # Convert the python objects into json object and export
    # it to books.json file.
    with open('books.json', 'w', encoding='utf-8') as f:
        json.dump(res, f, indent=8, ensure_ascii=False)
    print("Created Json File")

Output:

Created Json File

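Our JSON file output: books.json contains a JSON array with one object per book, each holding the keys "book no", "title", "rating", "price", "link", and "stock".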
