In this article, we are going to extract JSON from HTML using BeautifulSoup in Python.
Module needed
- bs4: Beautiful Soup(bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built-in with Python. To install this type the below command in the terminal.
pip install bs4
- requests: Request allows you to send HTTP/1.1 requests extremely easily. This module also does not come built-in with Python. To install this type the below command in the terminal.
pip install requests
Approach:
- Import all the required modules.
- Pass the URL in the get function(UDF) so that it will pass a GET request to a URL, and it will return a response.
Syntax: requests.get(url, args)
- Now Parse the HTML content using bs4.
Syntax: BeautifulSoup(page.text, ‘html.parser’)
Parameters:
- page.text : It is the raw HTML content.
- html.parser : Specifying the HTML parser we want to use.
- Now get all the required data with find() function.
Now find the customer list with li, a, p tag where some unique class or id. You can open the webpage in the browser and inspect the relevant element by pressing right-click as shown in the figure.
- Create a Json file and use json.dump() method to convert python objects into appropriate JSON objects.
Below is the full implementation:
Python3
# Import the required modules import requests from bs4 import BeautifulSoup import json # Function will return a list of dictionaries # each containing information of books. def json_from_html_using_bs4(base_url): # requests.get(url) returns a response that is saved # in a response object called page. page = requests.get(base_url) # page.text gives us access to the web data in text # format, we pass it as an argument to BeautifulSoup # along with the html.parser which will create a # parsed tree in soup. soup = BeautifulSoup(page.text, "html.parser" ) # soup.find_all finds the div's, all having the same # class "col-xs-6 col-sm-4 col-md-3 col-lg-3" that is # stored in books books = soup.find_all( 'li' , attrs = { 'class' : 'col-xs-6 col-sm-4 col-md-3 col-lg-3' }) # Initialise the required variables star = [ 'One' , 'Two' , 'Three' , 'Four' , 'Five' ] res, book_no = [], 1 # Iterate books classand check for the given tags # to get the information of each books. for book in books: # Title of book in <img> tag with "alt" key. title = book.find( 'img' )[ 'alt' ] # Link of book in <a> tag with "href" key link = base_url[: 37 ] + book.find( 'a' )[ 'href' ] # Rating of book from <p> tag for index in range ( 5 ): find_stars = book.find( 'p' , attrs = { 'class' : 'star-rating ' + star[index]}) # Check which star-rating class is not # returning None and then break the loop if find_stars is not None : stars = star[index] + " out of 5" break # Price of book from <p> tag in price_color class price = book.find( 'p' , attrs = { 'class' : 'price_color' }).text # Stock Status of book from <p> tag in # instock availability class. instock = book.find( 'p' , attrs = { 'class' : 'instock availability' }).text.strip() # Create a dictionary with the above book information data = { 'book no' : str (book_no), 'title' : title, 'rating' : stars, 'price' : price, 'link' : link, 'stock' : instock} # Append the dictionary to the list res.append(data) book_no + = 1 return res # Main Function if __name__ = = "__main__" : # Enter the url of website # Function will return a list of dictionaries res = json_from_html_using_bs4(base_url) # Convert the python objects into json object and export # it to books.json file. with open ( 'books.json' , 'w' , encoding = 'latin-1' ) as f: json.dump(res, f, indent = 8 , ensure_ascii = False ) print ( "Created Json File" ) |
Output:
Created Json File
Our JSON file output: