Beautiful Soup is a Python library for parsing HTML and XML documents. In this article we will see how to extract all the URLs from a web page that are nested within <li> tags.
Modules needed and installation:
- BeautifulSoup: Our primary module, used to parse the HTML of a web page and search it for tags and attributes.
pip install bs4
- Requests: Used to perform a GET request to the web page and fetch its content.
Note: Requests is not installed automatically with bs4, so if it is not already present in your environment, install it manually.
pip install requests
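Once both packages are installed, a quick sanity check is to import them and print their versions; this is a minimal sketch assuming a standard Python 3 environment:
Python3
# verify that both libraries import correctly
import requests
import bs4

print(requests.__version__)
print(bs4.__version__)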
Approach
- We will first import our required libraries.
- We will perform a GET request to the desired web page from which we want to collect the URLs.
- We will pass the text of the response to the BeautifulSoup function and convert it into a soup object.
- Using a for loop, we will look for all the <li> tags in the webpage.
- If a <li> tag contains an anchor tag, we will look for the href attribute and store its value in a list. This is the URL we were looking for.
- Then print the list that contains all the URLs.
Let’s have a look at the code and see what’s happening at each significant step.
Step 1: Initialize the Python program by importing all the required libraries and setting up the URL of the web page from which you want to extract all the URLs contained in anchor tags.
In the following example, we will take an article on implementing web scraping using BeautifulSoup and extract all the URLs stored in anchor tags nested within <li> tags.
Link to the article: https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/
Python3
# Importing libraries
import requests
from bs4 import BeautifulSoup

# setting up the URL
URL = 'https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/'
Step 2: We will perform a GET request to the desired URL and pass all the text from the response into BeautifulSoup, converting it into a soup object. We will set the parser to html.parser. You can set it differently depending on the webpage you are scraping.
Python3
# perform get request to the url
reqs = requests.get(URL)

# extract all the text that you received
# from the GET request
content = reqs.text

# convert the text to a beautiful soup object
soup = BeautifulSoup(content, 'html.parser')
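As mentioned above, the parser is configurable. For instance, if the third-party lxml parser is installed (pip install lxml), it can be passed in place of html.parser; this is just an illustrative alternative, not a requirement for this tutorial:
Python3
# assumes lxml is installed: pip install lxml
soup = BeautifulSoup(content, 'lxml')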
Step 3: Create an empty list to store all the URLs that you will receive as your desired output. Run a for loop that iterates over all the <li> tags in the web page. For each <li> tag, check whether it contains an anchor tag. If that anchor tag has an href attribute, store the value of that href in the list you created.
Python3
# Empty list to store the output
urls = []

# For loop that iterates over all the <li> tags
for h in soup.findAll('li'):
    # looking for anchor tag inside the <li> tag
    a = h.find('a')
    try:
        # looking for href inside anchor tag
        if 'href' in a.attrs:
            # storing the value of href in a
            # separate variable
            url = a.get('href')
            # appending the url to the output list
            urls.append(url)
    # if the <li> tag does not have an anchor tag (a is
    # None), we pass; a missing href is handled by the if
    except AttributeError:
        pass
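As a side note, the same URLs can be collected without the try/except block by using a CSS selector that matches only anchor tags carrying an href attribute inside <li> tags. This is an optional alternative sketch, not the approach the steps above follow:
Python3
# select only <a> tags that have an href attribute
# and are nested inside <li> tags
urls = [a.get('href') for a in soup.select('li a[href]')]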
Step 4: We print the output by iterating over the list of URLs.
Python3
# print all the urls stored in the urls list
for url in urls:
    print(url)
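Note that some of the extracted href values may be relative paths rather than full URLs. If absolute URLs are needed, a minimal sketch using urljoin from the standard library resolves each href against the page URL:
Python3
from urllib.parse import urljoin

# resolve relative hrefs against the page URL
absolute_urls = [urljoin(URL, url) for url in urls]
for url in absolute_urls:
    print(url)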
Complete code:
Python3
# Importing libraries
import requests
from bs4 import BeautifulSoup

# setting up the URL
URL = 'https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/'

# perform get request to the url
reqs = requests.get(URL)

# extract all the text that you received from
# the GET request
content = reqs.text

# convert the text to a beautiful soup object
soup = BeautifulSoup(content, 'html.parser')

# Empty list to store the output
urls = []

# For loop that iterates over all the <li> tags
for h in soup.findAll('li'):
    # looking for anchor tag inside the <li> tag
    a = h.find('a')
    try:
        # looking for href inside anchor tag
        if 'href' in a.attrs:
            # storing the value of href in a separate variable
            url = a.get('href')
            # appending the url to the output list
            urls.append(url)
    # if the <li> tag does not have an anchor tag (a is
    # None), we pass; a missing href is handled by the if
    except AttributeError:
        pass

# print all the urls stored in the urls list
for url in urls:
    print(url)
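One optional hardening step, not part of the original script: requests returns a response object even for failed requests, so the code above would silently parse an error page. Calling raise_for_status() immediately after the GET request makes HTTP errors surface as exceptions:
Python3
reqs = requests.get(URL)

# raise an exception for 4xx/5xx responses instead
# of parsing the error page
reqs.raise_for_status()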
Output: