BeautifulSoup – Scraping Link from HTML

By Calisto Chipfumbu

28 July 2024

0

Prerequisite: Implementing Web Scraping in Python with BeautifulSoup

In this article, we will understand how we can extract all the links from a URL or an HTML document using Python.

Libraries Required:

bs4 (BeautifulSoup): It is a library in python which makes it easy to scrape information from web pages, and helps in extracting the data from HTML and XML files. This library needs to be downloaded externally as it does not come readily with Python package. To install this library, type the following command in your terminal.

pip install bs4

requests: This library enables to send the HTTP requests and fetch the web page content very easily. This library also needs to be downloaded externally as it does not come readily with Python package. To install this library, type the following command in your terminal.

pip install requests

Steps to be followed:

Import the required libraries (bs4 and requests)
Create a function to get the HTML document from the URL using requests.get() method by passing URL to it.
Create a Parse Tree object i.e. soup object using of BeautifulSoup() method, passing it HTML document extracted above and Python built-in HTML parser.
Use the a tag to extract the links from the BeautifulSoup object.
Get the actual URLs from the form all anchor tag objects with get() method and passing href argument to it.
Moreover, you can get the title of the URLs with get() method and passing title argument to it.

Implementation:

Python3

# import necessary libraries
from bs4 import BeautifulSoup
import requests
import re
  
  
# function to extract html document from given url
def getHTMLdocument(url):
      
    # request for HTML document of given url
    response = requests.get(url)
      
    # response will be provided in JSON format
    return response.text
  
    
# assign required credentials
# assign URL
url_to_scrape = "https://practice.geeksforgeeks.org/courses/"
  
# create document
html_document = getHTMLdocument(url_to_scrape)
  
# create soap object
soup = BeautifulSoup(html_document, 'html.parser')
  
  
# find all the anchor tags with "href" 
# attribute starting with "https://"
for link in soup.find_all('a', 
                          attrs={'href': re.compile("^https://")}):
    # display the actual urls
    print(link.get('href'))  

Output:

BeautifulSoup – Scraping Link from HTML

Libraries Required:

Steps to be followed:

Implementation:

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

How to factory reset the Google Pixel 8a

The 2024 YouTube Music Recap could be here any day now

How to install Proton VPN on a Fire TV Stick

Google Messages can now show your profile exactly how it’s supposed to be

Recent Comments

EDITOR PICKS

How to factory reset the Google Pixel 8a

The 2024 YouTube Music Recap could be here any day now

How to install Proton VPN on a Fire TV Stick

POPULAR POSTS

How to factory reset the Google Pixel 8a

The 2024 YouTube Music Recap could be here any day now

How to install Proton VPN on a Fire TV Stick

POPULAR CATEGORY

ABOUT US

FOLLOW US