Link extraction is a very common task when dealing with HTML parsing. For any general web crawler, it is one of the most important functions to perform. Out of all the Python libraries available, lxml is one of the best to work with. As explained in this article, lxml provides a number of helper functions to extract links.
lxml installation –
It is a Python binding for the C libraries libxslt
and libxml2
. So, while exposing a Python API, it is a very fast HTML and XML parsing library. For it to work, those C libraries also need to be installed. For installation instructions, follow this link.
Command to install –
sudo apt-get install python-lxml or pip install lxml
What is lxml?
It is designed specifically for parsing HTML and therefore comes with an html module. An HTML string can be easily parsed with the fromstring()
function, which returns a single root element. Calling iterlinks() on that element yields all the links in the document.
The iterlinks()
method yields a tuple of four items for every link found –
element : the parsed node of the anchor tag from which the link is extracted. If only the link itself is of interest, this can be ignored.
attr : the attribute the link came from, usually ‘href’.
link : the actual URL extracted from the anchor tag.
pos : the numeric index of the anchor tag within the document.
Code #1 :

# importing library
from lxml import html

string_document = html.fromstring('hi <a href="/world">Lazyroar</a>')

# list of all the links in the document
link = list(string_document.iterlinks())

# Link length
print("Length of the link : ", len(link))
Output :
Length of the link : 1
Code #2 : Retrieving the iterlinks() tuple
# unpacking the tuple for the first link
(element, attribute, link, position) = link[0]

print("attribute : ", attribute)
print("\nlink : ", link)
print("\nposition : ", position)
Output :
attribute : 'href'

link : '/world'

position : 0
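Since only the URL part of each tuple is usually needed, the links can also be collected in a single pass with a list comprehension. The snippet below is a small sketch that reuses the string_document element parsed in Code #1.

# keep only the URL from each (element, attribute, link, pos) tuple
urls = [link for (element, attribute, link, pos) in string_document.iterlinks()]
print(urls)   # ['/world']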
Working –
An ElementTree is built up when lxml parses the HTML. An ElementTree is a tree structure with parent and child nodes. Each node in the tree represents an HTML tag and holds all of that tag's attributes. Once created, the tree can be iterated over to find elements, such as anchor or link tags. While the lxml.html module contains only the HTML-specific functions for creating and iterating a tree, the lxml.etree module
contains the core tree-handling code.
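As a minimal sketch of this tree structure (reusing the string_document element parsed above), iter() walks every node in the tree and getparent() climbs back up to a node's parent:

# walk the parsed tree and inspect each node's tag and its parent's tag
for node in string_document.iter():
    parent = node.getparent()
    print(node.tag, "-> parent:", parent.tag if parent is not None else None)

# iterate only over the anchor tags and read their attributes
for anchor in string_document.iter('a'):
    print(anchor.get('href'), anchor.text_content())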
HTML parsing from files –
Instead of using the fromstring()
function to parse the HTML, the parse()
function can be called with a filename or a URL – like html.parse('http://the/url')
or html.parse('/path/to/filename')
. The result is the same as if the URL or file were loaded into a string and fromstring()
were then called on it.
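A minimal sketch of this, assuming a local file named page.html exists (the filename is only a placeholder): parse() returns an ElementTree, so getroot() gives the root element on which iterlinks() can be called just as before.

from lxml import html

# parse() accepts a filename or a URL and returns an ElementTree
tree = html.parse('page.html')   # placeholder filename

# getroot() yields the same kind of element that fromstring() returns
root = tree.getroot()

for element, attribute, link, pos in root.iterlinks():
    print(attribute, link)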
Code #3 : ElementTree working
import requests
import lxml.html

# requesting url (the original URL is omitted here; any page works)
web_response = requests.get('https://www.example.com/')

# building the element tree from the response text
element_tree = lxml.html.fromstring(web_response.text)

tree_title_element = element_tree.xpath('//title')[0]

print("Tag title : ", tree_title_element.tag)
print("\nText title :", tree_title_element.text_content())
print("\nhtml title :", lxml.html.tostring(tree_title_element))
print("\ntitle tag:", tree_title_element.tag)
print("\nParent's tag title:", tree_title_element.getparent().tag)
Output :
Tag title :  title

Text title : Lazyroar | A computer science portal for Lazyroar

html title : b'Lazyroar | A computer science portal for Lazyroar \r\n'

title tag: title

Parent's tag title: head
Using requests to scrape –
requests is a Python library used to scrape websites. It requests a URL from the web server using the get()
method with the URL as a parameter and, in return, gives a Response object. This object holds details about the request and the response. To read the web content, the response.text
attribute is used. This is the content sent back by the web server for the request.
Code #4 : Requesting web server
import requests

# requesting url (placeholder; the original URL is omitted)
web_response = requests.get('https://www.example.com/')

print("Response from web server : \n", web_response.text)
Output :
It will generate a huge script, of which only a sample is added here.
Response from web server : 
 <!DOCTYPE html>
<!--[if IE 7]>
<html class="ie ie7" lang="en-US" prefix="og: http://ogp.me/ns#">
<![endif]-->
<!-->
<html lang="en-US" prefix="og: http://ogp.me/ns#" >
...
...
...
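Putting the pieces together, the response text from requests can be fed to fromstring() and the links pulled out with iterlinks(). This is only a sketch; the URL below is a placeholder for whatever page is being crawled.

import requests
import lxml.html

# fetch the page (placeholder URL) and parse the response body
web_response = requests.get('https://www.example.com/')
document = lxml.html.fromstring(web_response.text)

# print every link found in the page
for element, attribute, link, pos in document.iterlinks():
    print(attribute, ":", link)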