
Scrapy – Link Extractors

In this article, we are going to learn about Link Extractors in Scrapy. "LinkExtractor" is a class provided by Scrapy to extract links from the response we get while fetching a website. It is very easy to use, as we'll see in this post.


Using the "LinkExtractor" class of Scrapy, we can find all the links present on a webpage and fetch them very easily. First, we need to install the scrapy module (if not installed yet) by running the following command in the terminal:

pip install scrapy

Link Extractor class of Scrapy 

Scrapy provides the class "scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor" for extracting links from a response object. For convenience, Scrapy also exposes it as "scrapy.linkextractors.LinkExtractor".

First, we need to import the LinkExtractor. There are quite a few ways to import and use the LinkExtractor class, but one of them is the following:

from scrapy.linkextractors import LinkExtractor

Next, to use the "LinkExtractor" class, you need to create an object as given below:

link_ext = LinkExtractor(arguments) 
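The constructor also accepts optional filtering arguments such as "allow", "deny", "allow_domains" and "unique". A minimal sketch (the regex patterns and the domain below are placeholders, not part of this article's example):

# all arguments below are optional
link_ext = LinkExtractor(
    allow=r"/category/",            # regex the URL must match (placeholder)
    deny=r"/login",                 # regex the URL must not match (placeholder)
    allow_domains=["example.com"],  # restrict links to these domains (placeholder)
    unique=True,                    # drop duplicate links (the default)
)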

Now that we have created an object, we can fetch the links with the "extract_links" method of the LinkExtractor class:

links = link_ext.extract_links(response)

The "extract_links" method returns a list of objects of the type "scrapy.link.Link". The attributes of a Link object, illustrated in the snippet after this list, are:

  1. url : the absolute URL of the fetched link.
  2. text : the text used in the anchor tag of the link.
  3. fragment : the part of the URL after the hash (#) symbol.
  4. nofollow : whether the "rel" attribute of the anchor tag is set to "nofollow".
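For instance, each of these attributes can be read directly from a fetched Link object (a minimal sketch, assuming "links" was returned by "extract_links"):

for link in links:
    print(link.url)       # absolute URL of the link
    print(link.text)      # anchor text
    print(link.fragment)  # part of the URL after the "#" symbol
    print(link.nofollow)  # True if rel="nofollow" was set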

Stepwise Implementation

Step 1: Creating a spider

A spider is a class in Scrapy that is used to send requests to a particular website and parse the responses it gets back. The code for creating a spider is as follows:

# importing the LinkExtractor
import scrapy
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):

    # you can name the spider anything you want
    name = "MySpider"

    # urls to be fetched
    start_urls = []


Here we first imported the scrapy module along with the "LinkExtractor" class. Then we created a class named "MySpider" that inherits from the "scrapy.Spider" class.

Then we created a few class variables, namely "name" and "start_urls":

  • name : the name you want to give to the spider.
  • start_urls : all the URLs which need to be fetched are given here.

Then those "start_urls" are fetched, and the "parse" method is run on the response obtained from each of them, one by one. This is done automatically by Scrapy.
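For reference, this default behaviour is roughly equivalent to defining the following "start_requests" method yourself (a minimal sketch of what Scrapy does under the hood):

# roughly what scrapy.Spider does with start_urls by default
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse)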

Step 2: Creating the LinkExtractor object and yielding results

You can create an instance of the "LinkExtractor" class anywhere you want. Here, let us create the instance in the "parse" method itself.

# parse method
def parse(self, response):

    # creating the instance of the
    # LinkExtractor class
    link_extractor = LinkExtractor()

    # extracting links (returns a list of Link objects)
    links = link_extractor.extract_links(response)

    # yielding results
    for link in links:

        # each link has the attributes url, text,
        # fragment and nofollow
        yield {"url": link.url, "text": link.text}


Finally, the full code is:

import scrapy
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):
    name = "MySpider"

    # urls to be fetched
    start_urls = []

    def parse(self, response):
        link_extractor = LinkExtractor()
        links = link_extractor.extract_links(response)

        for link in links:

            # each link has the attributes url, text,
            # fragment and nofollow
            yield {"url": link.url, "text": link.text}


Step 3: Running the code

Now we can run the spider and store the results in a JSON file (or any other format supported by Scrapy):

scrapy runspider <python-file> -o <output-file-name> 
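Scrapy infers the output format from the file extension, so besides JSON we could just as well export to CSV, XML, or JSON Lines, for example (the file name "links" here is just a placeholder):

scrapy runspider <python-file> -o links.csv
scrapy runspider <python-file> -o links.xml
scrapy runspider <python-file> -o links.jl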

Link Extractors using Scrapy

Example 1:

Let us fetch all the links from the webpage https://quotes.toscrape.com/ and store the output in a JSON file named "quotes.json":

# scrapy_link_extractor.py
import scrapy
from scrapy.linkextractors import LinkExtractor


class QuoteSpider(scrapy.Spider):
    name = "QuoteSpider"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        link_extractor = LinkExtractor()
        links = link_extractor.extract_links(response)

        for link in links:
            yield {"url": link.url, "text": link.text}


To run the above code, we run the following command:

scrapy runspider scrapy_link_extractor.py -o quotes.json

Output:

[Screenshot: the links extracted from quotes.toscrape.com, stored in quotes.json]
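The exact entries depend on the page at crawl time, but the items in "quotes.json" will look roughly like this (illustrative values only):

[
    {"url": "https://quotes.toscrape.com/login", "text": "Login"},
    {"url": "https://quotes.toscrape.com/tag/love/", "text": "love"},
    ...
]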

Example 2:

This time, let us fetch all the links from the webpage https://www.geeksforgeeks.org/email-id-extractor-project-from-sites-in-scrapy-python/.

Let us create the instance of the "LinkExtractor" class in the constructor of our spider this time, and also yield the "nofollow" attribute of each link object. We will also set the "unique" parameter of "LinkExtractor" to True so that only unique links are fetched.

import scrapy
from scrapy.linkextractors import LinkExtractor


class GeeksForGeeksSpider(scrapy.Spider):
    name = "GeeksForGeeksSpider"
    start_urls = [
        "https://www.geeksforgeeks.org/email-id-extractor-project-from-sites-in-scrapy-python/"]

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)

        # unique=True makes the extractor drop duplicate links
        self.link_extractor = LinkExtractor(unique=True)

    def parse(self, response):
        links = self.link_extractor.extract_links(response)

        for link in links:
            yield {"nofollow": link.nofollow,
                   "url": link.url,
                   "text": link.text}


To run the above code, we run the following command in the terminal:

scrapy runspider scrapy_link_extractor.py -o neveropen.json

Output:

[Screenshot: the links extracted in Example 2, stored in neveropen.json]
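As a side note, in larger projects link extractors are most commonly used together with "CrawlSpider" rules, where Scrapy follows the extracted links automatically instead of us yielding them by hand. A minimal sketch (the spider name and the "parse_item" callback are placeholders for illustration):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MyCrawlSpider(CrawlSpider):
    name = "MyCrawlSpider"
    start_urls = ["https://quotes.toscrape.com/"]

    # follow every extracted link and pass each
    # fetched page to parse_item
    rules = (
        Rule(LinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        yield {"url": response.url}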
