In this article, we are going to learn about Link Extractors in Scrapy. “LinkExtractor” is a class provided by Scrapy to extract links from the response we get when fetching a website. Link extractors are very easy to use, as we will see in this post.
Scrapy – Link Extractors
Using the “LinkExtractor” class of Scrapy, we can find all the links present on a webpage and fetch them very easily. First, we need to install the scrapy module (if it is not installed yet) by running the following command in the terminal:
pip install scrapy
Link Extractor class of Scrapy
Scrapy provides the class “scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor” for extracting links from a response object. For convenience, Scrapy also exposes it as “scrapy.linkextractors.LinkExtractor”.
First, we need to import the LinkExtractor class. There are quite a few ways to import and use it, one of which is the following:
from scrapy.linkextractors import LinkExtractor
Next, we create an object of the “LinkExtractor” class as given below:
link_ext = LinkExtractor(arguments)
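All the constructor arguments are optional. As a quick illustration, here is a minimal sketch of a few commonly used ones (the regex patterns and the domain below are placeholders, not taken from this article):

from scrapy.linkextractors import LinkExtractor

link_ext = LinkExtractor(
    allow=r"/page/",                # keep only links whose URL matches this regex
    deny=r"/login",                 # drop links whose URL matches this regex
    allow_domains=["example.com"],  # placeholder: only extract links to these domains
    unique=True,                    # de-duplicate the extracted links (the default)
)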
Now that we have created an object, we can fetch the links with the “extract_links” method of the LinkExtractor class by running the code below:
links = link_ext.extract_links(response)
The links are returned as a list of objects of type “scrapy.link.Link”. The attributes of the Link object are listed below (a short access sketch follows the list):
- url : URL of the fetched link.
- text : the text used in the anchor tag of the link.
- fragment : the part of the URL after the hash (#) symbol.
- nofollow : tells whether the value of the “rel” attribute of the anchor tag is “nofollow” or not.
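As a quick sketch (assuming “links” holds the list returned by “extract_links” above), these attributes can be read directly from each Link object:

for link in links:
    print(link.url)       # full URL of the link
    print(link.text)      # anchor text of the <a> tag
    print(link.fragment)  # the part after the # symbol, if any
    print(link.nofollow)  # True if the anchor has rel="nofollow"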
Stepwise Implementation
Step 1: Creating a spider
A spider is basically a class in Scrapy which is used to send requests and get responses from a particular website. The code for creating a spider is as follows:
Python3
# importing the LinkExtractor
import scrapy
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):
    # you can name it anything you want
    name = "MySpider"

    # urls to be fetched
    start_urls = []
Here we first imported the scrapy module, along with the “LinkExtractor” class specifically. Then we created a class named “MySpider” that inherits from the “scrapy.Spider” class.
Then we created two class variables, namely “name” and “start_urls”.
- name : the name you want to give to the spider.
- start_urls: all the URLs which need to be fetched are given here.
Scrapy then fetches those “start_urls” automatically and runs the “parse” method on the response obtained from each of them, one by one.
Step 2: Creating the LinkExtractor object and Yielding results
You can create an instance of the “LinkExtractor” class anywhere you want; here, let us create one in the “parse” method itself.
Python3
# parse method
def parse(self, response):
    # creating an instance of the LinkExtractor class
    link_extractor = LinkExtractor()

    # extracting links (returns a list of Link objects)
    links = link_extractor.extract_links(response)

    # yielding results
    for link in links:
        # attributes of link: url, text, fragment, nofollow
        # example yield output
        yield {"url": link.url, "text": link.text}
Finally, the full code is:
Python3
import scrapy
from scrapy.linkextractors import LinkExtractor


class MySpider(scrapy.Spider):
    name = "MySpider"

    # urls to be fetched
    start_urls = []

    def parse(self, response):
        link_extractor = LinkExtractor()
        links = link_extractor.extract_links(response)

        for link in links:
            # attributes of link: url, text, fragment, nofollow
            yield {"url": link.url, "text": link.text}
Step 3: Running the code
Now we can run the spider and save the result to a “json” file (or any other format supported by Scrapy):
scrapy runspider <python-file> -o <output-file-name>
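The output format is inferred from the file extension; Scrapy’s feed exports support JSON, JSON lines, CSV, and XML, among others. For example (the file names here are placeholders):

scrapy runspider my_spider.py -o links.json
scrapy runspider my_spider.py -o links.csv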
Link Extractors using Scrapy
Example 1:
Let us fetch all the links from the webpage https://quotes.toscrape.com/ and store the output in a JSON file named “quotes.json”:
Python3
# scrapy_link_extractor.py
import scrapy
from scrapy.linkextractors import LinkExtractor


class QuoteSpider(scrapy.Spider):
    name = "QuoteSpider"

    # url to be fetched (from the example above)
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        link_extractor = LinkExtractor()
        links = link_extractor.extract_links(response)

        for link in links:
            yield {"url": link.url, "text": link.text}
To run the above code, we run the following command:
scrapy runspider scrapy_link_extractor.py -o quotes.json
Example 2:
This time, let us fetch all the links from the page https://www.geeksforgeeks.org/email-id-extractor-project-from-sites-in-scrapy-python/.
Here we create the instance of the “LinkExtractor” class in the constructor of our spider and also yield the “nofollow” attribute of the Link object. We also set the “unique” parameter of “LinkExtractor” to “True” so that only unique links are fetched.
Python3
import scrapy
from scrapy.linkextractors import LinkExtractor


class GeeksForGeeksSpider(scrapy.Spider):
    name = "GeeksForGeeksSpider"

    # url to be fetched
    start_urls = ["https://www.geeksforgeeks.org/email-id-extractor-project-from-sites-in-scrapy-python/"]

    def __init__(self, name=None, **kwargs):
        super().__init__(name, **kwargs)
        # unique=True de-duplicates the extracted links
        self.link_extractor = LinkExtractor(unique=True)

    def parse(self, response):
        links = self.link_extractor.extract_links(response)

        for link in links:
            yield {"nofollow": link.nofollow,
                   "url": link.url,
                   "text": link.text}
To run the above code, we run the following command in the terminal:
scrapy runspider scrapy_link_extractor.py -o neveropen.json