Scrapy-selenium is a Scrapy middleware used in web scraping. Scrapy on its own does not render JavaScript, so it cannot scrape modern sites built with JavaScript frameworks; this middleware is used with Scrapy to scrape those sites. Scrapy-selenium provides the functionality of Selenium for working with JavaScript-driven websites. Another advantage is the browser driver, through which we can see what is happening behind the scenes. Because Selenium is a browser automation tool, it also lets us fill in input fields and scrape the results that depend on what was entered. Scrapy-selenium was first released in 2018 and is open source. An alternative is scrapy-splash.
Install and Setup Scrapy –
Install Scrapy (for example, with pip install scrapy).
Run
scrapy startproject projectname (projectname is the name of your project)
Now, from inside the project folder, run
scrapy genspider spidername example.com
(replace spidername with your preferred spider name and example.com with the website that you want to scrape). Note: the URL can also be changed later, inside your Scrapy spider.
Integrating scrapy-selenium in scrapy project:
Install scrapy-selenium (for example, with pip install scrapy-selenium) and add its settings to your settings.py file.
In this project the Chrome driver is used. ChromeDriver must be downloaded for the version of your Chrome browser. To check the version, open the Help section in Chrome and click About Google Chrome. Then download the matching ChromeDriver from the ChromeDriver download site.
Where to add chromedriver:
Addition in settings.py file:
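A minimal sketch of the lines to add to settings.py, based on the settings named in the scrapy-selenium README and adapted to Chrome. It assumes chromedriver can be found on your PATH; if it sits somewhere else (for example in the project folder), replace the which() call with the literal path to the file. The "--headless" argument is an assumption for convenience: use an empty list if you want to watch the browser window while the spider runs.

```python
# settings.py (additions only; keep the settings Scrapy generated)
from shutil import which

# Browser that Selenium should drive; this project uses Chrome.
SELENIUM_DRIVER_NAME = "chrome"

# Location of the chromedriver executable. which() searches PATH and
# returns None when nothing is found, so check this value if the
# spider fails to start the browser.
SELENIUM_DRIVER_EXECUTABLE_PATH = which("chromedriver")

# Command-line arguments passed to the browser.
# "--headless" runs Chrome without opening a visible window.
SELENIUM_DRIVER_ARGUMENTS = ["--headless"]

# Enable the scrapy-selenium downloader middleware.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_selenium.SeleniumMiddleware": 800,
}
```

With these settings in place, requests yielded as SeleniumRequest are handled by the browser, while ordinary Scrapy requests are downloaded as usual.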
Change to be made in spider file:
To run project:
command- scrapy crawl spidername (scrapy crawl integratedspider in this project)
name- The variable that holds the name of the spider; each spider is identified by this name. The command to run the spider is scrapy crawl spidername (here spidername is the name defined in the spider).
function start_requests- The first requests to perform are obtained by calling the start_requests() method. Here it yields a SeleniumRequest for the URL given in the url field, with the parse method as the callback function for the request.
url- The URL of the site to scrape.
screenshot- When set to True, Selenium takes a screenshot of the page and the PNG image data is added to the response meta (response.meta['screenshot']). A screenshot can also be saved straight to a file in the project with the driver's get_screenshot_as_file() method, which takes the filename as its parameter.
callback- The function that will be called with the response of this request as its first parameter.
dont_filter- Indicates that this request should not be filtered by the scheduler. If the same URL is requested again, it is not dropped as a duplicate, so the same URL can be accessed more than once. The default value is False.
wait_time- Scrapy itself does not wait for a page's JavaScript to finish loading. This field sets the number of seconds Selenium may wait (an explicit wait, used together with the wait_until condition) before the response is returned to the callback.
General structure of scrapy-selenium spider:
Python3
import scrapy
from scrapy_selenium import SeleniumRequest


class IntegratedspiderSpider(scrapy.Spider):
    name = 'integratedspider'

    def start_requests(self):
        yield SeleniumRequest(
            url="https://www.geeksforgeeks.org/",
            wait_time=3,
            screenshot=True,
            callback=self.parse,
            dont_filter=True
        )

    def parse(self, response):
        pass
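The parse method in the structure above is left empty. As a rough sketch of what it might do with screenshot set to True, the helper below writes the PNG bytes that scrapy-selenium stores under response.meta['screenshot'] to a file. The function name save_screenshot and the default filename are hypothetical, chosen only for illustration; the helper relies on nothing but the response having a meta dictionary.

```python
from pathlib import Path


def save_screenshot(response, filename="screenshot.png"):
    """Write the screenshot stored in response.meta to a PNG file.

    Returns the path that was written, or None when the response
    carries no screenshot (for example, if screenshot was not True).
    """
    png_bytes = response.meta.get("screenshot")
    if png_bytes is None:
        return None
    path = Path(filename)
    path.write_bytes(png_bytes)
    return path
```

Inside the spider it could be called from the callback, for example save_screenshot(response, "homepage.png") as the first line of parse, which would save the image in the folder the spider is run from.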
Project of Scraping with scrapy-selenium: scraping online course names from the GeeksforGeeks site using scrapy-selenium
Getting the XPath of the element we need to scrape –
Code to scrape Courses Data from GeeksforGeeks –