Scrapy is an open-source web crawling framework written in Python. It provides a strong and robust framework that can easily extract information from web pages with the help of selectors based on XPath.
We can define the behavior of Scrapy components with the help of Scrapy settings. Pipelines and the settings file are very important for Scrapy; they are the core of automating the task, and these rules help with inserting data into the database. These files are included when we start from the base template. The Scrapy settings allow you to customize the behavior of all Scrapy components, including the core, extensions, pipelines, and the spiders themselves.
We are often presented with a situation where we need to define multiple scraper projects; in that case, we can configure each individual project with the help of Scrapy settings. For this, the environment variable SCRAPY_SETTINGS_MODULE should be used, and its value should be in Python path syntax. Hence, with the help of the Scrapy settings, the mechanism for choosing the currently active Scrapy project can be specified.
The settings infrastructure provides a global namespace of key-value mappings that the code can use to pull configuration values from. The settings can be populated through different mechanisms, which are described below.
Use the following command to create the Scrapy project template:
scrapy startproject <project_name>
This generates the base outline of the Scrapy project, sketched below.
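For reference, the generated project typically has a layout along these lines (with <project_name> being whatever name you passed to startproject):

<project_name>/
    scrapy.cfg            # deploy configuration file
    <project_name>/       # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings (the focus of this article)
        spiders/
            __init__.py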
In this article, we will be focusing on the settings.py file.
The settings.py file is provided with our default settings; it looks something like the sketch below.
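As a minimal sketch (assuming the project was named 'tutorial'), the uncommented part of the generated file looks roughly like this; the real template also ships with many commented-out options:

# settings.py (trimmed sketch)
BOT_NAME = 'tutorial'

SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True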
The most commonly used settings and their descriptions are given below:
Important Scrapy Settings
- BOT_NAME
It is the name of the project. The bot symbolizes the automation that we are doing with the help of the scraper. It defaults to 'scrapybot'. As seen in the generated settings.py, it is also automatically set to your project name when you start the project.
- USER_AGENT
The User-Agent helps us with identification. It basically tells the servers and network peers "who you are". It helps identify the application, OS, vendor, and/or version of the requesting user agent. While crawling, it defaults to "Scrapy/VERSION (+https://scrapy.org)" unless explicitly specified.
The common format for browsers:
User-Agent: <browser>/<version> (<system-info>) <platform> (<platform-details>) <extensions>
For Example:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
- ROBOTSTXT_OBEY
A robots.txt file basically tells search-engine crawlers which pages they may request from the site. ROBOTSTXT_OBEY defaults to 'False', although the project template sets it to 'True'. It is mostly kept enabled, so our Scrapy spiders will respect the robots.txt policies of the website.
The robots.txt file lists the crawling policies of a site, and whether Scrapy honors them is governed by the ROBOTSTXT_OBEY setting.
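As an illustration (the paths below are made up), a typical robots.txt contains rules such as:

User-agent: *
Disallow: /admin/
Disallow: /search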
- CONCURRENT_REQUESTS
It is the maximum number of concurrent (i.e. simultaneous) requests that the Scrapy downloader will perform. It defaults to 16.
More requests increase the load on the server, so keeping it as low as 16 or 32 is a good value.
- CONCURRENT_ITEMS
It is the maximum number of concurrent items that Scrapy will process in parallel per response while scraping the data. It defaults to 100, which is again a good value.
custom_settings = {
    'CONCURRENT_REQUESTS': 30,
    'CONCURRENT_ITEMS': 80,
}
- CONCURRENT_REQUESTS_PER_DOMAIN
It is the maximum number of requests that can be performed concurrently for any single domain while scraping the data. It defaults to the value '8'.
- CONCURRENT_REQUESTS_PER_IP
It is the maximum number of requests that can be performed concurrently for any single IP address while scraping the data. It defaults to the value '0'; when it is non-zero, it is used instead of the per-domain limit.
custom_settings = {
    'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
    'CONCURRENT_REQUESTS_PER_IP': 2,
}
- DOWNLOAD_DELAY
It is the amount of time the downloader waits before downloading consecutive pages from the same website. This again is used to limit the load on the server where the website is hosted. It defaults to 0.
For Example:
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
- DOWNLOAD_TIMEOUT
It is the timeout: it tells Scrapy how long (in seconds) the downloader should wait before it times out. It defaults to 180.
- LOG_ENABLED
It is used to enable or disable logging for the scraper. It defaults to 'True'.
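Both of these can be overridden in settings.py; the timeout value below is just an illustrative assumption:

DOWNLOAD_TIMEOUT = 60   # wait at most 60 seconds per download instead of the default 180
LOG_ENABLED = True      # keep Scrapy's logging switched on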
- FTP_PASSWORD
Used to set a password for FTP connections. The value is used only when there is no 'ftp_password' in the Request meta. It defaults to 'guest'.
- FTP_USER
Used to set a username for FTP connections. The value is used only when there is no 'ftp_user' in the Request meta. It defaults to 'anonymous'.
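A minimal sketch of how the two interact, assuming a made-up FTP host, hypothetical spider name, and made-up credentials: the per-request values in the Request meta take precedence, and the FTP_USER / FTP_PASSWORD settings act as the fallback.

import scrapy

class FTPSpider(scrapy.Spider):
    name = 'ftp_example'             # hypothetical spider
    custom_settings = {
        'FTP_USER': 'anonymous',     # fallback username
        'FTP_PASSWORD': 'guest',     # fallback password
    }

    def start_requests(self):
        # Credentials in meta override the settings above for this request only.
        yield scrapy.Request(
            'ftp://ftp.example.com/pub/data.csv',
            meta={'ftp_user': 'reader', 'ftp_password': 's3cret'},
        )

    def parse(self, response):
        self.logger.info('Downloaded %d bytes', len(response.body))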
- DEFAULT_ITEM_CLASS
This setting specifies the class used to represent items within Scrapy; scraped values are stored in instances of the class given by DEFAULT_ITEM_CLASS. The default is 'scrapy.item.Item'.
- DEFAULT_REQUEST_HEADERS
This setting lists the default headers used for HTTP requests made by Scrapy. They are populated within the DefaultHeadersMiddleware.
The default header value is given by:
{
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
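To change it, redefine the dictionary in settings.py; the Accept-Language value below is only an example:

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.9',  # example: prefer British English
}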
- REACTOR_THREADPOOL_MAXSIZE
The reactor thread pool can also be configured within Scrapy. It caps the maximum size of the reactor thread pool used by the spider. Its default size is 10.
For example, the settings could be applied within the code like the following Python code:
import scrapy

class exampleSpider(scrapy.Spider):
    name = 'example'
    custom_settings = {
        'CONCURRENT_REQUESTS': 25,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 100,
        'DOWNLOAD_DELAY': 0
    }

    # Read the start URLs from a file named "example"
    f = open("example")
    start_urls = [url.strip() for url in f.readlines()]
    f.close()

    def parse(self, response):
        # <class_component> is a placeholder for the actual class attribute on the target site
        for item in response.xpath("//div[@class=<class_component>]"):
            urlgem = item.xpath(".//div[@class=<class_component>]/a/@href").extract()
- AWS_ACCESS_KEY_ID
With this, you can set the AWS access key ID within Scrapy; it is used to access Amazon Web Services. It defaults to None.
- AWS_SECRET_ACCESS_KEY
With this, you can set the AWS secret access key (the password-like credential) within Scrapy; it is used together with the key ID to access Amazon Web Services. It defaults to None.
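Both would be set in settings.py as plain strings, for example when exporting feeds to S3; the values below are placeholders, not real credentials:

AWS_ACCESS_KEY_ID = 'AKIAXXXXXXXXXXXXXXXX'              # placeholder key ID
AWS_SECRET_ACCESS_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx'  # placeholder secret key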
- DEPTH_LIMIT
The limiting depth to which the spider will crawl a target site. It defaults to 0, which means no limit is imposed.
- DEPTH_PRIORITY
It further manages the priority of requests according to their depth while crawling a target site. It also defaults to 0.
This is a basic layout of the selector graph inside Scrapy. Components can be built inside this selector graph, and each component is responsible for scraping individual items from the site.
- DEPTH_STATS
With this setting, we can also collect depth stats in the logs for the levels crawled. If the setting is enabled, the number of requests made at each depth is collected in the stats. Its default is 'True'.
- DEPTH_STATS_VERBOSE
It further improves DEPTH_STATS by collecting verbose stats, i.e. the number of requests recorded separately for each individual depth. By default, it is 'False'.
Selector levels can extend to an infinite depth as structured by the webmaster. With the various depth settings, it is our duty to limit the selector graph within our crawler.
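A sketch of how the depth settings might be combined in a spider's custom_settings; the numbers are illustrative assumptions:

custom_settings = {
    'DEPTH_LIMIT': 3,             # do not follow links more than 3 levels deep
    'DEPTH_PRIORITY': 1,          # a positive value favours processing shallower requests first
    'DEPTH_STATS_VERBOSE': True,  # record the request count for every depth in the stats
}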
- DNSCACHE_ENABLED
With this setting, we can enable an in-memory DNS cache. By default, it is 'True'.
- DNSCACHE_SIZE
With this setting, we can define the size of the in-memory DNS cache. Its default value is 10000.
- DNS_TIMEOUT
It is the timeout (in seconds) for processing DNS queries. It defaults to 60.
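Together, the three DNS settings would appear in settings.py as follows (the values simply restate the defaults):

DNSCACHE_ENABLED = True   # keep resolved hostnames in an in-memory cache
DNSCACHE_SIZE = 10000     # maximum number of cached hostnames
DNS_TIMEOUT = 60          # seconds to wait for a DNS lookup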
- DOWNLOADER
The actual downloader used by the crawler. The default is 'scrapy.core.downloader.Downloader'.
- DOWNLOADER_MIDDLEWARES
This dictionary holds the downloader middlewares enabled for the project and their orders. It is empty by default.
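A hedged sketch of enabling one; 'tutorial.middlewares.CustomHeadersMiddleware' is a hypothetical class you would write yourself in middlewares.py:

DOWNLOADER_MIDDLEWARES = {
    # hypothetical middleware; lower order values sit closer to the engine
    'tutorial.middlewares.CustomHeadersMiddleware': 543,
}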
- EXTENSIONS_BASE
The dictionary with the built-in extensions and their orders. Its default value includes, among other entries: { 'scrapy.extensions.corestats.CoreStats': 0, }
- FEED_TEMPDIR
This setting defines a custom folder in which the crawler's temporary files are stored.
- ITEM_PIPELINES
We can define a dictionary containing the item pipelines to use and their orders; this represents the pipelines that every scraped item passes through. It defaults to an empty dictionary.
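A sketch of enabling a pipeline, assuming the project is named 'tutorial' and uses the TutorialPipeline class that the default template generates; 300 is a conventional order value, and lower values run first:

ITEM_PIPELINES = {
    'tutorial.pipelines.TutorialPipeline': 300,  # items flow through pipelines in ascending order
}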
- LOG_STDOUT
With this setting, if it is set to true, all of the standard output of your process will be redirected to the log. Its default value is False.
Setting up the values
It is advisable to put these values manually inside the settings.py file. Still, there is also an option to modify these values from the command line.
For Example:
If you want to generate a Scrapy log file, use the following command.
scrapy crawl myspider -s LOG_FILE=scrapy.log
Conclusion: settings.py is the most important file in a Scrapy project; it is through this file that you can customize the behavior of all Scrapy components.