
Scrapy – Settings

Scrapy is an open-source web crawling framework written in Python. It presents us with a strong and robust framework that can easily extract information from web pages with the help of selectors based on XPath.

We can define the behavior of Scrapy components with the help of Scrapy settings. Pipelines and the settings file are very important for Scrapy; they are the core of automating tasks. These rules help with inserting data into the database. These files are included when we start from the base template. The Scrapy settings allow you to customize the behavior of all Scrapy components, including the core, extensions, pipelines, and spiders themselves.

We are often presented with a situation where we need to define multiple scraper projects. In that case, we can specify an individual project with the help of Scrapy settings. For this, the environment variable SCRAPY_SETTINGS_MODULE should be used, and its value should be in Python path syntax. Hence, with the help of Scrapy settings, the mechanism for choosing the currently active Scrapy project can be specified.

The settings infrastructure provides a global namespace of key-value mappings that the code can use to pull configuration values from. The settings can be populated through different mechanisms, which are described below.

Use the following command to create the Scrapy template folder:

scrapy startproject <project_name>

This is the base outline of the Scrapy project.
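For reference, the command generates a project skeleton similar to the one below (assuming the project name myproject; exact files may vary slightly between Scrapy versions):

myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # project's Python module
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # folder where the spiders live
            __init__.py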

With this article, we would be focusing on the settings.py file. 

The settings.py file looks something like this; it is provided as our default settings.
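A trimmed sketch of the generated file is shown below (again assuming the project name myproject; the real file also contains many commented-out settings):

# Scrapy settings for myproject project
BOT_NAME = 'myproject'

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True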

The most commonly used settings and their descriptions are given below:

Important Scrapy Settings

  • BOT_NAME

It is the name of the project. The bot symbolizes the automation that we are doing with the help of the scraper. It defaults to ‘scrapybot’. Also, as seen in the screenshot, it is automatically set to your project name when you start the project.
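For Example (assuming the project is named myproject):

BOT_NAME = 'myproject'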

  • USER_AGENT

User-Agent helps us with the identification. It basically tells “who you are” to the servers and network peers. It helps with the identification of the application, OS, vendor, and/or version of the requesting user agent. It defaults to “Scrapy/VERSION (+https://scrapy.org)” while crawling unless explicitly specified. 

The common format for browsers:

User-Agent: <browser>/<version> (<system-info>) <platform> (<platform-details>) <extensions>

For Example:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
  • ROBOTSTXT_OBEY

A robots.txt file tells search engine crawlers which pages they may request from the site. ROBOTSTXT_OBEY defaults to “False”. It is mostly kept enabled, so our Scrapy will respect the robots.txt policies of the website.

The image shows the contents of a robots.txt file; the policies written there are respected when the ROBOTSTXT_OBEY setting is enabled.
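For Example:

# Respect the target site's robots.txt rules
ROBOTSTXT_OBEY = True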

  • CONCURRENT_REQUESTS

It is the maximum number of concurrent requests that the Scrapy downloader will perform. It defaults to 16.

More requests increase the load on the server, so keeping this as low as 16 or 32 is a good value.

  • CONCURRENT_ITEMS

It is the maximum number of concurrent items that Scrapy will process in parallel per response while scraping the data. It defaults to 100, which is again a good value.

custom_settings = {
   'CONCURRENT_REQUESTS': 30,
   'CONCURRENT_ITEMS': 80,
}
  • CONCURRENT_REQUESTS_PER_DOMAIN

It is the maximum number of concurrent requests that can be performed for any single domain while scraping the data. It defaults to the value ‘8’.

  • CONCURRENT_REQUESTS_PER_IP

It is the maximum number of concurrent requests that can be performed for any single IP address while scraping the data. It defaults to the value ‘0’; if it is set to a non-zero value, it is used instead of CONCURRENT_REQUESTS_PER_DOMAIN.

custom_settings = {
    'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
    'CONCURRENT_REQUESTS_PER_IP': 2
}
  • DOWNLOAD_DELAY

It is the amount of time the downloader waits before downloading consecutive pages from the same website. This is again used to limit the load on the server where the website is hosted. It defaults to 0.

For Example:

DOWNLOAD_DELAY = 0.25    # 250 ms of delay
  • DOWNLOAD_TIMEOUT

It is the timeout: it tells Scrapy how long (in seconds) the downloader should wait before timing out. It defaults to 180.
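For Example (the value 60 is just an illustration):

DOWNLOAD_TIMEOUT = 60    # give up on a request after 60 seconds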

  • LOG_ENABLED

It is used to enable or disable logging for the scraper. It defaults to “True”.

  • FTP_PASSWORD

Used to set a password for the FTP connections. The value is used only when there is no “ftp_password” in Request meta. It defaults to “guest”.

  • FTP_USER

Used to set a username for the FTP connections. The value is used only when there is no “ftp_user” in Request meta. It defaults to “anonymous”.
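Both values can be overridden in settings.py; the credentials below are placeholders, not real ones:

FTP_USER = 'ftpuser'          # placeholder username
FTP_PASSWORD = 'ftppassword'  # placeholder password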

  • DEFAULT_ITEM_CLASS

This setting specifies the class used to represent items within Scrapy; scraped values are stored in the class format specified by DEFAULT_ITEM_CLASS. The default is given by ‘scrapy.item.Item’.

  • DEFAULT_REQUEST_HEADERS

The given setting lists the default headers used for HTTP requests made by Scrapy. They are populated through the DefaultHeadersMiddleware.

The default header value is given by:

{
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}
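To override these defaults, redefine the setting in settings.py; the extra Referer header below is only an illustration:

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Referer': 'https://www.google.com',  # illustrative extra header
}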
  • REACTOR_THREADPOOL_MAXSIZE

The maximum size of the Twisted reactor thread pool used by the crawler can also be set within Scrapy. Its default size is 10.

For example, the settings could be applied per spider as in the following Python code:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    custom_settings = {
        'CONCURRENT_REQUESTS': 25,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 100,
        'DOWNLOAD_DELAY': 0
    }

    # Read the start URLs from a file named "example", one URL per line
    with open("example") as f:
        start_urls = [url.strip() for url in f.readlines()]

    def parse(self, response):
        # <class_component> is a placeholder for the actual class attribute
        for item in response.xpath("//div[@class='<class_component>']"):
            urlgem = item.xpath(".//div[@class='<class_component>']/a/@href").extract()
            yield {'url': urlgem}
  • AWS_ACCESS_KEY_ID

With this, you can set the AWS Access Key ID within Scrapy; it is used to access Amazon Web Services. It defaults to the “None” value.

  • AWS_SECRET_ACCESS_KEY

With this, you can set the AWS Secret Access Key (the password credential) within Scrapy; it is used to access Amazon Web Services. It defaults to the “None” value.

  • DEPTH_LIMIT

The limiting depth for the spider to crawl a target site. It defaults to 0.

  • DEPTH_PRIORITY

It manages the priority according to depth while crawling a target site: a positive value favours breadth-first order (deeper requests get a lower priority), while a negative value favours depth-first order. It also defaults to 0.

This is a basic layout of the selector graph inside Scrapy. The components can be built inside this selector graph. Each component is responsible for scraping individual items from the site.

  • DEPTH_STATS

With this setting, we can also collect depth stats in the logs for the levels crawled. If the setting is enabled, then the number of requests at each individual depth is collected in the stats. Its default is “True”.

  • DEPTH_STATS_VERBOSE

Further improves DEPTH_STATS by collecting the number of requests in the stats for each individual depth.

By default, it is “False”.

Selector levels can extend up to infinite depth as structured by the webmaster. With the various depth settings, it’s our duty to limit the Selector Graph within our crawler.
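Putting the depth settings together, a possible configuration in settings.py could look like this (the values are illustrative):

DEPTH_LIMIT = 3             # do not follow links more than 3 levels deep
DEPTH_PRIORITY = 1          # positive value favours breadth-first crawling
DEPTH_STATS_VERBOSE = True  # collect request counts per depth in the stats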

  • DNSCACHE_ENABLED

With this setting, we can enable the DNS in-memory cache. By default, it is “True”.

  • DNSCACHE_SIZE

With this setting, we could define the size of the DNS in-memory cache. Its default value is 10000.

  • DNS_TIMEOUT

It is the timeout (in seconds) for processing DNS queries. It defaults to 60.
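The three DNS settings above can be tuned together in settings.py (the values are illustrative):

DNSCACHE_ENABLED = True   # keep resolved addresses in an in-memory cache
DNSCACHE_SIZE = 20000     # cache up to 20000 entries
DNS_TIMEOUT = 30          # fail DNS lookups that take longer than 30 seconds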

  • DOWNLOADER

The actual downloader used by the crawler. The default format is given by ‘scrapy.core.downloader.Downloader’.

  • DOWNLOADER_MIDDLEWARES

This dictionary holds the downloader middlewares enabled in the project and their orders. It is empty by default.
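A minimal sketch of enabling a custom middleware (myproject.middlewares.CustomDownloaderMiddleware is a hypothetical class, and 543 is just an illustrative order value):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,  # hypothetical middleware
}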

  • EXTENSIONS_BASE

The dictionary with the built-in extensions and their orders. It defaults to the value: { ‘scrapy.extensions.corestats.CoreStats’: 0, }

  • FEED_TEMPDIR

This directory setting defines the custom folder in which the crawler stores temporary files.

  • ITEM_PIPELINES

In this dictionary we define the item pipelines to use; it maps each pipeline class to the order in which it is applied to scraped items. It is empty by default.
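A minimal sketch of enabling a pipeline (myproject.pipelines.MyItemPipeline is a hypothetical class; 300 is an illustrative order between 0 and 1000):

ITEM_PIPELINES = {
    'myproject.pipelines.MyItemPipeline': 300,  # hypothetical pipeline class
}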

  • LOG_STDOUT

With this setting, if set to True, all the standard output of the process will appear in the log file. Its default value is False.

Setting up the values

It is advisable to set these values manually inside the settings.py file. Still, there is also an option to modify them from the command line.

For Example: 

If you want to generate a scrapy log file use the following command.

scrapy crawl myspider -s LOG_FILE=scrapy.log
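Multiple settings can be overridden in the same way by repeating the -s flag (myspider is a hypothetical spider name):

scrapy crawl myspider -s DOWNLOAD_DELAY=2 -s CONCURRENT_REQUESTS=8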

Conclusion: settings.py is the most important file of a Scrapy project. Only with this file can you customize the behaviour of all Scrapy components.
