Automated Website Scraping using Scrapy

23 July 2024

Scrapy is a Python framework for web scraping on a large scale. It provides the tools we need to extract data from websites efficiently, process it as we see fit, and store it in the structure and format we prefer. It is currently maintained by Zyte (formerly Scrapinghub), a web scraping development and services company.

What is a Web Crawler (Web Spider)?

A web crawler is an indexing bot that collects information and related links from websites. The goal of such a bot is to learn what each webpage on the internet is about so that the page can be retrieved when needed.

These bots are almost always operated by search engines. By applying a search algorithm to the data the crawlers collect, a search engine can return relevant links in response to a query. This is how the list of webpages is generated after a user types a search into Google or another search engine.

Examples: Amazonbot is Amazon’s web crawler, Bingbot is Microsoft’s crawler for the Bing search engine, and Googlebot is the crawler for Google’s search engine.

Requirements for starting the Web Scraping

We must first install Scrapy on our system. Open a terminal window in VS Code or a command prompt and run the command below.

pip install scrapy
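To confirm that the installation worked, a quick check from Python can help. The sketch below uses only the standard library; the helper name installed_version is ours for illustration and is not part of Scrapy.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("scrapy")
    if version is None:
        print("Scrapy is not installed; run: pip install scrapy")
    else:
        print(f"Scrapy {version} is installed")
```

If the script reports that Scrapy is missing, check that pip installed into the same Python environment that runs the script.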

Step-by-step implementation of the Code

Step 1: To start the project, run the command below in a terminal, replacing <projectname> with the name of your project.

scrapy startproject <projectname>

Scrapy creates the project folder and its directories automatically.

One of them is the spiders directory, where we will put our spiders. Now let’s create a spider: in the spiders directory, create a file named “spider1.py”. We will write our code in this file.
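To see what was generated, the project folder can be listed with a few lines of Python. This is a minimal sketch using only the standard library; render_tree is a hypothetical helper name. With a recent Scrapy release the listing typically shows scrapy.cfg plus a package containing items.py, middlewares.py, pipelines.py, settings.py and the spiders folder, though the exact files may vary by version.

```python
from pathlib import Path
from typing import List


def render_tree(root: Path, indent: str = "") -> List[str]:
    """Return one line per file or directory under `root`, indented by depth."""
    lines = []
    # Sort entries by name so the listing is stable between runs.
    for entry in sorted(root.iterdir(), key=lambda p: p.name):
        if entry.name == "__pycache__":
            continue  # skip bytecode caches
        suffix = "/" if entry.is_dir() else ""
        lines.append(f"{indent}{entry.name}{suffix}")
        if entry.is_dir():
            lines.extend(render_tree(entry, indent + "    "))
    return lines


if __name__ == "__main__":
    # "." assumes the script is run from inside the generated project folder.
    print("\n".join(render_tree(Path("."))))
```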

Step 2: After successfully installing the module, import it.

Python

# import scrapy module
import scrapy

Step 3: Create a class create_spider that subclasses scrapy.Spider.

Give the spider the name “spid”. Scrapy uses this name to identify the spider when we run it.

The start_requests method loops through the list of URLs and yields a Request object for each one. Each Request passes through the system until it reaches the Downloader, which executes the request and returns a Response object. The Response then travels back to the spider that issued the request. Here the callback that receives the Response is parse.

Python

# Spider class
class create_spider(scrapy.Spider):

    # Name used to refer to this spider, e.g. in "scrapy crawl spid"
    name = "spid"

    # Generate the initial requests; each response is passed to parse
    def start_requests(self):
        urls = [
            'http://scanme.nmap.org/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
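The request, download, response, callback cycle described above can be pictured with a toy model. The sketch below is not Scrapy code: ToyRequest, ToyResponse, ToySpider, fake_download and run are all hypothetical stand-ins, nothing touches the network, and the real framework schedules requests asynchronously rather than in a simple loop.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass
class ToyResponse:
    url: str
    body: bytes


@dataclass
class ToyRequest:
    url: str
    callback: Callable[[ToyResponse], None]


class ToySpider:
    """Mimics the shape of a spider: yield requests, receive responses."""

    def __init__(self) -> None:
        self.seen: List[str] = []

    def start_requests(self) -> Iterator[ToyRequest]:
        for url in ["http://scanme.nmap.org/"]:
            yield ToyRequest(url=url, callback=self.parse)

    def parse(self, response: ToyResponse) -> None:
        self.seen.append(f"{response.url} -> {len(response.body)} bytes")


def fake_download(request: ToyRequest) -> ToyResponse:
    """Stand-in for the Downloader: returns canned bytes, no network access."""
    return ToyResponse(url=request.url, body=b"<html>stub</html>")


def run(spider: ToySpider) -> None:
    """Stand-in for the engine: send each request to the downloader, then
    hand the response back to the callback named in the request."""
    for request in spider.start_requests():
        response = fake_download(request)
        request.callback(response)


if __name__ == "__main__":
    spider = ToySpider()
    run(spider)
    print(spider.seen)
```

The point to take away is that the spider never calls parse itself; it only names the callback, and the framework invokes it once a response is available.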

Step 4: The parse method receives the Response object, splits the URL to build a filename, and writes the body of the response to an HTML file.

Python

    # Parse method: receives the response object
    def parse(self, response):

        # Take the second-to-last piece of the URL split on "/"
        page = response.url.split("/")[-2]

        # Build a filename with an .html extension
        filename = f'spdr-{page}.html'

        # Open the file in binary mode and write the response body;
        # the with block closes (and so saves) the file on exit
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
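It may help to see what the split produces for our URL. The sketch below repeats the same two lines outside of Scrapy; filename_for is a hypothetical helper name, and the example.com URL used in the note afterwards is only an illustration.

```python
def filename_for(url: str) -> str:
    """Build the output filename the same way the parse method does."""
    page = url.split("/")[-2]
    return f"spdr-{page}.html"


if __name__ == "__main__":
    url = "http://scanme.nmap.org/"
    # The pieces are: 'http:', '', 'scanme.nmap.org', ''
    print(url.split("/"))
    # Second-to-last piece is the host, so this prints spdr-scanme.nmap.org.html
    print(filename_for(url))
```

Note that the result depends on the trailing slash. Without it, the second-to-last piece of "http://scanme.nmap.org" is the empty string between the two slashes, and the filename becomes spdr-.html. For a deeper URL such as http://example.com/a/b/ the piece picked is b.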

Combining the code above:

Python

import scrapy


# Spider class
class create_spider(scrapy.Spider):

    # Name used to refer to this spider, e.g. in "scrapy crawl spid"
    name = "spid"

    # Generate the initial requests; each response is passed to parse
    def start_requests(self):
        urls = [
            'http://scanme.nmap.org/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Parse method: receives the response object
    def parse(self, response):

        # Take the second-to-last piece of the URL split on "/"
        page = response.url.split("/")[-2]

        # Build a filename with an .html extension
        filename = f'spdr-{page}.html'

        # Open the file in binary mode and write the response body;
        # the with block closes (and so saves) the file on exit
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

Step 5: To start the spider, open a terminal and run the following commands. The first moves into the project directory; the second runs the spider.

cd <project_directory_name>
scrapy crawl <name>

Here <name> is the value of the name variable in the spider1.py file, which in our case is spid:

scrapy crawl spid

A new HTML file, spdr-scanme.nmap.org.html for our URL, will be created automatically in the directory where the command was run.

Extracting the Information – To extract information from the webpage we can use CSS selectors or XPath expressions.

Step 6: Run the command below to start the Scrapy shell, an interactive Python shell in which we can try out selectors against a fetched page.

scrapy shell <target url>

In this article, our target URL is “http://scanme.nmap.org/”.

We will use CSS selectors. Inspect the page’s HTML and look for selectors that match the information we want.

For example, our target website contains a <main></main> element with the id content. We will use this selector to extract the content of that element.

To do this, type the following command in the Scrapy shell.

response.css('#content').extract()

Output:

Similarly, we can use other element selectors.

Other selector examples:

a::attr(href): This extracts the value of the href attribute, that is, the link target, of each <a></a> tag.

response.css('a::attr(href)').getall()
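For readers who want to see what this selector gathers without opening the Scrapy shell, here is a rough standard-library approximation. It is not how Scrapy implements selectors; HrefCollector, collect_hrefs and the SAMPLE_HTML snippet are made up for illustration and do not reproduce the target page.

```python
from html.parser import HTMLParser
from typing import List


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def collect_hrefs(html: str) -> List[str]:
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Hypothetical markup, used only to exercise the helper.
SAMPLE_HTML = """
<main id="content">
  <p>See the <a href="https://nmap.org/">Nmap project</a>.</p>
  <a name="top">no link here</a>
  <a href="/docs.html">Docs</a>
</main>
"""

if __name__ == "__main__":
    print(collect_hrefs(SAMPLE_HTML))
```

As with the selector, an <a> tag without an href attribute contributes nothing, and relative links are returned as written rather than resolved against the page URL.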

We may use XPath in place of CSS selectors.

Example:

response.selector.xpath('//p/text()').getall()

Out[10]:

['Hello, and welcome to Scanme.Nmap.Org, a service provided by the ',
'.\n\n',
"We set up this machine to help folks learn about Nmap and also to test and make sure that their Nmap installation (or Internet connection) is working properly. We are authorized to scan this machine with Nmap or other port scanners. Try not to hammer on the server too hard. A few scans in a day is fine, but don't scan 100 times a day or use this site to test our ssh brute-force password cracking tool.\n\n",
'Thanks',
'\n-',
'\n']
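The output arrives in fragments because text() selects only the text nodes that sit directly inside each <p>. Any text wrapped in a child element, presumably a link in the first sentence here, is skipped, and the text on either side of it comes back as separate strings. The sketch below imitates that behaviour with the standard library on a small, made-up piece of well-formed markup; direct_text_nodes, paragraph_text and SAMPLE are hypothetical names, and this is an approximation rather than Scrapy's own XPath engine.

```python
import xml.etree.ElementTree as ET
from typing import List


def direct_text_nodes(element: ET.Element) -> List[str]:
    """Approximate XPath text() for one element: the text before the first
    child, then the text that follows each child. Text inside children is
    skipped."""
    pieces = [element.text] + [child.tail for child in element]
    return [piece for piece in pieces if piece]


def paragraph_text(markup: str) -> List[str]:
    """Collect the direct text nodes of every <p>, in document order."""
    root = ET.fromstring(markup)
    texts: List[str] = []
    for p in root.iter("p"):
        texts.extend(direct_text_nodes(p))
    return texts


# Hypothetical markup: the link text sits inside <a>, not directly inside <p>.
SAMPLE = (
    "<body>"
    '<p>Hello, a service provided by the <a href="https://nmap.org">Nmap Project</a>.</p>'
    "<p>Thanks</p>"
    "</body>"
)

if __name__ == "__main__":
    print(paragraph_text(SAMPLE))
```

To keep the link text as well, an XPath such as //p//text() selects text nodes at any depth below each paragraph.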

Scrapy can be used for a variety of other tasks. To find out more about its classes, spiders, and objects, look through the Scrapy documentation. In this article, we learned how to extract content from websites using Scrapy.
