Saturday, November 16, 2024

Automated Website Scraping using Scrapy

Scrapy is a Python framework for large-scale web scraping. It provides the tools we need to extract data from websites efficiently, process it as we see fit, and store it in the structure and format we prefer. It is currently maintained by Zyte (formerly Scrapinghub), a web scraping development and services company.

What is a Web Crawler (Web Spider)?

A web crawler is a bot that indexes the web by collecting information and related links from websites. The goal of such a bot is to learn what every webpage on the internet is about so that the information can be retrieved when needed.

These bots are almost always operated by search engines. By applying a search algorithm to the data the crawlers collect, a search engine can return relevant links for a query, producing the list of webpages that appears after a user types a search into Google or another engine.

Example: Amazonbot is the Amazon web crawler. Bingbot is Microsoft’s search engine crawler for Bing and Googlebot is the crawler for Google’s search engine. 
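
To make the idea of "collecting related links" concrete, here is a minimal sketch of the link-gathering step of a crawler, using only the Python standard library. The LinkCollector class, the sample HTML, and the example.com / example.org URLs are assumptions made up for illustration; a real crawler would also download the pages, respect robots.txt, and keep track of visited URLs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page content; a real crawler would download this.
SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="https://example.org/docs">Docs</a>
</body></html>
"""

collector = LinkCollector("https://example.com/")
collector.feed(SAMPLE_HTML)
print(collector.links)
```

A crawler would then queue each collected link and repeat the same step on the pages it points to.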

Requirements for Web Scraping

We must first install Scrapy on our system. Open a terminal window in VS Code or a command prompt and run the command below.

pip install scrapy
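
To confirm that the installation worked, one option is to ask Python for the installed package version. The sketch below uses only the standard library; the helper name scrapy_version is an assumption for illustration and is not part of Scrapy.

```python
from importlib import metadata


def scrapy_version():
    """Return the installed Scrapy version string, or None if it is missing."""
    try:
        return metadata.version("scrapy")
    except metadata.PackageNotFoundError:
        return None


version = scrapy_version()
print(f"Scrapy {version}" if version else "Scrapy is not installed")
```

If the second message appears, check that pip installed into the same Python environment that runs the script.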


Step-by-step implementation of the Code

Step 1: To start the project, run the command below in a terminal.

scrapy startproject <projectname>

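The following directories and files will be created automatically in our project folder.

<projectname>/
    scrapy.cfg
    <projectname>/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py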


There is a spiders directory where we will put our spiders. Now let's create a spider: in the spiders directory, create a file named "spider1.py". We will write our code in this file.


Step 2: After successfully installing the module, import it.

Python




# import scrapy module
import scrapy


Step 3: Define a create_spider class that subclasses scrapy.Spider.

The spider is given the name "spid", which is how Scrapy identifies it.

The start_requests method loops through the list of URLs and yields a Request object for each one. Each Request passes through the system until it reaches the Downloader, which executes it and returns a Response object that travels back to the spider that issued the request. The response is then handed to the callback function, which here is parse.

Python




# Spider Class
class create_spider(scrapy.Spider):
     
    # Name of the spider or scrapy
    name = "spid"
 
    # Request function for getting response object from urls.
    def start_requests(self):
        urls = [
            'http://scanme.nmap.org/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)


Step 4: The parse function receives the response object, splits the URL to build a file name, and saves the body of the response to an HTML file.

Python




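    # Parse method of create_spider (keep it indented inside the class);
    # it receives the Response object for each request
    def parse(self, response):

        # Split the URL and take the second-to-last piece (here, the host name)
        page = response.url.split("/")[-2]

        # Build a file name with an .html extension
        filename = f'spdr-{page}.html'

        # Open the file in binary mode; it is saved and closed
        # automatically when the with block ends
        with open(filename, 'wb') as f:
            # Write the raw body of the response into the file
            f.write(response.body)
        self.log(f'Saved file {filename}')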
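
To see what the URL split produces, the snippet below applies the same expression to a few URLs outside of Scrapy. The helper name page_name and the example.com URLs are assumptions for illustration. Note that the result depends on whether the URL ends with a slash.

```python
def page_name(url):
    """Return the second-to-last '/'-separated piece of a URL."""
    return url.split("/")[-2]


# The URL used in this article: the piece before the trailing slash is the host.
print(page_name("http://scanme.nmap.org/"))          # scanme.nmap.org

# Hypothetical URLs showing the effect of a trailing slash.
print(page_name("http://example.com/docs/intro/"))   # intro
print(page_name("http://example.com/docs/intro"))    # docs
```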


The complete code:

Python




import scrapy


# Spider class
class create_spider(scrapy.Spider):

    # Name of the spider
    name = "spid"

    # Generates the Request objects for the URLs to crawl
    def start_requests(self):
        urls = [
            'http://scanme.nmap.org/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Receives the Response object for each request
    def parse(self, response):

        # Split the URL and take the second-to-last piece (here, the host name)
        page = response.url.split("/")[-2]

        # Build a file name with an .html extension
        filename = f'spdr-{page}.html'

        # Open the file in binary mode; it is saved and closed
        # automatically when the with block ends
        with open(filename, 'wb') as f:
            # Write the raw body of the response into the file
            f.write(response.body)
        self.log(f'Saved file {filename}')


Step 5: To start the spider, open a terminal, go to the project directory, and run the crawl command.

cd <project_directory_name>
scrapy crawl <name>

Here <name> is the value of the name variable in the spider1.py file. In our case that value is spid, so the command is:

scrapy crawl spid

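A new HTML file will be created automatically in the directory where the command was run. For the URL above, it is named spdr-scanme.nmap.org.html.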


Extracting the Information: To extract information from the webpage, we can use CSS selectors or XPath expressions.

Step 6: Run the command below to start an interactive Python shell in which Scrapy fetches the target URL and makes the result available as a response object.

scrapy shell <target url>

In this article, our target URL is "http://scanme.nmap.org/".


We will use CSS selectors. Inspect the HTML code of the page and find the selectors that match the information we want.

For example, our target website contains a <main></main> element with the id content. We will use the selector #content to extract the contents of this element.


To perform this task, type the following command in the Scrapy shell.

response.css('#content').extract()

Output: the command returns a list containing the HTML source of the matching element.

 

Similarly, we can use other element selectors.

Other selector examples:

a::attr(href): extracts the value of the href attribute, that is, the link target, of every <a></a> tag.

response.css('a::attr(href)').getall()


We can also use XPath expressions in place of CSS selectors. For example, the following command returns the text nodes of every <p> element:

response.selector.xpath('//p/text()').getall()

Out[10]:

['Hello, and welcome to Scanme.Nmap.Org, a service provided by the ',
'.\n\n',
"We set up this machine to help folks learn about Nmap and also to test and make sure that their Nmap installation (or Internet connection) is working properly.  We are authorized to scan this machine with Nmap or other port scanners.  Try not to hammer on the server too hard.  A few scans in a day is fine, but don't scan 100 times a day or use this site to test our ssh brute-force password cracking tool.\n\n",
'Thanks',
'\n-',
'\n']
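
The list above contains fragments that are only whitespace or punctuation, because text() returns every text node separately. One possible cleanup step, sketched here in plain Python, strips each fragment, drops the empty ones, and joins the rest. The helper name clean_text_nodes is an assumption, and the sample list is a shortened version of the output above.

```python
def clean_text_nodes(nodes):
    """Strip whitespace, drop empty fragments, and join the rest with spaces."""
    stripped = (node.strip() for node in nodes)
    return " ".join(piece for piece in stripped if piece)


# Shortened sample of the text nodes shown in the shell output.
sample = ["Hello, and welcome to ", ".\n\n", "Thanks", "\n-", "\n"]
cleaned = clean_text_nodes(sample)
print(cleaned)
```

In the Scrapy shell, the same helper could be applied directly to the result of the getall() call.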
 

Scrapy can be used for a variety of other tasks. To learn more about its classes, spiders, and objects, consult the Scrapy documentation. In this article, we learned how to extract content from websites using Scrapy.

Dominic Rubhabha-Wardslaus