Scrapy is a fast, high-level web crawling and scraping framework written in Python. It is used to crawl websites and extract structured data from their pages, and it can serve many purposes, from data mining to monitoring and automated testing.
This article is divided into two sections:
- Creating a simple web crawler to scrape book details from a web scraping sandbox website (http://books.toscrape.com/)
- Exploring how Scrapy Feed Exports can be used to store the scraped data as export files in various formats.
Creating a Simple Web Crawler
We are going to create a web crawler to scrape all the book details (URL, title, price) from a web scraping sandbox website.
1. Installation of packages – run the following command from the terminal
pip install scrapy
2. Create a Scrapy project – run the following command from the terminal
scrapy startproject booklist
cd booklist
scrapy genspider book http://books.toscrape.com/
Here,
- Project Name: “booklist”
- Spider Name: “book”
- Domain to be Scraped: “http://books.toscrape.com/”
Directory Structure:
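After running the commands above, the generated project should look roughly like this:

booklist/
    scrapy.cfg
    booklist/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            book.py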
3. Create an Item – replace the contents of the “booklist\items.py” file with the code below
We define each Item scraped from the website as an object with the following 3 fields:
- url
- title
- price
Python
# booklist\items.py
# Define here the models for your scraped items
from scrapy.item import Item, Field


class BooklistItem(Item):
    url = Field()
    title = Field()
    price = Field()
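Scrapy Items behave like dictionaries, so a quick way to confirm the definition is to instantiate one with sample values (the values below are purely illustrative, not scraped data):
Python
# Quick check of BooklistItem (illustrative values only)
from booklist.items import BooklistItem

item = BooklistItem(
    url='catalogue/a-light-in-the-attic_1000/index.html',
    title='A Light in the Attic',
    price='£51.77',
)

print(item['title'])  # fields are read like dictionary keys
print(dict(item))     # an Item converts cleanly to a plain dict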
4. Define the parse function – add the following code to “booklist\spiders\book.py”
The response from the crawler is parsed to extract the book details (URL, title, price), as shown in the code below.
Python
# booklist\spiders\book.py
import scrapy

from booklist.items import BooklistItem


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for article in response.css('article.product_pod'):
            book_item = BooklistItem(
                url=article.css("h3 > a::attr(href)").get(),
                title=article.css("h3 > a::attr(title)").get(),
                price=article.css(".price_color::text").get(),
            )
            yield book_item
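Before running the full crawl, the CSS selectors above can be sanity-checked interactively with scrapy shell (the selectors assume the sandbox site's product listing markup):
scrapy shell "http://books.toscrape.com/"
>>> response.css('article.product_pod h3 > a::attr(title)').get()
>>> response.css('article.product_pod .price_color::text').get()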
5. Run the spider using the following command:
scrapy crawl book
Output:
Scrapy Feed Exports
One of the most frequently required features when implementing scrapers is being able to store the scraped data in an export file.
Scrapy provides this functionality out of the box with Feed Exports, which let you generate feeds of the scraped items using multiple serialization formats and storage backends.
The different file formats supported are:
- CSV
- JSON
- JSON Lines
- XML
1. Saving Files via the Command Line
The simplest way to export the scraped data is to define an output path when starting the spider from the command line.
Add the -o flag to the scrapy crawl command, along with the path of the file you want to save the data to:
- CSV – “data/book_data.csv”
- JSON – “data/book_data.json”
- JSON Lines – “data/book_data.jsonl”
- XML – “data/book_data.xml”
scrapy crawl book -o data/book_data.csv
scrapy crawl book -o data/book_data.json
scrapy crawl book -o data/book_data.jsonl
scrapy crawl book -o data/book_data.xml
This command accepts two flag variants:
- -o – Appends new data to an existing file.
- -O – Overwrites any existing file with the current data.
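For example, to regenerate the JSON export from scratch on each run instead of appending to it:
scrapy crawl book -O data/book_data.json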
2. Saving Files using Feed Exports
For serializing the scraped data, the Feed Exports internally use the Item Exporters.
Saving the Data via FEEDS setting:
The scraped data can be stored by defining the FEEDS setting in “booklist\settings.py”, passing it a dictionary with the path/name of the output file and the file format:
Python
# booklist\settings.py
# To store in CSV format
FEEDS = {
    'data/book_data.csv': {'format': 'csv', 'overwrite': True}
}
Python
# booklist\settings.py
# To store in JSON format
FEEDS = {
    'data/book_data.json': {'format': 'json', 'overwrite': True}
}
Python
# booklist\settings.py
# To store in JSON Lines format
FEEDS = {
    'data/book_data.jsonl': {'format': 'jsonlines', 'overwrite': True}
}
Python
# booklist\settings.py
# To store in XML format
FEEDS = {
    'data/book_data.xml': {'format': 'xml', 'overwrite': True}
}
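Because FEEDS is a dictionary keyed by output path, a single crawl can also write several formats at once by listing multiple entries; a minimal sketch combining two of the settings above:
Python
# booklist\settings.py
# Export the same items to CSV and JSON Lines in a single crawl
FEEDS = {
    'data/book_data.csv': {'format': 'csv', 'overwrite': True},
    'data/book_data.jsonl': {'format': 'jsonlines', 'overwrite': True},
}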
Saving the Data via custom_settings:
The scraped data can also be stored by configuring the FEEDS setting per spider: define custom_settings in the spider (“booklist\spiders\book.py”) for each file format required.
Python
# booklist\spiders\book.py
import scrapy

from booklist.items import BooklistItem


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    # To store in CSV format
    custom_settings = {
        'FEEDS': {'data.csv': {'format': 'csv', 'overwrite': True}}
    }

    def parse(self, response):
        for article in response.css('article.product_pod'):
            book_item = BooklistItem(
                url=article.css("h3 > a::attr(href)").get(),
                title=article.css("h3 > a::attr(title)").get(),
                price=article.css(".price_color::text").get(),
            )
            yield book_item
Python
# booklist\spiders\book.py
import scrapy

from booklist.items import BooklistItem


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    # To store in JSON format
    custom_settings = {
        'FEEDS': {'data.json': {'format': 'json', 'overwrite': True}}
    }

    def parse(self, response):
        for article in response.css('article.product_pod'):
            book_item = BooklistItem(
                url=article.css("h3 > a::attr(href)").get(),
                title=article.css("h3 > a::attr(title)").get(),
                price=article.css(".price_color::text").get(),
            )
            yield book_item
Python
# booklist\spiders\book.py
import scrapy

from booklist.items import BooklistItem


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    # To store in JSON Lines format
    custom_settings = {
        'FEEDS': {'data.jsonl': {'format': 'jsonlines', 'overwrite': True}}
    }

    def parse(self, response):
        for article in response.css('article.product_pod'):
            book_item = BooklistItem(
                url=article.css("h3 > a::attr(href)").get(),
                title=article.css("h3 > a::attr(title)").get(),
                price=article.css(".price_color::text").get(),
            )
            yield book_item
Python
# booklist\spiders\book.py
import scrapy

from booklist.items import BooklistItem


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    # To store in XML format
    custom_settings = {
        'FEEDS': {'data.xml': {'format': 'xml', 'overwrite': True}}
    }

    def parse(self, response):
        for article in response.css('article.product_pod'):
            book_item = BooklistItem(
                url=article.css("h3 > a::attr(href)").get(),
                title=article.css("h3 > a::attr(title)").get(),
                price=article.css(".price_color::text").get(),
            )
            yield book_item
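In recent Scrapy versions, each feed entry also accepts a fields option that restricts which item fields are exported and fixes their order (particularly useful for CSV columns); a minimal sketch of the custom_settings block only:
Python
# booklist\spiders\book.py (custom_settings excerpt)
# 'fields' limits the exported fields and sets the CSV column order
custom_settings = {
    'FEEDS': {
        'data.csv': {
            'format': 'csv',
            'overwrite': True,
            'fields': ['title', 'price', 'url'],
        }
    }
}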
Setting Dynamic File Paths/Names:
The generated data files can also be stored under a dynamic path/name.
The code below creates a JSON file inside the data folder, in a subfolder named after the spider, with a file name made up of the spider name and the time it was scraped.
Python
# settings.py
FEEDS = {
    'data/%(name)s/%(name)s_%(time)s.json': {'format': 'json', 'overwrite': True}
}
The generated path would look something like this:
"data/book/book_2023-01-15T16-53-02.json"
Saved Files Output: