Saturday, November 16, 2024

Writing Scrapy Python Output to JSON file

In this article, we are going to see how to write Scrapy output to a JSON file in Python.

Using the Scrapy command line

The easiest way to save data to JSON is to use the following command:

scrapy crawl <spiderName> -O <fileName>.json

This generates a file with the given name containing all the scraped data.

Note that -O (uppercase) overwrites any existing file with that name, whereas -o (lowercase) appends the new content to the existing file. However, appending to a JSON file makes its contents invalid JSON. To append data to an existing file, use the JSON Lines format instead:

scrapy crawl <spiderName> -o <fileName>.jl

Note: .jl is the file extension for the JSON Lines format, in which each line of the file is a separate JSON object.
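
The difference between the two formats can be checked with Python's json module alone. The sketch below is illustrative only: the sample strings approximate what two appended runs might leave behind, and the helper name parse_json_lines is hypothetical, not part of Scrapy.

```python
import json


def parse_json_lines(text):
    """Parse JSON Lines text: one JSON value per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Roughly what a .json feed looks like after a second run is appended:
# two complete arrays back to back, which is not a single JSON document.
appended_json = '[\n{"text": "a"}\n][\n{"text": "b"}\n]'
try:
    json.loads(appended_json)
    appended_is_valid = True
except json.JSONDecodeError:
    appended_is_valid = False

# A .jl feed after the same two runs: each line still parses on its own.
appended_jl = '{"text": "a"}\n{"text": "b"}\n'
items = parse_json_lines(appended_jl)

print(appended_is_valid)
print(items)
```

Because every line of a .jl file is an independent JSON value, new lines can be added at the end without touching what is already there.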

Stepwise implementation:

Step 1: Creating the project

To start a new Scrapy project, use the following command:

scrapy startproject tutorial

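This will create a tutorial directory containing scrapy.cfg and a tutorial Python package with items.py, middlewares.py, pipelines.py, settings.py, and a spiders directory.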

Move to the tutorial directory we created using the following command:

cd tutorial

Step 2: Creating a spider (tutorial/spiders/quotes_spider.py)

Spiders are classes that the user defines and that Scrapy uses to scrape information from one or more websites. Create a file named quotes_spider.py under the tutorial/spiders directory in your project and add the following code for our spider:

Python3

import scrapy


class QuotesSpider(scrapy.Spider):
    # this attribute must be called 'name'; its value is what
    # `scrapy crawl` uses to find the spider
    name = "quotes"

    # URLs to start scraping from; the list must be called 'start_urls'.
    # Assumed example URL: the Scrapy tutorial's demo site, whose
    # markup matches the selectors used in parse() below
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        # Scrapy calls this method, which must be named 'parse', with
        # the response downloaded for each URL in start_urls
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }


This is a simple spider to get the quotes, author names, and tags from the website.

Step 3: Running the program

To run the spider and save the scraped data to JSON, use:

scrapy crawl quotes -O quotes.json

A file named quotes.json has now been created in the project directory. This file contains all the scraped data.
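
The same export can also be configured in the project rather than on the command line. The fragment below is a sketch that assumes Scrapy 2.4 or later, where the FEEDS setting accepts an overwrite key; placing it in tutorial/settings.py is an illustrative choice and not something the steps above require.

```python
# Hypothetical addition to tutorial/settings.py (assumes Scrapy 2.4+).
# Each key of FEEDS is an output path; its value holds the feed options.
FEEDS = {
    "quotes.json": {
        "format": "json",    # use "jsonlines" for the .jl format
        "encoding": "utf8",
        "overwrite": True,   # same effect as -O; False appends like -o
    },
}
```

With a setting like this in place, scrapy crawl quotes should write quotes.json even when neither -O nor -o is given.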

JSON Output:

The quotes.json file holds every quote scraped by our spider, each stored as an object with text, author, and tags fields.
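
To inspect the result from Python, the file can be read back with the standard json module. This is a minimal sketch that assumes the file was written with -O (so it holds a single JSON array) and that each item has the author key yielded by the spider; the function names load_quotes and quotes_per_author are hypothetical.

```python
import json
from collections import Counter


def load_quotes(path):
    """Load the list of items from a JSON feed file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def quotes_per_author(items):
    """Count how many scraped items each author has."""
    return Counter(item["author"] for item in items)
```

For example, calling quotes_per_author(load_quotes("quotes.json")) from the project directory returns a Counter mapping each author to the number of quotes scraped for them.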
