In this article, we are going to see how to write Scrapy output into a JSON file in Python.
Using the Scrapy command-line tool
The easiest way to save scraped data to JSON is to use the following command:
scrapy crawl <spiderName> -O <fileName>.json
This generates a file with the given name containing all of the scraped data.
Note that -O overwrites any existing file with that name, whereas -o appends the new content to the existing file. However, appending to a JSON file leaves its contents as invalid JSON, so to append data to an existing file, use the JSON Lines format instead:
scrapy crawl <spiderName> -o <fileName>.jl
Note: the .jl extension stands for the JSON Lines format, in which every line of the file is a separate JSON object.
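To see why appending breaks a .json file but not a .jl file, the following minimal sketch simulates two runs using only the standard library. It imitates the shape of the exported files rather than calling Scrapy itself, and the file names and sample items are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical items standing in for two separate crawl runs.
first_run = [{"text": "quote A", "author": "Author 1"}]
second_run = [{"text": "quote B", "author": "Author 2"}]

workdir = Path(tempfile.mkdtemp())
json_path = workdir / "items.json"
jl_path = workdir / "items.jl"

for run in (first_run, second_run):
    # JSON format: each run writes one complete array, so appending
    # leaves two arrays back to back in the same file.
    with json_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(run) + "\n")
    # JSON Lines format: each item is its own line, so appending is safe.
    with jl_path.open("a", encoding="utf-8") as f:
        for item in run:
            f.write(json.dumps(item) + "\n")


def is_valid_json(path):
    """Return True if the whole file parses as a single JSON document."""
    try:
        json.loads(path.read_text(encoding="utf-8"))
        return True
    except json.JSONDecodeError:
        return False


def read_json_lines(path):
    """Parse a JSON Lines file into a list of items, one per non-empty line."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


json_ok = is_valid_json(json_path)
items = read_json_lines(jl_path)
print(json_ok, len(items))  # prints: False 2
```

The appended .json file can no longer be parsed as one document, while the .jl file still yields both items when read line by line.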
Stepwise implementation:
Step 1: Creating the project
To start a new Scrapy project, use the following command:
scrapy startproject tutorial
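This will create a tutorial directory with the following content:
tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py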
Move to the tutorial directory we created using the following command:
cd tutorial
Step 2: Creating a spider (tutorial/spiders/quotes_spider.py)
Spiders are classes that the user defines and that Scrapy uses to scrape information from one or more websites. The code for our spider is shown below. Create a file named quotes_spider.py under the tutorial/spiders directory in your project:
Python3
import scrapy


class QuotesSpider(scrapy.Spider):
    # Scrapy identifies the spider by this attribute; it must be called 'name'
    name = "quotes"

    # URLs the crawl starts from; the attribute must be called 'start_urls'
    # (list the page URL(s) to scrape here)
    start_urls = [
    ]

    def parse(self, response):
        # Default callback that handles the response downloaded for each
        # request; it must be called 'parse'
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
This is a simple spider that extracts the quote text, author name, and tags from each quote on the page.
Step 3: Running the program
To run the spider and save the crawled data to JSON, use:
scrapy crawl quotes -O quotes.json
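The same export can also be configured in the project rather than on the command line. The fragment below is a sketch of what could be added to tutorial/settings.py, assuming Scrapy 2.4 or later, where the FEEDS setting accepts an overwrite option; the file names are only examples.

```python
# Sketch of a FEEDS entry for tutorial/settings.py (file names are examples).
FEEDS = {
    # Similar in effect to `-O quotes.json`: rewrite the file on each run.
    "quotes.json": {
        "format": "json",
        "encoding": "utf8",
        "indent": 4,
        "overwrite": True,
    },
    # Similar in effect to `-o quotes.jl`: append one item per line.
    "quotes.jl": {
        "format": "jsonlines",
        "encoding": "utf8",
        "overwrite": False,
    },
}
```

With a setting like this in place, running scrapy crawl quotes without -o or -O should write both files.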
A file named quotes.json is created in the project directory; it contains all of the scraped data.
JSON Output:
These are just a few of the many quotes in the quotes.json file scraped by our spider.
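One way to check the export is to load it back with the standard json module. The sketch below assumes quotes.json sits in the current directory and that each item has the 'author' key produced by the spider above; summarize_quotes is a hypothetical helper name, not part of Scrapy.

```python
import json
import os
from collections import Counter


def summarize_quotes(path):
    """Load a quotes JSON export and count how many quotes each author has."""
    with open(path, encoding="utf-8") as f:
        quotes = json.load(f)  # the JSON export is a single array of items
    return Counter(quote["author"] for quote in quotes)


if __name__ == "__main__":
    # Only report when the export from the crawl is actually present.
    if os.path.exists("quotes.json"):
        for author, count in summarize_quotes("quotes.json").most_common(5):
            print(f"{author}: {count}")
```

If json.load raises an error here, the file is not valid JSON, which is what happens after appending to it with -o.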