Scrapy is a popular Python library for web scraping, which provides an easy and efficient way to extract data from websites for a variety of tasks including data mining and information processing. In addition to being a general-purpose web crawler, Scrapy may also be used to retrieve data via APIs.
One of the most common data formats returned by APIs is JSON, which stands for JavaScript Object Notation. In this article, we’ll look at how to scrape a JSON response using Scrapy.
To install Scrapy, run the following command in your command line or terminal:
pip install scrapy
Example
Now we’ll look at an example that extracts data from the public Bored API endpoint (https://www.boredapi.com/api/activity).
Here’s what the actual data returned looks like:
{
  "activity": "Learn calligraphy",
  "type": "education",
  "participants": 1,
  "price": 0.1,
  "link": "",
  "key": "4565537",
  "accessibility": 0.1
}
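To see how this payload maps onto Python types, the sample response above can be parsed with the standard-library json module (the values are just the example data shown above):

```python
import json

# sample payload returned by the Bored API endpoint
payload = """
{
  "activity": "Learn calligraphy",
  "type": "education",
  "participants": 1,
  "price": 0.1,
  "link": "",
  "key": "4565537",
  "accessibility": 0.1
}
"""

# json.loads turns the JSON text into a Python dictionary
data = json.loads(payload)
print(data["activity"])      # Learn calligraphy
print(data["participants"])  # 1
```

Strings become str, numbers become int or float, so the fields can be read with ordinary dictionary indexing.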
Python3
# import modules
import scrapy
import json


class Spider(scrapy.Spider):
    name = "bored"

    def start_requests(self):
        # request the Bored API endpoint and send the response to parse()
        url = "https://www.boredapi.com/api/activity"
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # load the JSON response into a Python dictionary
        data = json.loads(response.text)
        activity = data["activity"]
        activity_type = data["type"]  # avoid shadowing the built-in type()
        participants = data["participants"]
        yield {
            "Activity": activity,
            "Type": activity_type,
            "Participants": participants,
        }
Explanation:
Here we have a Scrapy spider named Spider. The spider has 3 main parts:
- The name attribute – sets the name of the spider to “bored”.
- The start_requests method – initiates the request to the API endpoint at “https://www.boredapi.com/api/activity”. The method yields a Scrapy request object and passes it to the parse method.
- The parse method – handles the response from the API endpoint. The method loads the JSON response data into a Python dictionary using the json.loads function. Then, it extracts the values of the “activity”, “type”, and “participants” keys from the dictionary and stores them in variables with the same names. Finally, it yields a dictionary with the activity, type, and participants as keys and their corresponding values.
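Because parse only reads response.text, its extraction logic can be sanity-checked offline without running a crawl. The sketch below uses a hypothetical FakeResponse stub (not part of Scrapy) in place of the real response object:

```python
import json


class FakeResponse:
    """Hypothetical stand-in for a Scrapy response in a quick offline check."""
    def __init__(self, text):
        self.text = text


def parse(response):
    # same extraction logic as the spider's parse method
    data = json.loads(response.text)
    yield {
        "Activity": data["activity"],
        "Type": data["type"],
        "Participants": data["participants"],
    }


sample = '{"activity": "Learn calligraphy", "type": "education", "participants": 1}'
item = next(parse(FakeResponse(sample)))
print(item)  # {'Activity': 'Learn calligraphy', 'Type': 'education', 'Participants': 1}
```

This kind of stub is handy for verifying that the right keys are extracted before pointing the spider at the live endpoint.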
To run this spider, type the following into your terminal:
scrapy runspider <file name>
Output:
Now, this output will contain a lot of log lines you don’t need, so it’s better to store the parsed items in a separate file. You can do this by adding the -o flag, followed by the output file name, to the command. The “-L ERROR” flag is added to suppress all log output other than error messages:

scrapy runspider <file name> -o activity.json -L ERROR
activity.json looks like this: