The World Wide Web holds a large amount of data that is consistently growing both in quantity and in variety. Python lets us collect data and information of interest from the World Wide Web through APIs. An API is a very useful tool for data scientists, web developers, and even any casual person who wants to find and extract information programmatically.
API vs Web Scraping
Most websites provide APIs to share data in a structured format; however, they typically restrict what data is available and may also limit how frequently it can be accessed. Additionally, the website's developers can change, remove, or restrict the backend API at any time.
On the other hand, some websites do not provide an API to share their data at all. In short, we cannot always rely on APIs to access the online data we may want, so we may need to fall back on web scraping techniques.
Python version
When it comes to working with APIs, Python is usually the programming language of choice. It is an easy-to-use language with a very rich ecosystem of tools for many tasks. If you program in other languages, you will find it easy to pick up Python, and you may never go back.
The Python Software Foundation announced that Python 2 would be phased out of development and support in 2020. For this reason, we will use Python 3 and a Jupyter notebook throughout this post. To be more specific, my Python version is:
Python3
from platform import python_version

print(python_version())
3.6.10
Structure of the target website
Before attempting to access the content of a website by API or web crawling, we should always develop an understanding of the structure of our target website. The sitemap and robots.txt of a website help us with some vital information apart from external tools such as Google Search and WHOIS.
Validating robots.txt file
Most websites define a robots.txt file to notify users about restrictions when accessing the site. These restrictions are guidelines only, but it is highly recommended to respect them. You should always check the contents of robots.txt to understand the structure of the website and minimize the chance of being blocked.
The robots.txt file is a valuable resource to check before deciding whether to write a web crawler program or to use an API.
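For example, a minimal sketch using Python's built-in urllib.robotparser, which is one convenient way to ask whether a given URL may be fetched:

Python3

from urllib.robotparser import RobotFileParser

# point the parser at the site's robots.txt and read it
parser = RobotFileParser()
parser.set_url('https://github.com/robots.txt')
parser.read()

# check whether a generic user agent is allowed to fetch a given path
print(parser.can_fetch('*', 'https://github.com/search'))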
Understanding the problem
In this post, we will gather the JavaScript repositories with the highest stars from GitHub, famously known as the developers' Facebook, so let me first check out their robots.txt file.
The following content (first few lines only) is from the robots.txt file of the website – https://github.com/robots.txt.
From the file, it is clear that GitHub wants its contents to be accessed using an API. One way of solving our problem would be to put our search criteria in the GitHub search box and press enter; however, that is a manual activity.
Helpfully, Github exposes this search capability as an API we can consume from our own applications. Github’s Search API gives us access to the built-in search function. This includes the use of logical and scoping operators, like “or” and “user”.
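For example, a search that scopes JavaScript repositories to a single user can be expressed as a query like the following (the user name facebook is only an illustration):

https://api.github.com/search/repositories?q=language:javascript+user:facebook&sort=stars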
Before we jump into the code, there is something you should know about public repositories, private repositories, and access restrictions. Public repositories are usually open to the public with no restrictions while private repositories are restricted only to the owners and to the collaborators they choose.
Step 1: Validating with cURL.
Now let's quickly validate access to GitHub before putting effort into writing any Python code. To do that, cURL, a simple command-line HTTP tool, is a perfect fit. cURL is preinstalled on most Linux machines; if not, you can easily install it with – yum install curl
For Windows, get a copy from https://curl.haxx.se/download.html.
Now run the command as shown below:
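For example, a plain GET request to the GitHub API root, with -i so that the response headers are printed as well:

curl -i https://api.github.com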
cURL gives us a lot of information:
- HTTP/1.1 200 OK – When the request URL and associated parameters are correct, GitHub responds with a 200 (Success) status code.
- X-RateLimit-Limit – The maximum number of requests you're permitted to make per hour.
- X-RateLimit-Remaining – The number of requests remaining in the current rate limit window.
- X-RateLimit-Reset – The time at which the current rate limit window resets, in UTC epoch seconds (these rate-limit headers can also be read from Python, as shown in the sketch after this list).
- "repository_search_url": This is the one we will be using in this post to query the repositories.
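The rate-limit values come back as ordinary HTTP response headers, so they can be inspected from Python as well; a minimal sketch:

Python3

import requests

response = requests.get('https://api.github.com',
                        headers={'Accept': 'application/vnd.github.v3+json'})

# rate-limit information is exposed as plain response headers
print(response.headers.get('X-RateLimit-Limit'))
print(response.headers.get('X-RateLimit-Remaining'))
print(response.headers.get('X-RateLimit-Reset'))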
Step 2: Authentication
Usually, there are a couple of ways to authenticate when making a request to the GitHub API – using a username and password (HTTP Basic) or using OAuth tokens. The details of authentication will not be covered in this post.
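For reference only, a minimal sketch of a token-based request (the token string is a placeholder, not a real credential):

Python3

import requests

# YOUR_TOKEN is a placeholder for a GitHub personal access token
headers = {
    'Accept': 'application/vnd.github.v3+json',
    'Authorization': 'token YOUR_TOKEN',
}
response = requests.get('https://api.github.com/user', headers=headers)
print(response.status_code)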
Since GitHub allows us to access public content without any authentication, we will stick to unauthenticated requests. This means the client we write will not require authentication, so we will be searching public repositories only.
Step 3: Github Response with Python
Python3
# 1 - imports
import requests

# 2 - set the site url (search JavaScript repos, sorted by stars in descending order)
site_url = 'https://api.github.com/search/repositories?q=language:javascript&sort=stars&order=desc'

# 3 - set the headers
headers = {'Accept': 'application/vnd.github.v3+json'}

# 4 - call the url with headers and save the response
response = requests.get(site_url, headers=headers)

# 5 - print the response status code
print(f"Response from {site_url} is {response.status_code}")
Output:
We started by importing requests (if it is missing, install it with pip install requests) and then assigned the variable site_url the URL of our interest, which searches for JavaScript repositories sorted in descending order of stars.
GitHub is currently on the third version of its API, so we defined headers for the API call that explicitly ask to use the 3rd version. Feel free to check out the latest version here – https://docs.github.com/en/free-pro-team@latest/developers/overview/about-githubs-apis.
Then we call get() and pass it the site_url and the headers; the resulting response object is assigned to the response variable. The response from GitHub is JSON. The response object has an attribute status_code, which tells us whether the request was successful (200) or not.
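If you want to fail fast instead of just printing the status, requests also provides raise_for_status(), which raises an exception for 4xx/5xx responses; a small optional sketch:

Python3

# optional guard: raise an HTTPError if GitHub returned a 4xx/5xx status
response.raise_for_status()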
Step 4: Converting JSON response to Python dictionary
Python3
response_json = response.json()

print(f"keys in the Json file : {response_json.keys()}")
print(f"Total javascript repositories in GitHub : {response_json['total_count']}")
Output:
As mentioned earlier, the response is JSON. Our JSON has three keys, of which we can ignore "incomplete_results" for such a small query. The program output displays the total number of JavaScript repositories on GitHub returned for our search via response_json['total_count'].
Step 5: Looking at our first repository
Python3
repositories = response_json['items']
first_repo = repositories[0]

print(f"Output \n *** Repository information keys total - {len(first_repo)} - values are -\n")
for key in sorted(first_repo.keys()):
    print(key)

print(f" *** Repository name - {first_repo['name']}, Owner - {first_repo['owner']['login']}, total watchers - {first_repo['watchers_count']}")
Output:
The above code is self-explanatory: we display all the keys inside the dictionary and then display information about our first repository.
Step 6: Loop for more…
We have looked at one repository; to see more, we obviously need to loop through them.
Python3
for repo_info in repositories:
    print(f"\n *** Repository Name: {repo_info['name']}")
    print(f" *** Repository Owner: {repo_info['owner']['login']}")
    print(f" *** Repository Description: {repo_info['description']}")
Output:
Step 7: Visualization with Plotly
It is time for a visualization using the data we now have, to show the popularity of JavaScript projects on GitHub. Digesting information visually is always helpful.
Before using Plotly, you need to install the package. For installation, run this command in the terminal:
pip install plotly
Code:
Python3
# imports
import requests
from plotly.graph_objs import Bar
from plotly import offline

# site url and headers
site_url = 'https://api.github.com/search/repositories?q=language:javascript&sort=stars&order=desc'
headers = {'Accept': 'application/vnd.github.v3+json'}

# get the response and parse it
response = requests.get(site_url, headers=headers)
response_json = response.json()
repositories = response_json['items']

# loop over the repositories
repo_names, repo_stars = [], []
for repo_info in repositories:
    repo_names.append(repo_info['name'])
    repo_stars.append(repo_info['stargazers_count'])

# graph plotting
data_plots = [{'type': 'bar', 'x': repo_names, 'y': repo_stars}]
layout = {'title': "GitHub's Most Popular JavaScript Projects",
          'xaxis': {'title': 'Repository'},
          'yaxis': {'title': 'Stars'}}

# save the graph to Most_Popular_JavaScript_Repos.png
fig = {'data': data_plots, 'layout': layout}
offline.plot(fig, image='png', image_filename='Most_Popular_JavaScript_Repos')
The above code, when executed, will save the bar chart to a PNG file, Most_Popular_JavaScript_Repos.png, in the current directory.
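Note that image export through offline.plot() depends on the Plotly version installed; with recent Plotly releases you may need to export the figure explicitly instead, for example with write_image() and the kaleido package. A rough sketch, assuming fig is the figure dict built above:

Python3

import plotly.io as pio

# assumes `fig` is the figure dict built above and kaleido is installed
# (pip install kaleido)
pio.write_image(fig, 'Most_Popular_JavaScript_Repos.png')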
Step 8: Creating a Presentation… Introduction…
Microsoft products, especially Excel spreadsheets and PowerPoint presentations, are ruling the world. So we are going to create a PowerPoint presentation with the visualization graph we just created.
To install python-pptx, run this command in the terminal:
pip install python-pptx
We will begin by creating our first slide with the title "Popular JavaScript Repositories in GitHub".
Python3
from pptx import Presentation

# create a Presentation object
ppt = Presentation()

# add a new slide
slide = ppt.slides.add_slide(ppt.slide_layouts[0])

# set the title text on the slide
slide.shapes.title.text = "Popular JavaScript Repositories in GitHub"

# save the PowerPoint
ppt.save('Javascript_report.pptx')
Output:
We first imported Presentation from pptx and then created a ppt object using the Presentation class of the pptx module. A new slide is added with the add_slide() method, and the title text is set through slide.shapes.
Step 9: Saving the chart to pptx.
Now that the basics of creating a PowerPoint presentation are covered in the steps above, let's dive into the final piece of code to create the report.
Python3
from pptx import Presentation
from pptx.util import Inches
from datetime import date

# create a Presentation object
ppt = Presentation()
first_slide = ppt.slides.add_slide(ppt.slide_layouts[0])

# title (includes the date)
title = "Popular JavaScript Repositories in GitHub - " + str(date.today())

# set the title on the first slide
first_slide.shapes[0].text_frame.paragraphs[0].text = title

# slide 2 - set the image
img = 'Most_Popular_JavaScript_Repos.png'
second_slide = ppt.slide_layouts[1]
slide2 = ppt.slides.add_slide(second_slide)

# adjust the image attributes if you are not OK with the height and width
pic = slide2.shapes.add_picture(img, left=Inches(2), top=Inches(1), height=Inches(5))

# save the PowerPoint presentation
ppt.save('Javascript_report.pptx')
Output:
Finally, we will put all of the steps discussed above into a single program.
Python3
import requests
from plotly.graph_objs import Bar
from plotly import offline
from pptx import Presentation
from pptx.util import Inches
from datetime import date


def github_api():
    # site url and headers
    site_url = 'https://api.github.com/search/repositories?q=language:javascript&sort=stars&order=desc'
    headers = {'Accept': 'application/vnd.github.v3+json'}

    # get the response and parse it
    response = requests.get(site_url, headers=headers)
    response_json = response.json()
    repositories = response_json['items']

    # loop over the repositories
    repo_names, repo_stars = [], []
    for repo_info in repositories:
        repo_names.append(repo_info['name'])
        repo_stars.append(repo_info['stargazers_count'])

    # graph plotting
    data_plots = [{'type': 'bar', 'x': repo_names, 'y': repo_stars}]
    layout = {'title': "GitHub's Most Popular JavaScript Projects",
              'xaxis': {'title': 'Repository'},
              'yaxis': {'title': 'Stars'}}

    # save the graph to Most_Popular_JavaScript_Repos.png
    fig = {'data': data_plots, 'layout': layout}
    offline.plot(fig, image='png', image_filename='Most_Popular_JavaScript_Repos')


def create_pptx_report():
    # create a Presentation object
    ppt = Presentation()
    first_slide = ppt.slides.add_slide(ppt.slide_layouts[0])

    # title (includes the date)
    title = "Popular JavaScript Repositories in GitHub - " + str(date.today())

    # set the title on the first slide
    first_slide.shapes[0].text_frame.paragraphs[0].text = title

    # slide 2 - set the image
    img = 'Most_Popular_JavaScript_Repos.png'
    second_slide = ppt.slide_layouts[1]
    slide2 = ppt.slides.add_slide(second_slide)

    # adjust the image attributes if you are not OK with the height and width
    pic = slide2.shapes.add_picture(img, left=Inches(2), top=Inches(1), height=Inches(5))

    # save the PowerPoint presentation
    ppt.save('Javascript_report.pptx')


if __name__ == '__main__':
    github_api()
    create_pptx_report()