Friday, January 17, 2025

Scrape Google Ngram Viewer using Python

In this article, we will learn how to scrape Google Ngram using Python. Google Ngram Viewer (also called Google Books Ngram Viewer) is an online search engine that charts how frequently a set of search strings appears in books over time.

What is Google Ngram Viewer?

The Google Ngram Viewer is a search engine used to determine the popularity of a word or a phrase in books. It offers several filter options, including the language or genre of the books (called the corpus) and the range of years in which the books were published. By default, the search is case-sensitive.

What Does a Google Ngram Search Result Look Like?

If we search for “Albert Einstein” in Google Ngram, the search result is a line graph showing how often the phrase appears in books for each year in the selected range.

A phrase with only one word (say “geek”) is called a unigram. Similarly, a phrase containing two words (say “Isaac Newton”) is called a bigram. If you search for a bigram in Google Ngram, the graph shows, for each year, what percentage of all the bigrams present in books from that year match the phrase you searched for.
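To make the percentage idea concrete, here is a toy sketch that computes the same kind of share on a single sentence. The function name ngram_share and the sample text are made up for illustration; Google computes its values over whole book corpora, year by year, not like this.

```python
from collections import Counter


def ngram_share(text, phrase):
    """Return the fraction of n-grams in text that equal phrase.

    n is taken from the number of words in phrase, and matching is
    case-sensitive, like the default Ngram Viewer search.
    """
    words = text.split()
    target = tuple(phrase.split())
    n = len(target)
    # Slide a window of n words over the text to list every n-gram.
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return Counter(ngrams)[target] / len(ngrams)


sample = "Isaac Newton studied light and Isaac Newton studied motion"
# The sample has 8 bigrams and 2 of them are "Isaac Newton".
print(ngram_share(sample, "Isaac Newton"))  # 0.25
```

Multiplying the returned fraction by 100 gives the percentage.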

We can even compare the popularity of different phrases in the same search result by separating them with commas. For example, we can compare the popularity of “Albert Einstein” vs “Isaac Newton” from the years 1850 to 1900 across different books written in the English language.

Working With the Google Ngram URLs

If we search for “Albert Einstein” in Google Ngram with the years ranging from 1850 to 1860, the corpus set to English, and smoothing set to 0, we will see a graph of the phrase's frequency for those years. The URL of this search query will look like this.

https://books.google.com/ngrams/graph?content=Albert%20Einstein&year_start=1850&year_end=1860&corpus=26&smoothing=0

In the above URL, if we replace the word graph with the word json, we will get the JSON data of our search query instead of the graph. The new URL will look like this.

https://books.google.com/ngrams/json?content=Albert%20Einstein&year_start=1850&year_end=1860&corpus=26&smoothing=0
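The graph-to-json swap can also be done in code. The sketch below uses the standard library's urllib.parse to change only the last path segment, so the query string is left untouched. The helper name graph_url_to_json_url is hypothetical and not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def graph_url_to_json_url(graph_url):
    """Turn an Ngram Viewer graph URL into the matching json URL."""
    parts = urlsplit(graph_url)
    # Only touch the path, so the word "graph" inside the query
    # (for example in the searched phrase) is never replaced.
    if not parts.path.endswith("/graph"):
        raise ValueError("expected a URL whose path ends with /graph")
    json_path = parts.path[:-len("graph")] + "json"
    return urlunsplit(parts._replace(path=json_path))


graph_url = ("https://books.google.com/ngrams/graph?content=Albert%20Einstein"
             "&year_start=1850&year_end=1860&corpus=26&smoothing=0")
print(graph_url_to_json_url(graph_url))
```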

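The search result of this URL is a JSON list with one entry per phrase. Each entry contains, among other fields, the phrase itself (ngram) and its list of yearly values (timeseries).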

We can extract this JSON data using Python.
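As a small offline sketch of that extraction, the snippet below parses a hand-written JSON string with Python's json module. The sample is an assumption: it only mimics the "ngram" and "timeseries" keys that the runQuery function in this article reads, and its numbers are made up.

```python
import json

# Hypothetical sample shaped like the response. Only the key names come
# from this article; the values are invented for the example.
sample_body = '[{"ngram": "Albert Einstein", "timeseries": [0.0, 1.5e-09, 0.0]}]'

# json.loads turns the JSON text into Python lists and dictionaries.
records = json.loads(sample_body)

for record in records:
    print(record["ngram"], record["timeseries"])
```

A real response may carry extra keys, which this loop would simply ignore.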

How to Scrape Google Ngrams?

To scrape Google Ngram, we will use Python’s requests and urllib libraries.

Now, we will create a function that extracts the data from Google Ngram’s website. The comments in the code explain each step.

Python3




import requests
import urllib.parse
  
def runQuery(query, start_year=1850,
             end_year=1860, corpus=26,
             smoothing=0):
  
    # converting a regular string to 
    # the standard URL format
    # eg: "neveropen for,neveropen" will
    # convert to "neveropen%20for%2Cneveropen"
    query = urllib.parse.quote(query)
  
    # creating the URL
    url = ('https://books.google.com/ngrams/json?content=' + query +
           '&year_start=' + str(start_year) + '&year_end=' +
           str(end_year) + '&corpus=' + str(corpus) + '&smoothing=' +
           str(smoothing))
  
    # requesting data from the above url
    response = requests.get(url)
  
    # extracting the json data from the response we got
    output = response.json()
  
    # creating a list to store the ngram data
    return_data = []
  
    if len(output) == 0:
        # if no data returned from site,
        # print the following statement
        return "No data available for this Ngram."
    else:
        # if data returned from site,
        # store the data in return_data list
        for entry in output:
            # getting the name and the ngram data
            return_data.append((entry['ngram'], entry['timeseries']))
  
    return return_data


The runQuery function takes the query string as its only required argument; the rest of the arguments have default values. By default, the year range is 1850 to 1860, the corpus is 26 (i.e. English language), and the smoothing is 0. We build the Google Ngram URL from these arguments and use it to request the data from Google Ngram. Once the JSON data is returned, we store the phrase and its values for each entry in a list and return the list. If the site returns no data, the function returns a message string instead.

Now, let us use the runQuery function to find out the popularity of “Albert Einstein”.

Python3




query = "Albert Einstein"
  
print(runQuery(query))


Output:

[('Albert Einstein', [0.0, 0.0, 0.0, 0.0, 2.171790969285325e-09, 1.014315520464492e-09, 6.44787723214079e-10, 0.0, 7.01216085197131e-10, 0.0, 0.0])]
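The output is a bare list of numbers, so it helps to attach a year to each value. The sketch below assumes the list holds one value per year, starting at start_year, which matches the 11 values returned here for 1850 to 1860. The helper name label_years is made up for this example.

```python
def label_years(timeseries, start_year):
    """Map each value to its year, assuming one value per consecutive year."""
    return {start_year + offset: value
            for offset, value in enumerate(timeseries)}


# Values copied from the output above.
values = [0.0, 0.0, 0.0, 0.0, 2.171790969285325e-09,
          1.014315520464492e-09, 6.44787723214079e-10, 0.0,
          7.01216085197131e-10, 0.0, 0.0]

by_year = label_years(values, 1850)
print(by_year[1854])
```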

We can even enter multiple phrases in the same query by separating each phrase with commas.

Python3




query = "Albert Einstein,Isaac Newton"
  
print(runQuery(query))


Output:

[('Albert Einstein', [0.0, 0.0, 0.0, 0.0, 2.171790969285325e-09, 1.014315520464492e-09, 6.44787723214079e-10, 0.0, 7.01216085197131e-10, 0.0, 0.0]), ('Isaac Newton', [1.568728407619346e-06, 1.135979687205690e-06, 1.140318772741011e-06, 1.102130454455618e-06, 1.34806168716750e-06, 2.039112359852879e-06, 1.356955749542976e-06, 1.121004174819972e-06, 1.223622120960499e-06, 1.18965874662535e-06, 1.077695060303085e-06])]
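One simple way to compare the phrases is to find the year in which each one peaked. The sketch below works on the list of (phrase, values) tuples that runQuery returns and again assumes one value per year starting at start_year. The helper name peak_years is hypothetical, and it expects every list of values to be non-empty.

```python
def peak_years(results, start_year):
    """Return {phrase: (peak_year, peak_value)} for runQuery-style results."""
    peaks = {}
    for ngram, timeseries in results:
        # Index of the largest value; ties resolve to the earliest year.
        best = max(range(len(timeseries)), key=timeseries.__getitem__)
        peaks[ngram] = (start_year + best, timeseries[best])
    return peaks


# Data copied from the output above.
results = [
    ('Albert Einstein', [0.0, 0.0, 0.0, 0.0, 2.171790969285325e-09,
                         1.014315520464492e-09, 6.44787723214079e-10, 0.0,
                         7.01216085197131e-10, 0.0, 0.0]),
    ('Isaac Newton', [1.568728407619346e-06, 1.135979687205690e-06,
                      1.140318772741011e-06, 1.102130454455618e-06,
                      1.34806168716750e-06, 2.039112359852879e-06,
                      1.356955749542976e-06, 1.121004174819972e-06,
                      1.223622120960499e-06, 1.18965874662535e-06,
                      1.077695060303085e-06]),
]

for phrase, (year, value) in peak_years(results, 1850).items():
    print(phrase, year, value)
```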

Dominic Rubhabha-Wardslaus