Newspaper scraping using Python and News API

27 July 2024


There are mainly two ways to extract data from a website:

Use the API of the website (if it exists). For example, Facebook has the Facebook Graph API which allows retrieval of data posted on Facebook.
Access the HTML of the webpage and extract useful information/data from it. This technique is called web scraping or web harvesting or web data extraction.
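
To make the second approach concrete, here is a minimal sketch of extracting data from HTML using only Python's standard-library `html.parser`. The `HeadlineParser` class and the sample markup are made up for illustration; a real page would need selectors suited to its own structure.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text inside every <h2> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        # Only keep text that appears while we are inside an <h2>
        if self.in_h2:
            self.headlines[-1] += data


# Hand-written sample markup, not taken from a real site
sample_html = (
    "<html><body><h2>First story</h2><p>Text</p>"
    "<h2>Second story</h2></body></html>"
)
parser = HeadlineParser()
parser.feed(sample_html)
print(parser.headlines)
```

The rest of this article takes the first approach, since an API returns structured data and avoids this kind of parsing.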

In this article, we will use the News API. You can create your own API key on the News API website (newsapi.org). Example: let's determine which topics newspapers associate with a public figure such as a head of state, taking Angela Merkel as the case.

Python3

import pprint
import requests
 
 
secret = "Your API"
  
# Define the endpoint
url = 'https://newsapi.org/v2/everything?'
  
# Specify the query and
# number of returns
parameters = {
    'q': 'merkel', # query phrase
    'pageSize': 100,  # maximum is 100
    'apiKey': secret # your own API key
}
  
# Make the request
response = requests.get(url, 
                        params = parameters)
  
# Convert the response to 
# JSON format and pretty print it
response_json = response.json()
pprint.pprint(response_json)
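
The request above assumes the call succeeds. As a minimal sketch, the decoded payload can be checked before it is used. This assumes the JSON envelope carries a `status` field equal to "ok" on success, and `code` and `message` fields on failure; `extract_articles` is a hypothetical helper name, and the sample payloads are hand-written rather than real API output.

```python
def extract_articles(payload):
    """Return the list of articles from a News API style payload.

    Assumes payload is the decoded JSON dict and that a successful
    response has status == "ok" (an assumption about the envelope).
    """
    if payload.get("status") != "ok":
        # On failure, surface whatever error details the payload carries
        code = payload.get("code", "unknown")
        message = payload.get("message", "no message")
        raise RuntimeError(f"News API request failed ({code}): {message}")
    return payload.get("articles", [])


# Hand-written sample payloads (not real API output)
sample_ok = {
    "status": "ok",
    "totalResults": 1,
    "articles": [{"title": "Example title", "description": "Example text"}],
}
sample_error = {"status": "error", "code": "apiKeyInvalid", "message": "Bad key"}

articles = extract_articles(sample_ok)
print(len(articles))
```

With the real response, the same helper would be called as `extract_articles(response.json())`.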

Let's combine all the article descriptions and list the words from most frequent to least frequent.

Python3

from wordcloud import WordCloud
import matplotlib.pyplot as plt
 
 
text_combined = ''
 
for i in response_json['articles']:
     
    if i['description'] is not None:
        text_combined += i['description'] + ' '
         
wordcount={}
for word in text_combined.split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
 
# Print the words from most frequent to least frequent
for k, v in sorted(wordcount.items(),
                   key=lambda item: item[1],
                   reverse=True):
    print(k, v)
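
The same count-and-sort step can also be sketched with `collections.Counter` from the standard library. This is an alternative to the manual dictionary, not the method used in the rest of the article; the sample text is made up.

```python
from collections import Counter

sample_text = "merkel meets macron merkel speaks berlin merkel"  # made-up text

# Counter builds the word -> count mapping in one step
word_counts = Counter(sample_text.split())

# most_common() yields (word, count) pairs sorted by descending count
for word, count in word_counts.most_common():
    print(word, count)
```

Passing an integer, as in `word_counts.most_common(10)`, limits the listing to the ten most frequent words.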

This ranking is hard to interpret because it is dominated by common words such as articles and prepositions. We can make it clearer by removing these uninformative words. Let's define a list of them, called bad_words.


Now we can clean the text by stripping punctuation and removing the bad words.

Python3

# Stop words to exclude (English, French and German); all lowercase
# because each word is lowercased before the lookup
bad_words = ["a", "the", "of", "in", "to", "and", "on", "de", "with",
             "by", "at", "dans", "ont", "été", "les", "des", "au", "et",
             "après", "avec", "qui", "par", "leurs", "ils", "pour",
             "as", "france", "eux", "où", "son", "le", "la",
             "en", "is", "has", "for", "that", "an", "but", "be",
             "are", "du", "it", "à", "had", "ist", "um", "zu", "den",
             "der", "die", "-", "und", "für", "von", "als",
             "sich", "nicht", "nach", "auch"]
 
 
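# Replace commas and periods with spaces; split() below already
# collapses any run of whitespace, so no regex is needed
r = text_combined.replace(',', ' ').replace('.', ' ')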
words = r.split()
rst = [word for word in words if
       ( word.lower() not in bad_words 
        and len(word) > 3) ]
 
rst = ' '.join(rst)
  
wordcount={}
 
for word in rst.split():
     
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
  
# Print the remaining words from most frequent to least frequent
for k, v in sorted(wordcount.items(),
                   key=lambda item: item[1],
                   reverse=True):
    print(k, v)
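
The cleaning step only strips commas and periods, so other punctuation stays attached to words. As a hedged alternative, the sketch below tokenizes with a regular expression instead. The `tokenize` function, its `min_length` parameter, and the sample sentence are all hypothetical; `min_length=4` mirrors the `len(word) > 3` filter used above.

```python
import re


def tokenize(text, stop_words, min_length=4):
    """Lowercase the text, split it into word tokens, and filter them."""
    # \w+ matches runs of letters, digits and underscores (Unicode-aware
    # for str patterns), so surrounding punctuation is dropped
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in stop_words and len(t) >= min_length]


sample_stop_words = {"with", "über", "nicht"}  # small made-up subset
sample = "Merkel, Macron: talks with Erdogan! Merkel spricht über Europa."
print(tokenize(sample, sample_stop_words))
```

Two design choices differ from the code above: tokens are lowercased, so "Merkel" and "merkel" are counted together, and hyphenated names such as Kramp-Karrenbauer are split into two tokens because the hyphen is not a word character.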

Let's plot the result as a word cloud.

Python3

word = WordCloud(max_font_size = 40).generate(rst)
plt.figure()
plt.imshow(word, interpolation ="bilinear")
plt.axis("off")
plt.show()

As the word cloud of article descriptions shows, the topic most associated with Merkel is her defense minister, Kramp-Karrenbauer; "Kanzlerin" simply means female chancellor. We can repeat the analysis using only the titles.

Python3

title_combined = ''
 
for i in response_json['articles']:
    # Skip articles whose title is missing (None)
    if i['title'] is not None:
        title_combined += i['title'] + ' '
     
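# Replace commas and periods with spaces; split() below already
# collapses any run of whitespace, so no regex is needed
titles = title_combined.replace(',', ' ').replace('.', ' ')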
words_t = titles.split()
result = [word for word in words_t if
          ( word.lower() not in bad_words and
           len(word) > 3) ]
 
result = ' '.join(result)
  
wordcount={}
 
for word in result.split():
     
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
 
word = WordCloud(max_font_size=40).generate(result)
plt.figure()
plt.imshow(word, interpolation="bilinear")
plt.axis("off")
plt.show()
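
Since the description and title passes differ only in the field they read, the combining step can be sketched as one small function. `combine_field` is a hypothetical helper, and the sample articles are hand-written; the only assumption about the data is that each article is a dict whose fields may be missing or None.

```python
def combine_field(articles, field):
    """Join the given field of every article, skipping missing values."""
    values = (article.get(field) for article in articles)
    return " ".join(value for value in values if value)


# Hand-written sample articles (not real API output)
sample_articles = [
    {"title": "First headline", "description": "First summary"},
    {"title": None, "description": "Second summary"},
    {"title": "Third headline"},
]
print(combine_field(sample_articles, "title"))
print(combine_field(sample_articles, "description"))
```

With the real response, `combine_field(response_json['articles'], 'title')` would produce the text for the title word cloud, and passing `'description'` would produce the text for the first one.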

From the titles, the topic most associated with Merkel is Erdogan, the Turkish president.
