Project Idea – Searching news from Old Newspaper using NLP

23 July 2024


Newspapers are a rich source of knowledge. When a person needs information about a particular topic, they usually search online, but it is difficult to find old articles from regional and local newspapers this way, because not every local newspaper provides an online search. In this article, we present an idea to overcome this problem.

What does the project do?

This project takes images or PDFs of pages from old regional newspapers as input for the database.
The text is extracted from the images using Pytesseract.
The extracted text is cleaned with NLP techniques to remove words (stop words) that are not helpful for searching.
The data is saved as key-value pairs, in which each key is an image path and each value holds the keywords found in that image.
Searching: When the user visits the website and types a topic or entity name in the search box, the matching newspaper images are loaded on the screen.

Why NLP?

Newspaper text contains many articles, prepositions, and other stop words that are not useful for searching, so NLP is used to remove them. It also helps to reduce the text to its unique words.

Technologies used:

NLTK
Python

Tools used:

Google Colab

Libraries used:

pytesseract: image to text.
NLTK: text pre-processing, filtering.
pandas: storing dataframe.

Use Case Diagram

Step-by-Step Implementation:

Libraries installation

First, install the required libraries on Colab.

Python3

!pip install nltk 
!pip install pytesseract 
  
!sudo apt install tesseract-ocr 
  
# to check if it installed properly 
# !which tesseract 
# pytesseract.pytesseract.tesseract_cmd = ( 
#     r'/usr/bin/tesseract' 
# )

Let’s import all the necessary libraries:

Python3

import os
import cv2
import pytesseract
# /usr/bin/tesseract
import pandas as pd
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
# IPython's Image displays an image file inline in the notebook
from IPython.display import Image
from google.colab.patches import cv2_imshow

# download the NLTK data used for pre-processing
nltk.download('popular')
nltk.download('stopwords')
nltk.download('wordnet')

pre function

This function cleans the text so that only important names and keywords remain. It removes stop words and duplicate words.

Python3

def pre(text):
    # lowercase the text and split it into word tokens
    text = text.lower()
    tokenizer = RegexpTokenizer(r'\w+')
    new_words = tokenizer.tokenize(text)
    # a set gives fast membership checks
    stop_words = set(stopwords.words("english"))

    # drop the stop words
    filtered_words = [w for w in new_words if w not in stop_words]

    # remove duplicates while keeping the first occurrence of each word
    unique = list(dict.fromkeys(filtered_words))
    res = ' '.join(unique)

    return res
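
To make the effect of this cleaning concrete, here is a minimal standard-library sketch of the same steps (lowercase, tokenize on `\w+`, drop stop words and duplicates). The small `STOP_WORDS` set and the name `clean_text` are assumptions for illustration only; the project itself relies on NLTK's full English stop word list.

```python
import re

# Tiny illustrative stop word set (an assumption); NLTK's English list is much larger.
STOP_WORDS = {"the", "a", "an", "in", "on", "of", "and", "is", "to", "was"}


def clean_text(text):
    """Lowercase, tokenize, drop stop words, and keep first occurrences only."""
    tokens = re.findall(r"\w+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    # dict.fromkeys preserves order while removing duplicates
    return " ".join(dict.fromkeys(kept))


sample = "The Minister visited the city, and the city welcomed the Minister."
print(clean_text(sample))  # minister visited city welcomed
```

The eleven words of the sample sentence shrink to four keywords, which is the kind of compact text stored for each image.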

to_df function

Given an image path as a parameter, this function extracts the text from the image with Pytesseract and stores it in the text variable. The text is then passed to pre(), and the function returns a dictionary with the filename and the important text.

Python3

def to_df(imgno):
    # run OCR on the image file at the given path
    text = pytesseract.image_to_string(imgno)
    # clean the raw OCR text
    out = pre(text)
    data = {'filename': imgno,
            'text': out}
    return data

Driver code

Here we define the dataframe that stores the dictionaries, each holding an image path and the text inside that image. We will use this dataframe for searching.

Python3

i=0
dff=pd.DataFrame()

Listing all images in the content folder.

Python3

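images = []
folder = "/content/"

for filename in os.listdir(folder):
    path = os.path.join(folder, filename)
    # cv2.imread returns None for files that are not readable images
    img = cv2.imread(path)

    if img is not None:
        print(filename)
        # keep the full path so later steps work from any working directory
        images.append(path)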
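
Decoding every file with OpenCV just to check whether it is an image can be slow for large folders. A lighter alternative, sketched here under the assumption that the scans use common image extensions, is to filter by file suffix. The helper name `list_images` and the extension set are illustrative choices, not part of the original project.

```python
from pathlib import Path

# Assumed set of image extensions; adjust it to match the actual scans.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def list_images(folder):
    """Return sorted full paths of files in `folder` that have an image extension."""
    return sorted(
        str(p)
        for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

In the notebook this could be called as `images = list_images("/content/")`. The trade-off is that a file with a wrong or missing extension is skipped or accepted without being verified, which reading the file would have caught.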

Getting all images

A for loop runs OCR on every news image from the folder and adds the result to the dataframe.

Python3

for u in images:
    i += 1
    data = to_df(u)
    # DataFrame.append was removed in pandas 2.0, so use pd.concat instead
    dff = pd.concat([dff, pd.DataFrame(data, index=[i])])

print(dff)

dataframe

Processing the images

Python3

# sample text output after processing image 
dff.iloc[0]['text']

sample text after preprocessing

Saving the dataframe

The dataframe is saved to a CSV file, which serves as a simple database.

Python3

# saving the dataframe 
dff.to_csv('save newsdf.csv') 

saved Dataframe

Searching

Open the dataframe file from storage.

Python3

data = pd.read_csv('/content/save newsdf.csv') 
data

open dataframe from storage

We provide a search string and use str.find to locate the rows, and so the images, in which the keyword is present.

Python3

txt = 'modi'
# position of the keyword in each row's text, or -1 if it is absent
index = data['text'].str.find(txt)
index

Rows with a value other than -1 are the images that contain the word ‘modi’.
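
Note that a substring search also matches longer words, so a query such as ‘modi’ would match a row containing ‘modified’. Because the stored text is a string of space-separated lowercase keywords, matching whole words is straightforward. The following is a sketch in plain Python over rows shaped like the saved CSV; the function name `search_rows` and the sample rows are hypothetical.

```python
def search_rows(rows, keyword):
    """Return filenames whose keyword text contains `keyword` as a whole word.

    `rows` is a list of dicts with 'filename' and 'text' keys, where 'text'
    holds lowercase, space-separated keywords.
    """
    keyword = keyword.lower().strip()
    matches = []
    for row in rows:
        # Treat missing or empty text as having no keywords
        words = str(row.get("text") or "").split()
        if keyword and keyword in words:
            matches.append(row["filename"])
    return matches


# Hypothetical rows standing in for the dataframe contents
rows = [
    {"filename": "page1.png", "text": "modi visits city election"},
    {"filename": "page2.png", "text": "council modified budget"},
    {"filename": "page3.png", "text": ""},
]
print(search_rows(rows, "Modi"))  # ['page1.png']
```

The query is lowercased before matching because the stored keywords are lowercase.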

Showing the result

Python3

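# we are showing the first result here
a = []  # row positions whose text contains the keyword
for i in range(len(index)):

    if index[i] != -1:
        a.append(i)

if a:
    res = data.iloc[a[0]]['filename']
    # display() is available by default in Colab and Jupyter notebooks
    display(Image(res))
else:
    print("no file")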

Result of the project

We searched for the word ‘modi’. The first newspaper image that contains the searched word is shown here.

Scope for Improvement

We could use a dedicated search backend, such as Lucene or Elasticsearch, to make the search faster and more efficient. For the time being, we use the pandas library to get the path of the image to display to the user.
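
Short of adopting a full search engine, one step in that direction is an inverted index, which maps each keyword to the images containing it so that a lookup does not scan every row. This is a minimal in-memory sketch; `build_index`, `lookup`, and the sample records are assumptions for illustration, and it does not describe how any particular search engine is implemented internally.

```python
from collections import defaultdict


def build_index(records):
    """Map each keyword to the set of filenames whose text contains it."""
    index = defaultdict(set)
    for filename, text in records:
        for word in text.split():
            index[word].add(filename)
    return index


def lookup(index, query):
    """Return sorted filenames that contain every word in `query`."""
    words = query.lower().split()
    if not words:
        return []
    # Intersect the filename sets of all query words
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical (filename, keywords) records standing in for the dataframe rows
records = [
    ("page1.png", "modi visits city election"),
    ("page2.png", "city council budget"),
]
index = build_index(records)
print(lookup(index, "city"))         # ['page1.png', 'page2.png']
print(lookup(index, "city budget"))  # ['page2.png']
```

The index is built once when images are added, and each query then costs a few dictionary lookups instead of a pass over all stored text.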

Project Application in Real-Life

Voter Helper: Elections are coming, and Aman is a voter who does not know much about the politician he is going to vote for. He opens our local news.search.in and searches for the politician. The website shows the number of articles related to the searched person from national newspapers as well as from regional and local newspapers. Now he is ready to decide his vote.
Student Research Made Easy: It can be useful for students researching a topic, as it returns every article related to the topic from all newspapers in image form, so they can make notes from them.
Search Engine for Newspaper Companies: Press companies can use it to add a search feature to their websites.
