iNLTK: Natural Language Toolkit for Indic Languages in Python

26 July 2024


Most of us are familiar with NLTK (Natural Language Toolkit), the popular NLP library used for a wide range of NLP tasks. NLTK, however, is geared mainly toward English and offers little support for Indian languages. In this article, we explore iNLTK, a Natural Language Toolkit for Indic languages. As the name suggests, iNLTK is a Python library for performing NLP operations on text in Indian languages.

Languages Available in iNLTK

iNLTK covers many of the most widely spoken Indian languages. The following table lists the available languages and their codes:

Language	Code
Hindi	hi
Punjabi	pa
Sanskrit	sa
Gujarati	gu
Kannada	kn
Malayalam	ml
Nepali	ne
Odia	or
Marathi	mr
Bengali	bn
Tamil	ta
Urdu	ur
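
Since every iNLTK function shown later takes one of these codes as a string, it can help to keep the table in code. The following is a minimal sketch; the dictionary simply mirrors the table above, and `language_code` is a hypothetical helper name, not part of iNLTK.

```python
# Language names and codes copied from the table above.
INLTK_LANGUAGE_CODES = {
    "Hindi": "hi", "Punjabi": "pa", "Sanskrit": "sa", "Gujarati": "gu",
    "Kannada": "kn", "Malayalam": "ml", "Nepali": "ne", "Odia": "or",
    "Marathi": "mr", "Bengali": "bn", "Tamil": "ta", "Urdu": "ur",
}


def language_code(name):
    """Return the code for a language name, ignoring case and outer spaces.

    Hypothetical helper for illustration; it is not an iNLTK function.
    """
    wanted = name.strip().lower()
    for language, code in INLTK_LANGUAGE_CODES.items():
        if language.lower() == wanted:
            return code
    raise ValueError(f"{name!r} is not in the language table")


print(language_code("bengali"))  # bn
```

Looking the code up this way fails early with a clear error if a language is not in the table, rather than passing an unknown code to the library.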

Installation

iNLTK can be easily installed using pip as follows:

!pip install inltk

iNLTK depends on PyTorch 1.3.1, which can be installed as follows:

For terminal: pip install torch==1.3.1+cpu -f https://download.pytorch.org/whl/torch_stable.html

For Jupyter: !pip install torch==1.3.1 -f https://download.pytorch.org/whl/torch_stable.html

Import and Initial Setup

When you use a language for the first time in a system or environment, you need to run the language setup, which downloads the models for that language. This is a one-time step; no setup is needed on subsequent runs. In the example below we set up Hindi (hi) and Bengali (bn), the two languages used in this article. You can set up any other language from the list of available languages in the same way:

Python3

# Set up Hindi and Bengali (one-time model download per language)
from inltk.inltk import setup
setup('hi')
setup('bn')

# To run on Google Colab:
# !python -c "from inltk.inltk import setup; setup('hi'); setup('bn')"

Performing basic NLP tasks using iNLTK

Now, let us perform some of the basic NLP tasks in Indian Languages using iNLTK. The tasks that we will be performing are as follows:

Tokenization
Text Embedding Generation
Next Word Prediction
Similar Sentence Generation
Checking Sentence Similarity

Tokenization

Tokenization refers to breaking a sentence into smaller units called tokens. It is one of the essential steps in text pre-processing. For this, iNLTK offers a function tokenize(text, language_code), which takes the input text and its language code as arguments.

Example:

We tokenize the sentence ‘गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।’ (the Hindi translation of ‘GeeksForGeeks is a great technology learning platform.’).

Python3

from inltk.inltk import tokenize
 
text = 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।'
tokenize(text, 'hi')

Output:

['▁गी',
'क्स',
'▁फॉर',
'▁गी',
'क्स',
'▁एक',
'▁बेहतरीन',
'▁टेक्नोलॉजी',
'▁ल',
'र्न',
'िंग',
'▁प्लेटफॉर्म',
'▁है',
'।']

Hence, we have tokenized a sentence using iNLTK.
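
Note that the output consists of subword pieces rather than whole words: in the list above, a leading ‘▁’ marks a piece that starts a new word, and pieces without it continue the previous word. As a minimal sketch under that assumption (read off the output shown here, not taken from iNLTK's documentation), the pieces can be merged back into words. `merge_subword_tokens` is a hypothetical helper name.

```python
def merge_subword_tokens(tokens, marker="▁"):
    """Rebuild whole words from subword pieces.

    A piece that starts with `marker` begins a new word; any other piece
    is appended to the word currently being built.
    """
    words = []
    for token in tokens:
        if token.startswith(marker):
            words.append(token[len(marker):])
        elif words:
            words[-1] += token
        else:
            # First piece has no marker: treat it as the start of a word.
            words.append(token)
    return words


# Tokens copied from the output above.
tokens = ['▁गी', 'क्स', '▁फॉर', '▁गी', 'क्स', '▁एक', '▁बेहतरीन',
          '▁टेक्नोलॉजी', '▁ल', 'र्न', 'िंग', '▁प्लेटफॉर्म', '▁है', '।']
print(" ".join(merge_subword_tokens(tokens)))
```

Run on the 14 tokens above, this yields the nine whitespace-separated words of the original sentence, with the final ‘।’ attached to the last word.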

Text Embedding Generation

In NLP, text embeddings are vectorized (numeric) representations of text. Machine learning and deep learning models operate on numbers, so raw text has to be converted to embeddings before it can be fed to them. This can be done using iNLTK’s get_embedding_vectors(text, language_code), which takes the input text and its language code as arguments and, as the output below shows, returns one vector per token.

Example:

We generate text embeddings for the same sentence ‘गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।’ (the Hindi translation of ‘GeeksForGeeks is a great technology learning platform.’).

Python3

from inltk.inltk import get_embedding_vectors
from warnings import filterwarnings
from IPython.display import display
filterwarnings("ignore")
 
text = 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।'
vectors = get_embedding_vectors(text, 'hi')
display(vectors)

Output:

[array([-0.737411, 0.203377, 0.005537, -0.468718, …, 0.110487, 0.325836, 0.64981 , 0.463476], dtype=float32),

array([-0.012183, -0.036214, -0.412297, -0.546257, …, 0.094262, 0.0921 , 1.359242, -0.505965], dtype=float32),

array([ 0.021317, -0.130494, -0.248163, -0.203298, …, 0.064852, 0.230874, -0.315259, 0.368123], dtype=float32),

array([-0.737411, 0.203377, 0.005537, -0.468718, …, 0.110487, 0.325836, 0.64981 , 0.463476], dtype=float32),

array([-0.012183, -0.036214, -0.412297, -0.546257, …, 0.094262, 0.0921 , 1.359242, -0.505965], dtype=float32),

array([ 0.526271, -0.111786, 0.024964, -0.413432, …, -0.269101, 0.14501 , 0.139528, 0.036384], dtype=float32),

array([ 0.231323, -0.129719, -0.120698, -0.229107, …, -0.207799, -0.144117, 1.09991 , 0.544219], dtype=float32),

array([ 0.408419, 0.320988, -0.380744, -0.563505, …, -0.254394, -0.200471, 0.201553, -0.074097], dtype=float32),

array([-0.307099, -0.186613, 0.040754, -0.271758, …, 0.477781, 0.759681, 0.485825, 0.222599], dtype=float32),

array([-0.0195 , -0.056414, 0.155854, -0.955072, …, 0.127837, -0.161846, 0.381132, -0.233802], dtype=float32),

array([-0.063136, -0.16291 , -0.412124, -0.580033, …, -0.468475, 0.246613, 0.661614, 0.354779], dtype=float32),

array([-0.182706, -0.237699, 0.478908, -0.567147, …, 0.694749, 0.526647, 0.650397, 0.172727], dtype=float32),

array([-0.183833, -0.005238, -0.187345, -0.113823, …, 0.062584, -1.36463 , 0.665604, -1.425032], dtype=float32),

array([ 0.792413, 0.01189 , -0.71231 , -0.313467, …, 0.190676, 0.938687, 0.464781, 0.195361], dtype=float32)]

Thus, we have generated embeddings for Hindi text using iNLTK.
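
The output contains 14 arrays, one for each of the 14 tokens produced in the tokenization example. When a single vector for the whole sentence is needed, one common approach is to average the token vectors (mean pooling). The sketch below illustrates that idea in plain Python; it is an assumption about how you might post-process the result, not something iNLTK does for you, and the toy values are made up.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors element by element."""
    if len(vectors) == 0:
        raise ValueError("need at least one vector")
    dimension = len(vectors[0])
    if any(len(vector) != dimension for vector in vectors):
        raise ValueError("all vectors must have the same length")
    return [
        sum(float(vector[i]) for vector in vectors) / len(vectors)
        for i in range(dimension)
    ]


# Toy 3-dimensional stand-ins for the per-token arrays (made-up values).
toy_vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
print(mean_pool(toy_vectors))  # [2.0, 1.0, 1.0]
```

The function only indexes into each vector and takes its length, so it should also accept the list of arrays shown in the output above.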

Next Word Prediction

Here we supply a few initial words and let the model predict the words that follow. iNLTK provides a function predict_next_words(text, n, language_code), which takes the input text, the number of words to predict (n), and the language code as arguments.

Example:

Predict the next words for the phrase ‘गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी’ (the Hindi translation of ‘GeeksForGeeks is a great technology’).

Python3

from inltk.inltk import predict_next_words
 
text = 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी'
n=3
predict_next_words(text , n, 'hi') 

Output:

Here, we have predicted the next 3 words (n = 3) for the given Hindi phrase.

Note: The output may change each time you run the command.

गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी वाला ग्रन्थ है|

Similar Sentence Generation

A common NLP task is generating sentences similar to a given input sentence, for example to augment a small dataset. iNLTK’s get_similar_sentences(text, n, language_code) does exactly this. It takes the input text, the number of sentences to generate (n), and the language code as arguments.

Example:

We generate similar sentences for the sentence ‘गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।’ (the Hindi translation of ‘GeeksForGeeks is a great technology learning platform.’).

Python3

from inltk.inltk import get_similar_sentences
 
text = 'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।'
n=5
get_similar_sentences(text, n, 'hi')

Output:

['गीक्स फॉर गीक्स एक सर्वोत्कृष्ट टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।',
'गीक्स फॉर टेलिफोनक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।',
'गीक्स फॉर दुष्यन्तक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।',
'तम्बूक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग प्लेटफॉर्म है।',
'गीक्स फॉर गीक्स एक बेहतरीन टेक्नोलॉजी लर्निंग स्कीम है।']
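
Each generated sentence in the output above differs from the input in a single word. To inspect such variations programmatically, a small word-by-word comparison can help. This is a minimal sketch with a hypothetical helper name, `changed_words`; it assumes both sentences have the same number of whitespace-separated words, as in the output shown here, and uses English stand-in sentences for readability.

```python
def changed_words(original, variant):
    """Return (position, old_word, new_word) for each word that differs.

    Assumes both sentences split into the same number of words; any extra
    words in the longer sentence are ignored by zip().
    """
    pairs = zip(original.split(), variant.split())
    return [(i, old, new) for i, (old, new) in enumerate(pairs) if old != new]


# English stand-ins for an input sentence and one generated variant.
original = "a great technology learning platform"
variant = "a great technology learning scheme"
print(changed_words(original, variant))  # [(4, 'platform', 'scheme')]
```

The same function can be applied to the Hindi input and each sentence in the returned list, since it only relies on whitespace splitting.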

Checking Sentence Similarity

We can also measure the similarity between two sentences using iNLTK. This is done with the get_sentence_similarity(text1, text2, language_code) function, which takes the two texts to be compared and the language code as arguments.

Example:

We check the similarity between the sentences ‘Geeks For Geeks হল একটি দুর্দান্ত প্রযুক্তি শেখার প্ল্যাটফর্ম।’ and ‘Geeks For Geeks হল একটি দুর্দান্ত কম্পিউটার বিজ্ঞান শেখার প্ল্যাটফর্ম।’ (the Bengali translations of ‘GeeksForGeeks is a great technology learning platform.’ and ‘GeeksForGeeks is a great computer science learning platform.’ respectively).

Python3

from inltk.inltk import get_sentence_similarity
 
text1 = 'Geeks For Geeks হল একটি দুর্দান্ত প্রযুক্তি শেখার প্ল্যাটফর্ম।'
text2 = 'Geeks For Geeks হল একটি দুর্দান্ত কম্পিউটার বিজ্ঞান শেখার প্ল্যাটফর্ম।'
get_sentence_similarity(text1, text2, 'bn')

Output:

We can see that the similarity score of the two sentences is quite high, as expected.

0.8634665608406067
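
The article does not state how this score is computed. A widely used measure for comparing two embedding vectors is cosine similarity, which is 1.0 for vectors pointing in the same direction and 0.0 for orthogonal ones. The sketch below shows that measure in plain Python as an illustration only; whether iNLTK uses exactly this internally is an assumption, and the example vectors are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Under this reading, a score of about 0.86 would mean the two sentence representations point in largely the same direction, which fits two sentences that differ only in a couple of words.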
