Top K Nearest Words using Edit Distances in NLP

20 July 2024

1

We can find the Top K nearest matching words to the given query input word by using the concept of Edit/Levenstein distance. If any word is the same word as the query input(word) then their Edit distance would be zero and is a perfect match and so on. So we can find the Edit distances between the query word and the words present in our vocabulary pairwise and then sort them according to their edit distances. This will help us arrange the words in the vocabulary in the order of their nearest match to the query. This is very useful to solve Information Retrieval tasks.

Code implementations

If we have a vocabulary of words, then we can find the K nearest matching words to the query with the help of Edit distance. Here Vocabulary can be a List of Strings.

Steps:

Create a function to find the edit distance between two words.
Create a second function to find the k nearest word. It uses the edit distance function to measure the distance between two words.
For the vocabulary, we are using some text and uses text.split() command to store as the list of words.
In the k nearest word-defined function input the query or word, vocabulary, and the number of the nearest word we want to find.

Python3

# code for Edit Distance 
def edit_Distance(s1, s2): 
    r, c = len(s1), len(s2) 
    dp = [[-1 for j in range(0, c+1)]for i in range(0, r+1)] 
    for i in range(0, r+1): 
        dp[i][0] = i 
    for j in range(0, c+1): 
        dp[0][j] = j 
    for i in range(1, r+1): 
        for j in range(1, c+1): 
            if(s1[i-1] == s2[j-1]): 
                dp[i][j] = dp[i-1][j-1] 
            else: 
                dp[i][j] = 1+min(dp[i-1][j], dp[i][j-1], dp[i-1][j-1]) 
    return dp[r] 
  
  
# code for finding the Top k matching words 
# here s1 is the query input 
def k_nearest_word(s1, vocabulary, k): 
    k_nearest_word = [] 
    word_dist = {} 
    for word in vocabulary: 
        word_dist[word] = edit_Distance(s1, word) 
    word_dist = dict(sorted(word_dist.items(), key=lambda item: item[1])) 
    ans = [] 
    li = list(word_dist.keys()) 
    for i in range(0, k): 
        ans.append(li[i]) 
    return ans 
  
  
# for Testing purposes 
text = ''' 
Pytorch is an open-source deep learning framework available with a Python and C++ interface.  
it resides inside the torch module.  
the data that has to be processed is input in the form of a tensor. 
that allows you to develop and train deep learning models.  
However, as the size and complexity of your models grow,  
the time it takes to train them can become prohibitive. 
  
'''
# vocabulary=['This','is','salt','saw','Bat','salty','salts'] 
vocabulary = text.split() 
query = 'Pytorch'
k = 3  # top k nearest matching 
li = k_nearest_word(query, vocabulary, k) 
print(li) 

Output:

['Pytorch', 'torch', 'Python']

Using Levenshtein library

Steps:

Import the Levenshtein library. We will use it to find the distance between the two strings.
Create a second function to find the k nearest word. It uses the Levenshtein.distance function to measure the distance between two words.
For the vocabulary, we are using some text and uses text.split() command to store as the list of words.
In the k nearest word-defined function input the query or word, vocabulary, and the number of the nearest word we want to find.

Python3

import Levenshtein 
  
def find_top_k_nearest_words(query_input, words_list, k): 
    distances = [] 
    for word in words_list: 
        distance = Levenshtein.distance(query_input, word) 
        distances.append((word, distance)) 
    sorted_distances = sorted(distances, key=lambda x: x[1]) 
    return [d[0] for d in sorted_distances[:k]] 
  
nearest_words = find_top_k_nearest_words(query, vocabulary, k) 
print(nearest_words)

Output:

['Pytorch', 'torch', 'Python']

Top K Nearest Words using Edit Distances in NLP

Code implementations

Steps:

Python3

Using Levenshtein library

Steps:

Python3

Java Program for Longest Common Subsequence

Maximum height of Tree when any Node can be considered as Root

Print Fibonacci sequence using 2 variables

LEAVE A REPLY Cancel reply

Most Popular

How to Secure Your Network-Attached Storage (NAS) in 2024 by Tyler Cross

8 Best Private Search Engines in 2024: Tested by Experts by Tyler Cross

The biggest comeback in tech history [Video]

Google wants to hear your thoughts on the Android 15 QPR2 Beta

Recent Comments

EDITOR PICKS

How to Secure Your Network-Attached Storage (NAS) in 2024 by Tyler Cross

8 Best Private Search Engines in 2024: Tested by Experts by Tyler Cross

The biggest comeback in tech history [Video]

POPULAR POSTS

How to Secure Your Network-Attached Storage (NAS) in 2024 by Tyler Cross

8 Best Private Search Engines in 2024: Tested by Experts by Tyler Cross

The biggest comeback in tech history [Video]

POPULAR CATEGORY

ABOUT US

FOLLOW US