Saturday, November 23, 2024

Pre-trained Word Embedding Using GloVe in NLP Models

This article shows how to use pre-trained GloVe word embeddings in NLP models with Python.

Word embedding

NLP models work with text that humans can read and understand, but a machine only understands numbers. Word embedding is a technique that converts each word into an equivalent vector of floats. Various techniques exist, and the right one depends on the model's use case and the dataset. Some of them are One Hot Encoding, TF-IDF, Word2Vec and FastText.

Example: 

'the': [-0.123, 0.353, 0.652, -0.232]
'the' is a very frequently used word in texts of any kind.
Its equivalent 4-dimensional dense vector is given above.
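
To make the contrast with one-hot encoding concrete, here is a minimal sketch. The three-word vocabulary is made up for illustration, and the dense values are the illustrative numbers from the example above, not real GloVe weights.

```python
# Toy vocabulary, assumed for illustration only.
vocab = ["the", "cat", "sat"]


def one_hot(word, vocab):
    """Return a sparse vector with a single 1 at the word's position."""
    return [1 if w == word else 0 for w in vocab]


# Illustrative dense vector (not real GloVe values).
dense = {"the": [-0.123, 0.353, 0.652, -0.232]}

print(one_hot("the", vocab))  # length grows with the vocabulary size
print(dense["the"])           # length is fixed by the embedding dimension
```

A one-hot vector needs one position per vocabulary word, while a dense vector keeps the same length however large the vocabulary becomes.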

GloVe Data

GloVe stands for Global Vectors for Word Representation and was created at Stanford University. The glove.6B release used in this article was trained on a corpus of about 6 billion tokens. It provides pre-trained dense vectors for a vocabulary of about 400,000 lowercase words, along with many general-use characters such as commas, braces, and semicolons.

GloVe is available in four varieties: 50d, 100d, 200d and 300d.

Here d stands for dimension: in the 100d file, each word has an equivalent vector of size 100. GloVe files are plain text files that work like a dictionary. Each line holds a word (the key) followed by the space-separated numbers of its dense vector (the value).
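
As a rough sketch of how such a file can be read, each line can be split into the word and its numbers. The sample line below is hypothetical (shortened to 4 values, not copied from a real GloVe file), and `parse_glove_line` is a helper name introduced here for illustration.

```python
def parse_glove_line(line):
    """Split one line of a GloVe text file into (word, vector)."""
    word, *values = line.split()
    return word, [float(v) for v in values]


# Hypothetical 4-value line; a real 100d file has 100 numbers per line.
sample = "the -0.123 0.353 0.652 -0.232"
word, vector = parse_glove_line(sample)
print(word, vector)
```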

Create Vocabulary Dictionary

The vocabulary is the collection of all unique words present in the training dataset. First, the dataset is tokenized into words, and the frequency of each word is counted. The words are then sorted in decreasing order of frequency, so high-frequency words are placed at the beginning of the dictionary.

Dataset = {The peon is ringing the bell}
Vocabulary = {'the': 2, 'peon': 1, 'is': 1, 'ringing': 1, 'bell': 1}
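
The counting and sorting described above can be sketched with the standard library alone. This is an illustration of the idea, not the Keras `Tokenizer` internals; lowercasing and whitespace splitting are simplifying assumptions.

```python
from collections import Counter

sentence = "The peon is ringing the bell"

# Simplifying assumptions: lowercase everything and split on whitespace.
tokens = sentence.lower().split()
counts = Counter(tokens)

# most_common() sorts by decreasing frequency; ties keep first-seen order.
# Indices start at 1 so that index 0 stays reserved.
word_index = {
    word: i for i, (word, _) in enumerate(counts.most_common(), start=1)
}

print(counts)
print(word_index)
```

The most frequent word receives the smallest index, which matches the ordering rule described for the dictionary.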

Algorithm for word embedding:

  • Preprocess the text data.
  • Create the dictionary.
  • Traverse the GloVe file of a specific dimension and look up each of its words in the dictionary.
  • If a match occurs, copy the word's vector from the GloVe file into embedding_matrix at the index the dictionary assigns to that word.

Below is the implementation:

Python3




# Code for GloVe word embedding
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np

# A list keeps the word order (and so the indices) reproducible
x = ['text', 'the', 'leader', 'prime',
     'natural', 'language']

# Create the dictionary: maps each word to an
# integer index 1, 2, 3, ...
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x)

# Number of unique words in the dictionary
print("Number of unique words in dictionary =",
      len(tokenizer.word_index))
print("Dictionary is =", tokenizer.word_index)

# Download GloVe and unzip it in the notebook.
# !unzip glove*.zip


def embedding_for_vocab(filepath, word_index,
                        embedding_dim):
    # Add 1 because index 0 is reserved
    vocab_size = len(word_index) + 1

    # Row i holds the dense vector of the word with index i;
    # words not found in the GloVe file keep a zero vector
    embedding_matrix_vocab = np.zeros((vocab_size,
                                       embedding_dim))

    with open(filepath, encoding="utf8") as f:
        for line in f:
            # Each line is a word followed by its vector
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix_vocab[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix_vocab


# Embedding matrix for the vocabulary in word_index
embedding_dim = 50
embedding_matrix_vocab = embedding_for_vocab(
    '../glove.6B.50d.txt', tokenizer.word_index,
    embedding_dim)

print("Dense vector for first word is => ",
      embedding_matrix_vocab[1])


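Output:

The script prints the number of unique words (6), the word-to-index dictionary, and the 50-dimensional GloVe vector stored at index 1.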
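
One practical follow-up is checking how many vocabulary words were actually found, since words missing from the GloVe file keep an all-zero row in the matrix. The sketch below uses a small made-up matrix in place of the real one so that it runs without the GloVe download; `word_index` and all values in it are assumptions for illustration.

```python
import numpy as np

# Made-up stand-ins for tokenizer.word_index and the embedding matrix.
word_index = {"the": 1, "leader": 2, "zzyzx": 3}
embedding_matrix = np.zeros((len(word_index) + 1, 4))
embedding_matrix[1] = [0.1, 0.2, 0.3, 0.4]
embedding_matrix[2] = [0.5, 0.1, 0.0, 0.2]
# Row 3 stays zero, as it would for a word absent from the GloVe file.

# A row with no non-zero entry means no pre-trained vector was copied in.
missing = [w for w, i in word_index.items() if not embedding_matrix[i].any()]
coverage = 1 - len(missing) / len(word_index)

print("Missing words:", missing)
print(f"Coverage: {coverage:.2f}")
```

Low coverage usually suggests that the preprocessing (casing, punctuation handling) does not match how the GloVe vocabulary was built.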
