Pre-trained Word Embedding using GloVe in NLP models

27 July 2024

In this article, we will see how to use pre-trained GloVe word embeddings in NLP models with Python.

Word embedding

In NLP models, we work with text that humans can read and understand. A machine, however, does not understand text; it only works with numbers. Word embedding is a technique that converts each word into an equivalent vector of floating-point numbers. Various techniques exist, and the choice depends on the use case of the model and the dataset. Some of them are One Hot Encoding, TF-IDF, Word2Vec and FastText.

Example:

'the': [-0.123, 0.353, 0.652, -0.232]
'the' is a very frequently used word in texts of any kind.
Its equivalent 4-dimensional dense vector is given above.
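
To make the difference between a sparse and a dense representation concrete, here is a minimal sketch. The toy vocabulary, the helper names `one_hot` and `dense`, and the random values in `embedding_table` are all assumptions made for illustration; they are not trained vectors.

```python
import numpy as np

# Toy vocabulary; the words and their order are illustrative only.
vocab = ["the", "cat", "sat", "mat"]
word_to_idx = {w: i for i, w in enumerate(vocab)}


def one_hot(word):
    """Sparse vector as long as the vocabulary, with a single 1."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    vec[word_to_idx[word]] = 1.0
    return vec


# Dense table: one row per word. The values are random placeholders,
# standing in for vectors that a real model would have learned.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4)).astype(np.float32)


def dense(word):
    """Look up the dense row for a word."""
    return embedding_table[word_to_idx[word]]


print(one_hot("cat"))      # length grows with the vocabulary size
print(dense("cat").shape)  # length is the chosen embedding dimension
```

The one-hot vector grows with the vocabulary, while the dense vector keeps a fixed size regardless of how many words there are.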

GloVe Data

GloVe stands for Global Vectors for Word Representation. It was created at Stanford University. The glove.6B release used in this article was trained on a corpus of about 6 billion tokens and provides pre-trained dense vectors for a vocabulary of about 400,000 lowercase words, along with common punctuation such as commas, braces, and semicolons.

The glove.6B files come in four varieties: 50d, 100d, 200d and 300d.

Here d stands for dimension. 100d means that each word in the file has an equivalent vector of size 100. GloVe files are plain text files: each line holds a word followed by the values of its dense vector, separated by spaces. A file can therefore be read like a dictionary, with words as keys and dense vectors as values.
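
As a rough illustration of the file layout, the sketch below parses a single line into a word and its vector. The `sample_line` values are made up and only 4 numbers long, and `parse_glove_line` is a hypothetical helper name, not part of any library.

```python
import numpy as np

# A made-up line in the GloVe text layout: the token first,
# then its vector values separated by single spaces.
sample_line = "the 0.418 0.24968 -0.41242 0.1217\n"


def parse_glove_line(line):
    """Split one line into (word, vector)."""
    word, *values = line.rstrip().split(" ")
    return word, np.asarray(values, dtype=np.float32)


word, vector = parse_glove_line(sample_line)
print(word, vector.shape)
```

In a real 100d file the same call would return a vector of shape (100,).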

Create Vocabulary Dictionary

The vocabulary is the collection of all unique words present in the training dataset. First, the dataset is tokenized into words, then the frequency of each word is counted. The words are then sorted in decreasing order of frequency, so that high-frequency words are placed at the beginning of the dictionary.

Dataset = {The peon is ringing the bell}
Vocabulary = {'the': 2, 'peon': 1, 'is': 1, 'ringing': 1, 'bell': 1}
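
The counting and sorting steps can be sketched with the standard library alone. This is a simplified stand-in, assuming whitespace tokenization and lowercasing; `build_vocab` is a hypothetical helper, and a real tokenizer would also handle punctuation.

```python
from collections import Counter


def build_vocab(texts):
    """Count word frequencies and assign indices by decreasing frequency."""
    counts = Counter()
    for text in texts:
        # Simplification: lowercase and split on whitespace only.
        counts.update(text.lower().split())
    # most_common() sorts by decreasing count; ties keep first-seen order.
    # Indices start at 1 so that 0 can stay reserved (e.g. for padding).
    word_index = {
        word: idx
        for idx, (word, _) in enumerate(counts.most_common(), start=1)
    }
    return word_index, counts


word_index, counts = build_vocab(["The peon is ringing the bell"])
print(counts)
print(word_index)
```

The most frequent word, 'the', receives index 1, and the remaining words follow in the order they first appear.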

Algorithm for word embedding:

Preprocess the text data.
Create the dictionary.
Traverse the GloVe file of a specific dimension and check whether each word is present in the dictionary.
If a match occurs, copy the equivalent vector from the GloVe file into embedding_matrix at the corresponding index.

Below is the implementation:

Python3

# code for GloVe word embedding
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np
  
# sample texts; a list (not a set) keeps the word order stable across runs
x = ['text', 'the', 'leader', 'prime',
     'natural', 'language']
  
# create the dict.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x)
  
# number of unique words in dict.
print("Number of unique words in dictionary=", 
      len(tokenizer.word_index))
print("Dictionary is = ", tokenizer.word_index)
  
# download glove and unzip it in Notebook.
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip glove*.zip
  
# vocab: 'the': 1, mapping of words with
# integers in seq. 1,2,3..
# embedding: 1->dense vector
def embedding_for_vocab(filepath, word_index,
                        embedding_dim):
    # add 1 because index 0 is reserved (word indices start at 1)
    vocab_size = len(word_index) + 1

    # rows for words not found in the GloVe file stay all zeros
    embedding_matrix_vocab = np.zeros((vocab_size,
                                       embedding_dim))
  
    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix_vocab[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]
  
    return embedding_matrix_vocab
  
  
# matrix for vocab: word_index
embedding_dim = 50
embedding_matrix_vocab = embedding_for_vocab(
    '../glove.6B.50d.txt', tokenizer.word_index,
    embedding_dim)
  
print("Dense vector for first word is => ",
      embedding_matrix_vocab[1])
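
Since rows for words that never matched a GloVe entry remain all zeros, it can be useful to check how much of the vocabulary was actually covered. The sketch below assumes that convention; `find_missing_words` is a hypothetical helper, and the toy dictionary and matrix are made-up stand-ins for `tokenizer.word_index` and `embedding_matrix_vocab`.

```python
import numpy as np


def find_missing_words(word_index, embedding_matrix):
    """Return the words whose row is still all zeros (never filled)."""
    return [
        word
        for word, idx in word_index.items()
        if not np.any(embedding_matrix[idx])
    ]


# Toy stand-ins with made-up values; row 0 is the reserved index.
toy_word_index = {"the": 1, "language": 2, "zzzqx": 3}
toy_matrix = np.zeros((4, 3))
toy_matrix[1] = [0.1, 0.2, 0.3]
toy_matrix[2] = [0.4, 0.5, 0.6]

missing = find_missing_words(toy_word_index, toy_matrix)
coverage = 1 - len(missing) / len(toy_word_index)
print("Missing words:", missing)
print("Coverage:", round(coverage, 2))
```

One caveat of this check: a word whose genuine vector happened to be all zeros would also be reported as missing, although that is unlikely in practice.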
