In this article, we will see how to use pre-trained GloVe word embeddings in NLP models using Python.
Word embedding
In NLP models, we deal with texts that are human-readable and understandable, but a machine doesn't understand text, only numbers. Word embedding is therefore the technique of converting each word into an equivalent float vector. Various techniques exist depending on the use-case of the model and the dataset; some of them are One-Hot Encoding, TF-IDF, Word2Vec, and FastText.
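To make the idea concrete, here is a minimal sketch of the simplest of these schemes, one-hot encoding, using a made-up toy vocabulary (the words and vocabulary below are assumptions for illustration only):

```python
# One-hot encoding sketch: each word maps to a vector that is all zeros
# except for a single 1.0 at that word's index in the vocabulary.
vocab = ["the", "cat", "sat"]  # hypothetical toy vocabulary

def one_hot(word, vocab):
    vec = [0.0] * len(vocab)
    vec[vocab.index(word)] = 1.0
    return vec

print(one_hot("cat", vocab))  # [0.0, 1.0, 0.0]
```

Unlike the dense GloVe vectors discussed below, these vectors are sparse and carry no notion of similarity between words, which is why dense embeddings are preferred.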
Example:
'the': [-0.123, 0.353, 0.652, -0.232]

'the' is a very frequently used word in texts of any kind; above, its equivalent 4-dimensional dense vector is shown.
Glove Data
GloVe stands for Global Vectors. It was created at Stanford University. The widely used GloVe "6B" release was trained on a corpus of 6 billion tokens of English text and provides pre-trained dense vectors for a vocabulary of 400,000 words, along with many other general-use tokens such as commas, braces, and semicolons.
There are 4 varieties available in glove.6B: 50d, 100d, 200d, and 300d.
Here d stands for dimension; 100d means each word in that file has an equivalent vector of size 100. GloVe files are plain text files that work like a dictionary: each line starts with a word (the key), followed by the components of its dense vector (the value), separated by spaces.
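The line format described above can be parsed with ordinary string splitting. The sample line below is made up for illustration, not copied from an actual GloVe file:

```python
# A GloVe line looks like: "word v1 v2 ... vd" (space-separated).
# The numbers here are invented to show the parsing, not real GloVe values.
line = "the -0.038 0.572 0.241 -0.162"

word, *values = line.split()            # first token is the word
vector = [float(v) for v in values]     # the rest are the vector components

print(word, vector)  # the [-0.038, 0.572, 0.241, -0.162]
```

The same `word, *vector = line.split()` pattern is used in the full implementation further below.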
Create Vocabulary Dictionary
Vocabulary is the collection of all unique words present in the training dataset. First the dataset is tokenized into words, then the frequency of each word is counted. The words are then sorted in decreasing order of frequency, so high-frequency words are placed at the beginning of the dictionary.
Dataset = {The peon is ringing the bell}
Vocabulary = {'the': 2, 'peon': 1, 'is': 1, 'ringing': 1, 'bell': 1}
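The vocabulary above can be built with a few lines of standard-library Python; this is a minimal sketch using `collections.Counter` on the sample sentence:

```python
from collections import Counter

# Build a frequency-sorted vocabulary from the sample sentence.
dataset = "The peon is ringing the bell"
tokens = dataset.lower().split()      # tokenize into lowercased words
counts = Counter(tokens)              # count the frequency of each word
vocab = dict(counts.most_common())    # highest-frequency words first

print(vocab)  # {'the': 2, 'peon': 1, 'is': 1, 'ringing': 1, 'bell': 1}
```

Keras's `Tokenizer` (used in the implementation below) does essentially the same counting and sorting internally when `fit_on_texts` is called.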
Algorithm for word embedding:
- Preprocess the text data.
- Create the dictionary (vocabulary).
- Traverse the GloVe file of the chosen dimension and compare each word in it with the words in the dictionary.
- If a match occurs, copy the equivalent vector from the GloVe file into embedding_matrix at the corresponding index.
Below is the implementation:
Python3
# code for GloVe word embedding
from tensorflow.keras.preprocessing.text import Tokenizer
import numpy as np

# sample texts to build the vocabulary from
x = ['text', 'the', 'leader', 'prime', 'natural', 'language']

# create the dict.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x)

# number of unique words in dict.
print("Number of unique words in dictionary=",
      len(tokenizer.word_index))
print("Dictionary is = ", tokenizer.word_index)

# download glove and unzip it in Notebook.
# !unzip glove*.zip

# vocab: 'the': 1, mapping of words with
# integers in seq. 1, 2, 3...
# embedding: 1 -> dense vector
def embedding_for_vocab(filepath, word_index, embedding_dim):
    # add 1 to the vocab size because of the reserved 0 (padding) index
    vocab_size = len(word_index) + 1
    embedding_matrix_vocab = np.zeros((vocab_size, embedding_dim))

    with open(filepath, encoding="utf8") as f:
        for line in f:
            word, *vector = line.split()
            if word in word_index:
                idx = word_index[word]
                embedding_matrix_vocab[idx] = np.array(
                    vector, dtype=np.float32)[:embedding_dim]

    return embedding_matrix_vocab

# matrix for vocab: word_index
embedding_dim = 50
embedding_matrix_vocab = embedding_for_vocab(
    '../glove.6B.50d.txt', tokenizer.word_index, embedding_dim)

print("Dense vector for first word is => ",
      embedding_matrix_vocab[1])
Output: