The GloVe model came out in 2014, a year after the Word2Vec paper. GloVe and Word2Vec are similar in that the embedding generated for a word is determined by the words that occur around it. However, these context words occur with different frequencies: some appear much more often in the text than others. Because of this difference in frequency of occurrence, there is more training data available for some words than for others.
This article is an excerpt from the book Advanced Natural Language Processing with TensorFlow 2 by Ashish Bansal – a one-stop solution for NLP practitioners, ML developers, and data scientists who want to build effective NLP systems that can perform complicated real-world tasks.
Beyond this, Word2Vec does not use co-occurrence statistics in any way. GloVe takes these frequencies into account and posits that co-occurrences provide vital information. The "Global" part of the name refers to the fact that the model considers co-occurrences over the entire corpus. Rather than focusing on the probabilities of co-occurrence themselves, GloVe focuses on ratios of co-occurrence probabilities involving probe words. In the paper, the authors take the words ice and steam as an example to illustrate the concept. Let's say that solid is a probe word used to examine the relationship between ice and steam.
The probability of occurrence of solid given steam is P(solid | steam). Intuitively, we expect this probability to be small. Conversely, the probability of occurrence of solid given ice, P(solid | ice), is expected to be large. If the ratio P(solid | ice) / P(solid | steam) is computed, we expect it to be large. If the same ratio is computed with gas as the probe word, the opposite behavior is expected. In cases where the probe word is equally probable with both words, either because it is unrelated to them or because it is equally related to both, the ratio should be close to 1. An example of a probe word related to both ice and steam is water; an example of a word unrelated to both is fashion. GloVe ensures that this relationship is factored into the embeddings generated for the words. It also includes optimizations for rare co-occurrences, numerical stability issues in computation, and other concerns.
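These relationships can be summarized as ratios (qualitative only; the actual values depend on the corpus statistics):

\[
\frac{P(\text{solid} \mid \text{ice})}{P(\text{solid} \mid \text{steam})} \gg 1, \qquad
\frac{P(\text{gas} \mid \text{ice})}{P(\text{gas} \mid \text{steam})} \ll 1, \qquad
\frac{P(\text{water} \mid \text{ice})}{P(\text{water} \mid \text{steam})} \approx 1 \approx
\frac{P(\text{fashion} \mid \text{ice})}{P(\text{fashion} \mid \text{steam})}
\]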
If pre-trained embeddings are used, we expect an increase in model accuracy. It would be interesting to try this out and see the impact of transfer learning on this model. Let's first see how to use these pre-trained embeddings to predict sentiment. The first step is to load the data. All the code for this exercise is in the file imdb-transfer-learning.ipynb, located in the chapter4-Xfer-learning-BERT directory on GitHub.
Loading IMDb training data
The TensorFlow Datasets (tfds) package will be used to load the data:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import pandas as pd

imdb_train, ds_info = tfds.load(name="imdb_reviews", split="train",
                                with_info=True, as_supervised=True)
imdb_test = tfds.load(name="imdb_reviews", split="test",
                      as_supervised=True)
Note that the additional 50,000 unlabeled reviews in the dataset are ignored for the purposes of this exercise. After the training and test sets are loaded as shown above, the content of the reviews needs to be tokenized and encoded:
# Use the default tokenizer settings
tokenizer = tfds.features.text.Tokenizer()
vocabulary_set = set()
MAX_TOKENS = 0

for example, label in imdb_train:
    some_tokens = tokenizer.tokenize(example.numpy())
    if MAX_TOKENS < len(some_tokens):
        MAX_TOKENS = len(some_tokens)
    vocabulary_set.update(some_tokens)
The code shown above tokenizes the review text and constructs a vocabulary. This vocabulary is used to construct an encoder:
imdb_encoder = tfds.features.text.TokenTextEncoder(vocabulary_set,
                                                   lowercase=True,
                                                   tokenizer=tokenizer)
vocab_size = imdb_encoder.vocab_size
print(vocab_size, MAX_TOKENS)
93931 2525
Note that the text was converted to lowercase before encoding. Converting to lowercase helps reduce the vocabulary size and may benefit the lookup of the corresponding GloVe vectors later on. However, capitalization may contain important information, which can help in tasks such as NER, covered in previous chapters. Also note that not all languages distinguish between capital and small letters. Hence, this particular transformation should be applied only after due consideration.
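As a quick sanity check (not part of the original notebook), the encoder can be applied to a short string; the exact token IDs depend on the vocabulary built above:

# Illustrative only: encode a short sample and decode it back
sample_text = "This movie was fantastic"
sample_ids = imdb_encoder.encode(sample_text)
print(sample_ids)                       # a list of integer token IDs
print(imdb_encoder.decode(sample_ids))  # maps the IDs back to tokens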
Now that the encoder is ready, the data needs to be tokenized and the sequences padded to a maximum length. The following convenience methods help in performing this task:
# transformation functions to be used with the dataset
from tensorflow.keras.preprocessing import sequence

def encode_pad_transform(sample):
    encoded = imdb_encoder.encode(sample.numpy())
    # keep the first 150 tokens of long reviews and pad shorter ones at the end
    pad = sequence.pad_sequences([encoded], padding='post',
                                 truncating='post', maxlen=150)
    return np.array(pad[0], dtype=np.int64)

def encode_tf_fn(sample, label):
    encoded = tf.py_function(encode_pad_transform,
                             inp=[sample],
                             Tout=tf.int64)
    encoded.set_shape([None])
    label.set_shape([])
    return encoded, label
Finally, the data is encoded using the convenience functions above like so:
encoded_train = imdb_train.map(encode_tf_fn,
                               num_parallel_calls=tf.data.experimental.AUTOTUNE)
encoded_test = imdb_test.map(encode_tf_fn,
                             num_parallel_calls=tf.data.experimental.AUTOTUNE)
At this point, all the training and test data is ready for training.
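As a minimal sketch of how these datasets could be prepared for model training (the batch size of 100 is an assumption; it is not specified in this excerpt), the encoded datasets can be batched and prefetched like so:

# Illustrative only: batch and prefetch the encoded datasets before training
BATCH_SIZE = 100  # assumed value for illustration
encoded_train_batched = encoded_train.batch(BATCH_SIZE).prefetch(
    tf.data.experimental.AUTOTUNE)
encoded_test_batched = encoded_test.batch(BATCH_SIZE).prefetch(
    tf.data.experimental.AUTOTUNE)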
Note that by limiting the size of the reviews, only the first 150 tokens are kept for a long review. Typically, the first few sentences of a review provide the context or description, while the latter part contains the conclusion. By limiting ourselves to the first part of the review, valuable information could be lost. The reader is encouraged to try a different truncation scheme, where tokens from the first part of the review are dropped instead of the second part (see the sketch below), and observe the difference in accuracy.
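As a hedged sketch of that alternative (keeping the last 150 tokens instead of the first 150), only the truncating argument of pad_sequences needs to change:

# Illustrative variant: drop tokens from the beginning of long reviews
def encode_pad_transform_keep_end(sample):
    encoded = imdb_encoder.encode(sample.numpy())
    # truncating='pre' removes tokens from the start, keeping the end of the review
    pad = sequence.pad_sequences([encoded], padding='post',
                                 truncating='pre', maxlen=150)
    return np.array(pad[0], dtype=np.int64)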
The next step is the key step in transfer learning: loading the pre-trained GloVe embeddings and using them as the weights of the embedding layer.
Loading pre-trained GloVe embeddings
First, the pre-trained embeddings need to be downloaded and unzipped:
# Download the GloVe embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt
  inflating: glove.6B.100d.txt
  inflating: glove.6B.200d.txt
  inflating: glove.6B.300d.txt
Note that this is a huge download of over 800 MB, so this step may take some time to execute. Upon unzipping, there will be four different files, as shown in the output above. Each file has a vocabulary of 400,000 words; the main difference between them is the dimensionality of the embeddings: 50, 100, 200, and 300 dimensions, respectively.
Of these, the 50-dimensional embeddings are the nearest match to the embedding size used so far, so let's use that file. The file format is quite simple: each line of text has multiple values separated by spaces. The first item of each row is the word itself, and the rest of the items are the values of the vector in each dimension. So, in the 50-dimensional file, each row has 51 columns. These vectors need to be loaded into memory:
dict_w2v = {}
with open('glove.6B.50d.txt', "r") as file:
    for line in file:
        tokens = line.split()
        word = tokens[0]
        vector = np.array(tokens[1:], dtype=np.float32)
        if vector.shape[0] == 50:
            dict_w2v[word] = vector
        else:
            print("There was an issue with " + word)

# let's check the vocabulary size
print("Dictionary Size: ", len(dict_w2v))
Dictionary Size: 400000
If the code processed the file correctly, you shouldn’t see any errors and you should see a dictionary size of 400,000 words. Once these vectors are loaded, an embedding matrix needs to be created.
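A quick spot check (illustrative only; any common English word should be present in the 400,000-word vocabulary) confirms that the vectors have the expected dimensionality:

# Illustrative only: inspect one of the loaded GloVe vectors
print(dict_w2v["movie"].shape)  # expected: (50,)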
Creating a pre-trained embedding matrix using GloVe
So far, we have a dataset, its vocabulary, and a dictionary of GloVe words with their corresponding vectors. However, there is no connection between these two vocabularies yet. The way to connect them is through the creation of an embedding matrix. First, let's initialize an embedding matrix of zeros:
embedding_dim = 50
embedding_matrix = np.zeros((imdb_encoder.vocab_size, embedding_dim))
Note that this is a crucial step. When a pre-trained word list is used, finding a vector for every word in the training/test data is not guaranteed. Recall the discussion on transfer learning earlier, where the source and target domains are different. One way this difference manifests itself is through a mismatch in tokens between the training data and the pre-trained model. As we go through the next steps, this will become more apparent.
After this embedding matrix of zeros is initialized, it needs to be populated. For each word in the vocabulary of reviews, the corresponding vector is retrieved from the GloVe dictionary.
The ID of the word is retrieved using the encoder, and the row of the embedding matrix corresponding to that ID is set to the retrieved vector:
unk_cnt = 0
unk_set = set()
for word in imdb_encoder.tokens:
    embedding_vector = dict_w2v.get(word)
    if embedding_vector is not None:
        tkn_id = imdb_encoder.encode(word)[0]
        embedding_matrix[tkn_id] = embedding_vector
    else:
        unk_cnt += 1
        unk_set.add(word)

# Print how many weren't found
print("Total unknown words: ", unk_cnt)
Total unknown words: 14553
During the data loading step, we saw that the vocabulary contained 93,931 tokens. Of these, 14,553 words could not be found in the GloVe dictionary, which is approximately 15% of the tokens. For these words, the corresponding rows of the embedding matrix will remain zeros. This completes the first step of transfer learning. Now that the setup is done, we need to use these pre-trained embeddings in TensorFlow. Two different models will be tried: the first based on feature extraction and the second on fine-tuning. We dive deeper into the details of these models in the book Advanced Natural Language Processing with TensorFlow 2.
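As a minimal sketch of how the pre-trained matrix can be plugged into a model (an illustration of the general setup, not the book's exact architecture), a Keras Embedding layer can be initialized with the GloVe weights; freezing it corresponds to feature extraction, while leaving it trainable corresponds to fine-tuning:

# Illustrative only: an Embedding layer seeded with the GloVe matrix built above
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    input_dim=imdb_encoder.vocab_size,  # number of rows in embedding_matrix
    output_dim=embedding_dim,           # 50-dimensional GloVe vectors
    weights=[embedding_matrix],         # start from the pre-trained vectors
    trainable=False                     # False = feature extraction; True = fine-tuning
)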
Summary
Deep learning models really shine with large amounts of training data. Having enough labeled data is a constant challenge in the field, especially in NLP. A successful approach that has yielded great results in the last couple of years is that of transfer learning. In this article, we built on the IMDb movie review sentiment analysis and used transfer learning to build models using GloVe (Global Vectors for Word Representation) pre-trained embeddings.
About the Author
Ashish is an AI/ML leader, a well-known speaker, and an astute technologist with over 20 years of experience in the field. He is currently the Director of Recommendations at Twitch where he works on building scalable recommendation systems across a variety of product surfaces, connecting content to people. He has worked on recommendation systems at multiple organizations, most notably Twitter where he led Trends and Events recommendations and at Capital One where he worked on B2B and B2C products. Ashish was also a co-founder of GALE Partners, a full-service digital agency in Toronto, and spent over 9 years at SapientNitro, a leading digital agency.