Introduction
A few months back, when I first started working at Office People, I developed an interest in language models, particularly Word2Vec. Being a native Python user, I naturally concentrated on Gensim’s Word2Vec implementation and looked for papers and tutorials online. I applied and duplicated code snippets from multiple sources directly, as any good data scientist would do. When the results did not hold up, I dug deeper to understand what went wrong with my method, reading through Stack Overflow conversations, Gensim’s Google Groups, and the library’s documentation.
However, I always thought that one of the most important aspects of creating a Word2Vec model was missing. During my experiments, I discovered that lemmatizing the sentences or looking for phrases/bigrams in them had a significant impact on the results and performance of my models. Though the impact of preprocessing varies depending on the dataset and application, I decided to include the data preparation steps in this article and use the fantastic spaCy library alongside it.
Some of these issues irritate me, so I decided to write my own article. I don’t promise that it’s perfect or the best way to implement Word2Vec, just that it’s better than a lot of what’s out there.
Learning Objectives
- Understand word embeddings and their role in capturing semantic relationships.
- Implement Word2Vec models using popular libraries like Gensim or TensorFlow.
- Measure word similarity and calculate distances using Word2Vec embeddings.
- Explore word analogies and semantic relationships captured by Word2Vec.
- Apply Word2Vec in various NLP tasks such as sentiment analysis and machine translation.
- Learn techniques to fine-tune Word2Vec models for specific tasks or domains.
- Handle out-of-vocabulary words using subword information or pre-trained embeddings.
- Understand limitations and trade-offs of Word2Vec, such as word sense disambiguation and sentence-level semantics.
- Dive into advanced topics like subword embeddings and model optimization with Word2Vec.
This article was published as a part of the Data Science Blogathon.
Table of contents
- Introduction
- Brief About Word2Vec
- Implementation of Word2vec
- Setting up the Environment
- Dataset
- Preprocessing
- Cleaning
- Bigrams
- Most Frequent Words
- Separate the Training of the Model into 3 Steps
- Building the Vocabulary Table
- Training of the Model
- Exploring the Model
- Conclusion
- Frequently Asked Questions
Brief About Word2Vec
A Google team of researchers introduced Word2Vec in two papers between September and October 2013. The researchers also published their C implementation alongside the papers. Gensim completed the Python implementation shortly after the first paper.
The underlying assumption of Word2Vec is that two words with similar contexts have similar meanings and, as a result, a similar vector representation from the model. For example, “dog,” “puppy,” and “pup” are frequently used in similar contexts, with similar surrounding words such as “good,” “fluffy,” or “cute,” and thus have a similar vector representation according to Word2Vec.
Based on this assumption, Word2Vec can be used to discover the relationships between words in a dataset, compute their similarity, or use the vector representation of those words as input for other applications like text classification or clustering.
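To make this assumption concrete, here is a minimal sketch in plain Python. It is not Word2Vec itself: it only counts neighbouring words in a made-up toy corpus and compares the resulting count vectors with cosine similarity. The corpus and the helper names (context_counts, cosine) are my own assumptions for illustration.

```python
from collections import Counter
from math import sqrt

# Toy corpus, made up for illustration (not taken from the Simpsons dataset).
toy_sentences = [
    "the cute dog barks".split(),
    "the cute puppy barks".split(),
    "the fluffy dog sleeps".split(),
    "the fluffy puppy sleeps".split(),
    "the red car drives".split(),
    "the fast car drives".split(),
]

def context_counts(sentences, target, window=2):
    """Count the words that appear within `window` positions of `target`."""
    counts = Counter()
    for sent in sentences:
        for i, word in enumerate(sent):
            if word == target:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                counts.update(w for j, w in enumerate(sent[lo:hi], start=lo) if j != i)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

dog, puppy, car = (context_counts(toy_sentences, w) for w in ("dog", "puppy", "car"))
print(cosine(dog, puppy))  # "dog" and "puppy" share every context word here
print(cosine(dog, car))    # "dog" and "car" only share "the"
```

In this toy corpus, "dog" and "puppy" end up with the same neighbours, so their similarity is much higher than that of "dog" and "car". Word2Vec learns dense vectors instead of raw counts, but the intuition is the same.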
Implementation of Word2vec
The idea behind Word2Vec is pretty simple. We’re making an assumption that the meaning of a word can be inferred by the company it keeps. This is analogous to the saying, “Show me your friends, and I’ll tell you who you are”. Here’s an implementation of word2vec.
Setting up the Environment
python==3.6.3
Libraries used:
- xlrd==1.1.0
- spaCy==2.0.12
- gensim==3.4.0
- scikit-learn==0.19.1
- seaborn==0.8
import re # For preprocessing
import pandas as pd # For data handling
from time import time # To time our operations
from collections import defaultdict # For word frequency
import spacy # For preprocessing
import logging # Setting up the loggings to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s",
                    datefmt='%H:%M:%S', level=logging.INFO)
Dataset
This dataset contains information about the characters, locations, episode details, and script lines for over 600 Simpsons episodes dating back to 1989. It is available on Kaggle (~25 MB).
Preprocessing
During preprocessing, we keep only two columns from the dataset: raw_character_text and spoken_words.
- raw_character_text: the character who speaks (useful for tracking preprocessing steps).
- spoken_words: the raw text from the dialogue line
Because we want to do our own preprocessing, we don’t keep normalized_text.
df = pd.read_csv('../input/simpsons_dataset.csv')
df.shape
df.head()
The missing values are from a section of the script where something happens but there is no dialogue. “(Springfield Elementary School: EXT. ELEMENTARY – SCHOOL PLAYGROUND – AFTERNOON)” is an example.
df.isnull().sum()
Cleaning
For each line of dialogue, we are lemmatizing and removing stopwords and non-alphabetic characters.
nlp = spacy.load('en', disable=['ner', 'parser'])
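def cleaning(doc):
    # Lemmatizes and removes stopwords
    # doc needs to be a spacy Doc object
    txt = [token.lemma_ for token in doc if not token.is_stop]
    # Lines of one or two tokens give Word2Vec very little context,
    # so they return None and are dropped later by dropna()
    if len(txt) > 2:
        return ' '.join(txt)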
Remove non-alphabetic characters:
brief_cleaning = (re.sub("[^A-Za-z']+", ' ', str(row)).lower() for row in df['spoken_words'])
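To see what this regular expression does to a single line, here is a small standalone sketch. The sample sentence and the helper name brief_clean are made up for illustration; the pattern is the same one used above.

```python
import re

def brief_clean(text):
    """Replace each run of characters that are not letters or apostrophes
    with a single space, then lowercase the result."""
    return re.sub("[^A-Za-z']+", ' ', str(text)).lower()

sample = "D'oh! Marge, where's my 2nd donut?"  # made-up dialogue line
print(repr(brief_clean(sample)))        # "d'oh marge where's my nd donut "
print(repr(brief_clean(float('nan'))))  # 'nan'
```

Note the str() call: pandas stores missing values as float NaN, so without it re.sub would raise a TypeError on those rows. With it, a missing line becomes the single token "nan", which the cleaning function then discards because it is too short.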
Use spaCy’s nlp.pipe() method to speed up the cleaning process:
t = time()
txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000,
                                         n_threads=-1)]
print('Time to clean up everything: {} mins'.format(round((time() - t) / 60, 2)))
To remove missing values and duplicates, place the results in a DataFrame:
df_clean = pd.DataFrame({'clean': txt})
df_clean = df_clean.dropna().drop_duplicates()
df_clean.shape
Bigrams
Bigrams are a concept used in natural language processing and text analysis. They refer to consecutive pairs of words or characters that appear in a sequence of text. By analyzing bigrams, we can gain insights into the relationships between words or characters in a given text.
Let’s take an example sentence: “I love ice cream”. To identify the bigrams in this sentence, we look at pairs of consecutive words:
“I love”
“love ice”
“ice cream”
Each of these pairs represents a bigram. Bigrams can be useful in various language processing tasks. For example, in language modeling, we can use bigrams to predict the next word in a sentence based on the previous word.
Bigrams can be extended to larger sequences called trigrams (consecutive triplets) or n-grams (consecutive sequences of n words or characters). The choice of n depends on the specific analysis or task at hand.
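The sliding-window idea above fits in a few lines of plain Python. The helper name ngrams is my own, not part of any library used in this article.

```python
def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples found in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love ice cream".split()
print(ngrams(tokens, 2))  # [('I', 'love'), ('love', 'ice'), ('ice', 'cream')]
print(ngrams(tokens, 3))  # [('I', 'love', 'ice'), ('love', 'ice', 'cream')]
```

Keep in mind that this lists every adjacent pair, whereas the phrase detection used in the rest of this section only keeps the pairs that occur together unusually often.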
We use Gensim’s Phrases model to automatically detect common phrases (bigrams) from a list of sentences: https://radimrehurek.com/gensim/models/phrases.html
We do this primarily to capture words like “mr_burns” and “bart_simpson”!
from gensim.models.phrases import Phrases, Phraser
sent = [row.split() for row in df_clean['clean']]
Build the phrase model from the list of sentences:
phrases = Phrases(sent, min_count=30, progress_per=10000)
The goal of Phraser() is to reduce Phrases() memory consumption by discarding model state that is not strictly required for the bigram detection task:
bigram = Phraser(phrases)
Transform the corpus based on the bigrams detected:
sentences = bigram[sent]
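How does the phrase model decide that “mr burns” deserves to become “mr_burns”? As far as I understand Gensim’s default scorer, it follows the formula from the Word2Vec phrase paper: a pair scores high when it occurs together far more often than the individual word counts would suggest. Below is a minimal sketch of that formula in plain Python. All counts are hypothetical, and the threshold of 10.0 is assumed to be the library default.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score for the bigram (a, b): co-occurrence count, discounted by
    min_count, relative to the product of the individual word counts."""
    return (count_ab - min_count) / count_a / count_b * vocab_size

vocab_size = 10_000  # hypothetical vocabulary size
threshold = 10.0     # assumed default threshold

# "mr" is followed by "burns" most of the time: strong candidate
strong = phrase_score(count_a=200, count_b=150, count_ab=120,
                      vocab_size=vocab_size, min_count=30)
# both words are frequent, but they rarely appear together: weak candidate
weak = phrase_score(count_a=5000, count_b=300, count_ab=40,
                    vocab_size=vocab_size, min_count=30)

print(strong > threshold)  # True: the pair would be merged
print(weak > threshold)    # False: the pair stays as two tokens
```

This also shows why min_count matters: any pair seen fewer than min_count times gets a negative score and can never be merged.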
Most Frequent Words
Mostly a sanity check on the effectiveness of the lemmatization, stopword removal, and bigram addition.
word_freq = defaultdict(int)
for sent in sentences:
    for i in sent:
        word_freq[i] += 1
len(word_freq)
sorted(word_freq, key=word_freq.get, reverse=True)[:10]
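The same check can be written with collections.Counter, which also returns the counts next to the words. This sketch uses a tiny made-up corpus (toy_sentences) as a stand-in for the real one.

```python
from collections import Counter

# Stand-in for the real corpus: a list of token lists (made-up data).
toy_sentences = [
    ["homer", "eat", "donut"],
    ["mr_burns", "release", "hound"],
    ["homer", "love", "donut"],
]

word_freq = Counter(word for sent in toy_sentences for word in sent)
print(len(word_freq))            # 7 distinct tokens
print(word_freq.most_common(2))  # [('homer', 2), ('donut', 2)]
```

Seeing the counts makes it easier to spot stopwords that slipped through or bigrams that were never merged.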
Separate the Training of the Model into 3 Steps
For clarity and monitoring, I prefer to divide the training into three distinct steps.
- Word2Vec():
  - In this first step, I set up the model’s parameters one by one.
  - I intentionally leave the model uninitialized by not providing the sentences parameter.
- build_vocab():
  - It initializes the model by building the vocabulary from a sequence of sentences.
  - Using the logs, I can track the progress and, more importantly, the effect of min_count and sample on the word corpus. I found that these two parameters, particularly sample, have a significant impact on model performance. Displaying both makes it easier to manage their influence accurately.
- .train():
  - Finally, the model is trained.
  - The logs at this step are mostly useful for monitoring.
import multiprocessing
from gensim.models import Word2Vec
cores = multiprocessing.cpu_count() # Count the number of cores in a computer
w2v_model = Word2Vec(min_count=20,
                     window=2,
                     size=300,
                     sample=6e-5,
                     alpha=0.03,
                     min_alpha=0.0007,
                     negative=20,
                     workers=cores-1)
Gensim implementation of word2vec: https://radimrehurek.com/gensim/models/word2vec.html
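Since sample has such a visible effect, it helps to see what it does. The sketch below reproduces the subsampling formula from the original C implementation, which I believe Gensim follows: the more frequent a word, the lower the probability that each of its occurrences is kept. The function name and the corpus size are assumptions made for illustration.

```python
from math import sqrt

def keep_probability(word_count, total_words, sample=6e-5):
    """Probability of keeping one occurrence of a word during training."""
    freq = word_count / total_words
    return min(1.0, (sqrt(freq / sample) + 1) * (sample / freq))

total = 1_000_000  # hypothetical number of tokens in the corpus
for count in (50, 1_000, 50_000):
    print(count, round(keep_probability(count, total), 3))
```

With sample=6e-5, a word making up 0.1% of the corpus keeps only about 30% of its occurrences, while rare words are never dropped. Lowering sample therefore shifts the training signal toward rarer, more informative words.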
Building the Vocabulary Table
Word2Vec requires us to create the vocabulary table (by digesting all of the words, filtering out the unique words, and performing some basic counts on them):
t = time()
w2v_model.build_vocab(sentences, progress_per=10000)
print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))
The vocabulary table is crucial for encoding words as indices and looking up their corresponding word embeddings during training or inference. It forms the foundation for training Word2Vec models and enables efficient word representation in the continuous vector space.
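Conceptually, the vocabulary table is a mapping from each retained word to an integer index. Here is a minimal plain-Python sketch of that idea, not Gensim’s actual data structure; the function name and toy sentences are made up.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Map each word seen at least `min_count` times to an integer index,
    most frequent words first. Also return the raw counts."""
    counts = Counter(word for sent in sentences for word in sent)
    kept = [w for w, c in counts.most_common() if c >= min_count]
    return {word: idx for idx, word in enumerate(kept)}, counts

toy_sentences = [
    ["homer", "eat", "donut"],
    ["homer", "love", "donut"],
    ["bart", "eat", "donut"],
]
word2idx, counts = build_vocab(toy_sentences, min_count=2)
print(word2idx)  # {'donut': 0, 'homer': 1, 'eat': 2}
```

Words below min_count ("love" and "bart" here) never get an index, so the model has no vector for them. That is the out-of-vocabulary problem in miniature.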
Training of the Model
Training a Word2Vec model involves feeding a corpus of text data into the algorithm and optimizing the model’s parameters to learn word embeddings. The training parameters for Word2Vec include various hyperparameters and settings that affect the training process and the quality of the resulting word embeddings. Here are some commonly used training parameters for Word2Vec:
- total_examples = int – The number of sentences;
- epochs = int – The number of iterations (epochs) over the corpus; typical values are 10, 20, or 30
t = time()
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)
print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))
We are calling init_sims() to make the model much more memory-efficient since we do not intend to train it further:
w2v_model.init_sims(replace=True)
The parameters passed to Word2Vec() earlier control aspects such as the context window size (window), the trade-off between frequent and rare words (min_count, sample), the learning rate (alpha, min_alpha), the training algorithm, and the number of negative samples for negative sampling (negative). Adjusting them can impact the quality, efficiency, and memory requirements of the Word2Vec training process.
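To illustrate what the window parameter means, the sketch below lists the training examples a CBOW-style model would see for one short sentence with window=2. I am assuming here that the model above uses CBOW, which I believe is Gensim’s default when sg is not set. The real implementation also shrinks the window at random, which this sketch ignores.

```python
def cbow_examples(tokens, window=2):
    """Return (context_words, target) examples for each position in `tokens`."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples

examples = cbow_examples(["homer", "eat", "donut", "quickly"], window=2)
for context, target in examples:
    print(context, "->", target)
```

Skip-gram uses the same windows in the other direction: each target word is used to predict its context words one at a time.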
Exploring the Model
Once a Word2Vec model is trained, you can explore it to gain insights into the learned word embeddings and extract useful information. Here are some ways to explore the Word2Vec model:
Most Similar To
In Word2Vec, you can find the words most similar to a given word based on the learned word embeddings. The similarity is typically calculated using cosine similarity. Here’s an example of finding words most similar to a target word using Word2Vec:
Let’s see what we get for the show’s main character:
similar_words = w2v_model.wv.most_similar(positive=["homer"])
for word, similarity in similar_words:
print(f"{word}: {similarity}")
Just to be clear, when we look at the words that are most similar to “homer,” we do not necessarily get his family members, personality traits, or even his most memorable quotes.
Compare that to what the bigram “homer_simpson” returns:
w2v_model.wv.most_similar(positive=["homer_simpson"])
What about Marge now?
w2v_model.wv.most_similar(positive=["marge"])
Let’s check Bart now:
w2v_model.wv.most_similar(positive=["bart"])
Looks like it is making sense!
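For the curious, here is roughly what a most-similar query does: compute the cosine similarity between the query vector and every other vector, then sort. This is a plain-Python sketch with made-up 3-dimensional vectors, not Gensim’s optimized implementation.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors for illustration; the model above uses 300 dimensions.
toy_vectors = {
    "homer": [0.9, 0.1, 0.0],
    "marge": [0.8, 0.3, 0.1],
    "bart": [0.5, 0.6, 0.1],
    "donut": [0.0, 0.2, 0.9],
}
for word, score in most_similar("homer", toy_vectors):
    print(word, round(score, 3))
```

Cosine similarity only looks at the angle between vectors, not their length, which is why two words can be close even if one of them is much more frequent than the other.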
Similarities
Here’s an example of calculating the cosine similarity between two words using Word2Vec:
w2v_model.wv.similarity("moe_'s", 'tavern')
Who could forget Moe’s tavern? Not Barney.
w2v_model.wv.similarity('maggie', 'baby')
Maggie is indeed the most renowned baby in The Simpsons!
w2v_model.wv.similarity('bart', 'nelson')
Bart and Nelson, though friends, are not that close. Makes sense!
Odd-One-Out
Here, we ask our model to give us the word that does not belong to the list!
Between Jimbo, Milhouse, and Kearney, who is the one who is not a bully?
w2v_model.wv.doesnt_match(['jimbo', 'milhouse', 'kearney'])
What if we compared the friendship between Nelson, Bart, and Milhouse?
w2v_model.wv.doesnt_match(["nelson", "bart", "milhouse"])
Seems like Nelson is the odd one here!
Last but not least, how is the relationship between Homer and his two sisters-in-law?
w2v_model.wv.doesnt_match(['homer', 'patty', 'selma'])
Damn, they really do not like you, Homer!
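The odd-one-out query has a simple geometric reading: average the (length-normalized) vectors of the group, then pick the word that points least in that average direction. The sketch below is my plain-Python reading of that idea with made-up 2-D vectors, so treat it as an approximation of what the library does.

```python
from math import sqrt

def unit(v):
    """Scale a vector to length 1."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def doesnt_match(words, vectors):
    """Return the word whose unit vector is least aligned with the group mean."""
    units = {w: unit(vectors[w]) for w in words}
    dims = len(next(iter(units.values())))
    mean = unit([sum(u[d] for u in units.values()) / len(units) for d in range(dims)])
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], mean)))

# Made-up vectors: the two bullies point one way, Milhouse another.
toy_vectors = {
    "jimbo": [0.9, 0.1],
    "kearney": [0.8, 0.2],
    "milhouse": [0.1, 0.9],
}
print(doesnt_match(["jimbo", "milhouse", "kearney"], toy_vectors))  # milhouse
```

Because the answer depends on the group mean, the method always returns something, even when all the words are closely related. The result is only as meaningful as the list you feed it.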
Analogy Difference
Which word is to woman as Homer is to Marge?
w2v_model.wv.most_similar(positive=["woman", "homer"], negative=["marge"], topn=3)
“man” comes in first position, which looks about right!
Which word is to woman as Bart is to man?
w2v_model.wv.most_similar(positive=["woman", "bart"], negative=["man"], topn=3)
Lisa is Bart’s sister, his female counterpart!
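Analogy queries boil down to vector arithmetic: add the positive words, subtract the negative ones, and look for the nearest remaining word. Here is a plain-Python sketch of that idea using made-up 2-D vectors, where one axis loosely stands for adult vs. child and the other for male vs. female. It mirrors my understanding of how most_similar combines positive and negative words, not the exact library code.

```python
from math import sqrt

def unit(v):
    """Scale a vector to length 1."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(positive, negative, vectors, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative),
    skipping the words used in the query itself."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    signed = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, sign in signed:
        for d, x in enumerate(unit(vectors[word])):
            query[d] += sign * x
    query = unit(query)
    scores = [
        (w, sum(a * b for a, b in zip(unit(vec), query)))
        for w, vec in vectors.items()
        if w not in positive and w not in negative
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: axis 0 ~ adult (+) vs. child (-), axis 1 ~ male (+) vs. female (-).
toy_vectors = {
    "man": [1.0, 1.0],
    "woman": [1.0, -1.0],
    "bart": [-1.0, 1.0],
    "lisa": [-1.0, -1.0],
    "homer": [1.0, 0.8],
}
print(analogy(positive=["woman", "bart"], negative=["man"], vectors=toy_vectors))
```

In this toy space, bart - man + woman lands exactly on lisa. Real embeddings are noisier, which is why it is worth looking at the top few results rather than only the first.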
Conclusion
In conclusion, Word2Vec is a widely used algorithm in the field of natural language processing (NLP) that learns word embeddings by representing words as dense vectors in a continuous vector space. It captures semantic and syntactic relationships between words based on their co-occurrence patterns in a large corpus of text.
Word2Vec works by utilizing either the Continuous Bag-of-Words (CBOW) or Skip-gram model, which are neural network architectures. Word embeddings, generated by Word2Vec, are dense vector representations of words that encode semantic and syntactic information. They allow for mathematical operations like word similarity calculation and can be used as features in various NLP tasks.
Key Takeaways
- Word2Vec learns word embeddings, dense vector representations of words.
- It analyzes co-occurrence patterns in a text corpus to capture semantic relationships.
- The algorithm uses a neural network with either CBOW or Skip-gram model.
- Word embeddings enable word similarity calculations.
- They can be used as features in various NLP tasks.
- Word2Vec requires a large training corpus for accurate embeddings.
- It does not capture word sense disambiguation.
- Word order is not considered in Word2Vec.
- Out-of-vocabulary words may pose challenges.
- Despite limitations, Word2Vec has significant applications in NLP.
While Word2Vec is a powerful algorithm, it has some limitations. It requires a large amount of training data to learn accurate word embeddings. It treats each word as an atomic entity and does not capture word sense disambiguation. Out-of-vocabulary words may pose a challenge, as they have no pre-existing embeddings.
Word2Vec has significantly contributed to advancements in NLP and continues to be a valuable tool for tasks such as information retrieval, sentiment analysis, machine translation, and more.
Frequently Asked Questions
Q: What is Word2Vec?
A: Word2Vec is a popular algorithm for natural language processing (NLP) tasks. It uses a shallow, two-layer neural network to learn word embeddings, representing words as dense vectors in a continuous vector space. Word2Vec captures the semantic and syntactic relationships between words based on their co-occurrence patterns in a large text corpus.
Q: How does Word2Vec work?
A: Word2Vec uses a technique called “distributed representation” to learn word embeddings. It employs a neural network architecture, either the Continuous Bag-of-Words (CBOW) or Skip-gram model. The CBOW model predicts the target word based on its context words, while the Skip-gram model predicts the context words given a target word. During training, the model adjusts the word vectors to maximize the likelihood of correctly predicting the target or context words.
Q: What are word embeddings?
A: Word embeddings are dense vector representations of words in a continuous vector space. They encode semantic and syntactic information about words, capturing their relationships based on their distributional properties in the training corpus. They enable mathematical operations like word similarity calculation and can be used as features in various NLP tasks, such as sentiment analysis and machine translation.