Introduction :
This article focuses on basic feature extraction techniques in NLP for analysing the similarities between pieces of text. Natural Language Processing (NLP) is a branch of computer science and machine learning that deals with training computers to process large amounts of human (natural) language data. Briefly, NLP is the ability of computers to understand human language.
Need of feature extraction techniques :
Machine learning algorithms learn from a pre-defined set of features from the training data to produce output for the test data. The main problem in working with language processing, however, is that machine learning algorithms cannot work on raw text directly. So, we need feature extraction techniques to convert text into a matrix (or vector) of features. Some of the most popular methods of feature extraction are :
- Bag-of-Words
- TF-IDF
Bag of Words:
The bag of words model is used for text representation and feature extraction in natural language processing and information retrieval tasks. It represents a text document as a multiset of its words, disregarding grammar and word order, but keeping the frequency of words. This representation is useful for tasks such as text classification, document similarity, and text clustering.
Bag-of-Words is one of the most fundamental methods to transform tokens into a set of features. The BoW model is used in document classification, where each word is used as a feature for training the classifier. For example, in a task of review-based sentiment analysis, the presence of words like ‘fabulous’ or ‘excellent’ indicates a positive review, while words like ‘annoying’ or ‘poor’ point to a negative review. There are three steps in creating a BoW model :
- The first step is text-preprocessing which involves:
- converting the entire text into lower case characters.
- removing all punctuations and unnecessary symbols.
- The second step is to create a vocabulary of all unique words from the corpus. Let’s suppose we have a set of movie reviews. Let’s consider three of these reviews, which are as follows :
- good movie
- not a good movie
- did not like
- Now, we consider all the unique words from the above set of reviews to create a vocabulary, which is going to be as follows :
{good, movie, not, a, did, like}
- In the third step, we create a matrix of features by assigning a separate column to each word, while each row corresponds to a review. This process is known as Text Vectorization. Each entry in the matrix signifies the presence (or absence) of the word in the review: we put 1 if the word is present in the review, and 0 if it is not present. A short code sketch reproducing this matrix follows the table below.
For the above example, the matrix of features will be as follows :
| | good | movie | not | a | did | like |
|---|---|---|---|---|---|---|
| good movie | 1 | 1 | 0 | 0 | 0 | 0 |
| not a good movie | 1 | 1 | 1 | 1 | 0 | 0 |
| did not like | 0 | 0 | 1 | 0 | 1 | 1 |
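The presence/absence matrix above can be reproduced in a few lines of code. The following is a minimal sketch (not part of the original example code) that assumes scikit-learn and pandas are installed; CountVectorizer with binary=True gives the 1/0 matrix, although it orders the columns alphabetically rather than in the order shown in the table.
Python3
# Reproducing the presence/absence matrix with scikit-learn's CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

reviews = ["good movie", "not a good movie", "did not like"]

# The relaxed token_pattern keeps single-character tokens such as "a"
vectorizer = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
matrix = vectorizer.fit_transform(reviews)

print(pd.DataFrame(matrix.toarray(),
                   columns=vectorizer.get_feature_names_out(),
                   index=reviews))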
A major drawback of this model is that the order of occurrence of words is lost, as we create a vector of tokens without regard to their ordering. However, we can address this problem by considering N-grams (mostly bigrams) in addition to individual words (i.e. unigrams), which preserves some local ordering of words. If we consider the unigrams and bigrams from the given reviews, the above table (showing only a few of the resulting columns) would look like:
| | good movie | movie | did not | a | … |
|---|---|---|---|---|---|
| good movie | 1 | 1 | 0 | 0 | … |
| not a good movie | 1 | 1 | 0 | 1 | … |
| did not like | 0 | 0 | 1 | 0 | … |
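As a quick illustration of how such a table is built, the sketch below (an illustrative example under the same assumptions as before, not the article’s own code) uses CountVectorizer’s ngram_range parameter to extract unigrams and bigrams from the same three reviews.
Python3
# Extracting unigram and bigram features with ngram_range=(1, 2)
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["good movie", "not a good movie", "did not like"]

vectorizer = CountVectorizer(binary=True, ngram_range=(1, 2),
                             token_pattern=r"(?u)\b\w+\b")
matrix = vectorizer.fit_transform(reviews)

print(vectorizer.get_feature_names_out())
# ['a' 'a good' 'did' 'did not' 'good' 'good movie' 'like' 'movie' 'not'
#  'not a' 'not like']
print(matrix.toarray())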
However, this table will come out to be very large, as there can be a lot of possible bigrams when we consider all consecutive word pairs. Also, using N-grams can result in a huge, sparse matrix (one with a lot of 0’s) if the size of the vocabulary is large, making the computation really complex. Thus, we have to remove some N-grams based on their frequency.
We can always remove high-frequency N-grams, because they appear in almost all documents. These high-frequency N-grams are generally articles, determiners, etc., commonly called stop words. Similarly, we can remove low-frequency N-grams, because these are really rare (they generally appear in only 1 or 2 reviews) and are often just typos (or typing mistakes). Medium-frequency N-grams are generally considered the most ideal.
However, some N-grams are really rare in our corpus yet highlight a specific issue. Let’s suppose there is a review that says – “Wi-Fi breaks often”. Here, the N-gram ‘Wi-Fi breaks’ can’t be too frequent, but it highlights a major problem that needs to be looked into. Our BoW model would not capture such N-grams, since their frequency is really low. To solve this type of problem, we need another model, i.e. the TF-IDF Vectorizer, which we will study next.
Code : Python code for creating a BoW model is:
Python3
# Creating the Bag of Words model
import nltk

# `dataset` is assumed to be a list of preprocessed review strings
word2count = {}
for data in dataset:
    words = nltk.word_tokenize(data)
    for word in words:
        if word not in word2count:
            word2count[word] = 1
        else:
            word2count[word] += 1
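The frequency-based pruning described above can also be sketched with CountVectorizer: the stop_words, max_df and min_df parameters drop very common and very rare n-grams respectively. The corpus below is a small hypothetical set of reviews, used only to make the sketch runnable; on a real corpus the thresholds would be tuned.
Python3
# Sketch of frequency-based pruning of unigrams/bigrams with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical preprocessed reviews (illustration only)
documents = [
    "good movie",
    "not a good movie",
    "did not like",
    "good acting bad movie",
    "did not like the acting",
]

vectorizer = CountVectorizer(
    ngram_range=(1, 2),    # consider unigrams and bigrams
    stop_words="english",  # drop very common function words (articles, determiners, ...)
    max_df=0.9,            # ignore n-grams appearing in more than 90% of the documents
    min_df=2,              # ignore n-grams appearing in fewer than 2 documents
)
bow = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())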
Issues of Bag of Words:
The following are some of the issues with the Bag of Words model for text representation and analysis:
- High dimensionality: The resulting feature space can be very high dimensional, which may lead to issues with overfitting and computational efficiency.
- Lack of context information: The bag of words model only considers the frequency of words in a document, disregarding grammar, word order, and context.
- Insensitivity to word associations: The bag of words model doesn’t consider the associations between words, and the semantic relationships between words in a document.
- Lack of semantic information: As the bag of words model only considers individual words, it does not capture semantic relationships or the meaning of words in context.
- Importance of stop words: Stop words, such as “the”, “and”, “a”, etc., can have a large impact on the bag of words representation of a document, even though they may not carry much meaning.
- Sparsity: For many applications, the bag of words representation of a document can be very sparse, meaning that most entries in the resulting feature vector will be zero. This can lead to issues with computational efficiency and difficulty in interpretability.
TF-IDF Vectorizer :
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used for information retrieval and natural language processing tasks. It reflects the importance of a word in a document relative to an entire corpus. The basic idea is that a word that occurs frequently in a document but rarely in the entire corpus is more informative than a word that occurs frequently in both the document and the corpus.
TF-IDF is used for:
1. Text retrieval and information retrieval systems
2. Document classification and text categorization
3. Text summarization
4. Feature extraction for text data in machine learning algorithms.
TF-IDF stands for term frequency-inverse document frequency. It gives weight to terms that are not very frequent in the corpus but hold great importance. The TF-IDF value increases proportionally to the number of times a word appears in a document and decreases with the number of documents in the corpus that contain the word. It is composed of 2 sub-parts, which are :
- Term Frequency (TF)
- Inverse Document Frequency (IDF)
Term Frequency (TF) : Term frequency specifies how frequently a term appears in the document. It can be thought of as the probability of finding a word within the document. It is calculated as the number of times a word occurs in a review, divided by the total number of words in the review. It is formulated as:

$$\mathrm{tf}(t, D) = \frac{f_{t,D}}{\sum_{t' \in D} f_{t',D}}$$

A different scheme for calculating tf is log normalization, formulated as:

$$\mathrm{tf}(t, D) = \log\left(1 + f_{t,D}\right)$$
where $f_{t,D}$ is the frequency of the term t in document D.

Inverse Document Frequency (IDF) : The inverse document frequency is a measure of whether a term is rare or frequent across the documents in the entire corpus. It highlights those words which occur in very few documents across the corpus; in simple terms, the words that are rare have a high IDF score. IDF is a log-normalised value, obtained by dividing the total number of documents in the corpus by the number of documents containing the term, and taking the logarithm of that ratio:

$$\mathrm{idf}(t, D) = \log\frac{N}{\mathrm{df}(t)}$$

where $N$ is the total number of documents in the corpus and $\mathrm{df}(t)$ is the count of documents in the corpus which contain the term t. Since the ratio inside the IDF’s log function is always greater than or equal to 1, the value of IDF (and thus tf-idf) is greater than or equal to 0. When a term appears in a large number of documents, the ratio inside the logarithm approaches 1, and the IDF gets closer to 0.

Term Frequency-Inverse Document Frequency (TF-IDF) : TF-IDF is the product of TF and IDF. It is formulated as:

$$\mathrm{tfidf}(t, D) = \mathrm{tf}(t, D) \times \mathrm{idf}(t, D)$$
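To make the formulas above concrete, here is a small, self-contained sketch (illustrative only, assuming the natural logarithm and plain whitespace tokenisation) that implements tf, idf and their product directly.
Python3
# Direct implementation of the tf, idf and tf-idf formulas above
import math

def tf(term, document):
    # term frequency: occurrences of the term divided by total terms in the document
    words = document.split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # inverse document frequency: log of (total documents / documents containing the term);
    # assumes the term appears in at least one document
    df = sum(1 for doc in corpus if term in doc.split())
    return math.log(len(corpus) / df)

def tf_idf(term, document, corpus):
    # tf-idf is the product of term frequency and inverse document frequency
    return tf(term, document) * idf(term, corpus)

corpus = ["good movie", "not a good movie", "did not like"]
print(tf_idf("movie", "not a good movie", corpus))  # 0.25 * log(3/2) ≈ 0.101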
A high TF-IDF score is obtained by a term that has a high frequency in a document and a low document frequency in the corpus. For a word that appears in almost all documents, the IDF value approaches 0, making the tf-idf value also come closer to 0. The TF-IDF value is high when both TF and IDF values are high, i.e. the word is rare in the whole corpus but frequent within a particular document. Let’s take the same example to understand this better:
- good movie
- not a good movie
- did not like
In this example, each sentence is a separate document. Considering a vocabulary of unigrams and bigrams, we calculate the TF-IDF values for a few of these tokens :
| | good movie | movie | did not |
|---|---|---|---|
| good movie | 1*log(3/2) = 0.17 | 1*log(3/2) = 0.17 | 0*log(3/1) = 0 |
| not a good movie | 1*log(3/2) = 0.17 | 1*log(3/2) = 0.17 | 0*log(3/1) = 0 |
| did not like | 0*log(3/2) = 0 | 0*log(3/2) = 0 | 1*log(3/1) = 0.47 |
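The entries in this table can be verified with a couple of lines of arithmetic. In the sketch below (a simple check, with TF taken as the raw count of the token in the review and a base-10 logarithm, matching the numbers above), the membership test is a plain substring check, which is adequate for this toy corpus.
Python3
# Verifying the tf-idf values from the table (raw counts, base-10 log)
import math

docs = ["good movie", "not a good movie", "did not like"]

def raw_tfidf(token, doc):
    tf = doc.count(token)                    # raw count of the token in this review
    df = sum(1 for d in docs if token in d)  # number of reviews containing the token
    return tf * math.log10(len(docs) / df)

print(raw_tfidf("good movie", "good movie"))  # ≈ 0.176, shown as 0.17 above
print(raw_tfidf("did not", "did not like"))   # ≈ 0.477, shown as 0.47 above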
Here, we observe that the bigram ‘did not’ is rare (it appears in only one document) compared to the other tokens, and thus has a higher tf-idf score. Code : Using scikit-learn’s TfidfVectorizer class to calculate tf-idf scores for any corpus:
Python3
# Calculating tf-idf values with scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

texts = ["good movie", "not a good movie", "did not like"]

# min_df/max_df are set so that nothing is pruned from this tiny three-review corpus
tfidf = TfidfVectorizer(min_df=1, max_df=1.0, ngram_range=(1, 2))
features = tfidf.fit_transform(texts)

# Note: TfidfVectorizer applies IDF smoothing and L2 normalisation by default,
# so these values will differ from the hand-calculated table above
print(pd.DataFrame(features.todense(), columns=tfidf.get_feature_names_out()))
On a concluding note, we can say that though Bag-of-Words is one of the most fundamental methods for feature extraction and text vectorization, it fails to capture certain nuances of the text. This problem is addressed by the TF-IDF Vectorizer, another feature extraction method, which gives weight to terms that are not frequent in the entire corpus but carry important information.
Issues of TF-IDF :
The following are some of the issues with using TF-IDF for text representation and analysis:
- High dimensionality: The resulting feature space can be very high dimensional, which may lead to issues with overfitting and computational efficiency.
- Lack of context information: TF-IDF only considers the frequency of words in a document, disregarding the context and meaning of words.
- Domain dependence: The results of TF-IDF can be domain-specific, as the frequency and importance of words can vary greatly depending on the domain of the text.
- Insensitivity to word associations: TF-IDF doesn’t consider the associations between words, and the semantic relationships between words in a document.
- Lack of semantic information: As TF-IDF only considers individual words, it does not capture semantic relationships or the meaning of words in context.
- Importance of stop words: Stop words, such as “the”, “and”, “a”, etc., can have a large impact on the TF-IDF representation of a document, even though they may not carry much meaning.