Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. It is similar to stemming, but it takes context into account, in particular the word's part of speech. As a result, it links forms that share a meaning to one real dictionary word, the lemma, rather than to a truncated stem.
Text preprocessing often includes stemming, lemmatization, or both. The two terms are frequently confused, and some people treat them as the same thing. They are not: a stemmer strips suffixes by rule, whereas a lemmatizer performs a morphological analysis of the word. For that reason lemmatization is generally preferred when accuracy matters more than speed.
Applications of lemmatization include:
- Comprehensive retrieval systems such as search engines.
- Compact indexing.
Examples of lemmatization:
-> rocks : rock
-> corpora : corpus
-> better : good
One major difference from stemming is that lemmatize() takes a part-of-speech parameter, “pos”. If it is not supplied, the default is “n” (noun).
Below is an implementation of lemmatization using NLTK:
Python3
# import these modules
# (the WordNet data must be available: run nltk.download("wordnet") once)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

# a denotes adjective in "pos"
print("better :", lemmatizer.lemmatize("better", pos="a"))
Output:
rocks : rock
corpora : corpus
better : good
NLTK (Natural Language Toolkit) is a Python library for natural language processing. Its nltk.stem module provides the WordNetLemmatizer class, which lemmatizes words by looking them up in the WordNet lexical database.
Lemmatization reduces a word to its base or dictionary form, known as the lemma. For example, the lemma of “cats” is “cat”, and the lemma of “running”, used as a verb, is “run”. With WordNetLemmatizer the verb case needs pos="v", because the default part of speech is noun.
Advantages of Lemmatization with NLTK:
- Improves text analysis accuracy: Reducing words to their dictionary form lets inflected forms of the same word be identified and analyzed as one item.
- Reduces vocabulary size: Because inflected forms collapse onto one lemma, the number of distinct tokens shrinks, which makes large datasets easier to handle.
- Better search results: Different forms of a word are reduced to a common base form, so a query can match documents that use another form of the same word.
- Useful for feature extraction: Lemmas give more consistent features than raw tokens when extracting features from text for machine learning tasks.
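The vocabulary and search points above can be illustrated with a tiny inverted index. This is only a sketch: TOY_LEMMAS and toy_lemmatize are hypothetical stand-ins for a real lemmatizer, and build_index and search are names chosen for this example, not part of NLTK.

```python
from collections import defaultdict

# Hypothetical stand-in for a real lemmatizer; covers only this demo's words.
TOY_LEMMAS = {"rocks": "rock", "corpora": "corpus", "cats": "cat"}


def toy_lemmatize(token):
    """Look the token up in the toy table; unknown tokens pass through."""
    return TOY_LEMMAS.get(token, token)


def build_index(docs, normalize):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[normalize(token)].add(doc_id)
    return index


def search(index, query, normalize):
    """Return ids of documents that contain every normalized query token."""
    postings = [index.get(normalize(t), set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()


docs = {1: "rocks and cats", 2: "a rock", 3: "two corpora", 4: "one corpus"}

raw_index = build_index(docs, lambda t: t)      # no normalization
lemma_index = build_index(docs, toy_lemmatize)  # lemma-based keys

print(len(raw_index), len(lemma_index))            # number of distinct keys
print(search(raw_index, "rock", lambda t: t))      # exact-form matches only
print(search(lemma_index, "rock", toy_lemmatize))  # matches "rocks" as well
```

The lemma-based index has fewer keys, and a query for “rock” also finds the document that only contains “rocks”. In practice, a function such as WordNetLemmatizer().lemmatize could be passed as normalize in place of the toy table.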
Disadvantages of Lemmatization with NLTK:
- Requires language resources: Lemmatization depends on knowledge of the language and of the rules governing word formation, usually in the form of a dictionary. NLTK's WordNetLemmatizer relies on WordNet, which covers English. If such resources are not available for a language, the accuracy of the lemmatizer suffers.
- Time-consuming: Lemmatization is typically slower than stemming, since it involves analyzing the word and performing a lookup in a dictionary or a database of word forms.
- May not suit latency-sensitive applications: Because of this extra cost, lemmatization may be a poor fit for real-time applications that require quick response times.
- May lead to ambiguity: A single surface form can have different lemmas depending on the context in which it is used. For example, “meeting” is its own lemma as a noun, but its lemma as a verb is “meet”. If the part of speech is missing or wrong, the lemmatizer may return the wrong lemma.
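Two of the drawbacks above, ambiguity and lookup cost, are commonly reduced by passing a part of speech for each token and by caching repeated lookups. The following sketch shows one way this might be wired together. All three function names are hypothetical, the tag mapping is deliberately coarse, and lemmatize stands for any callable that accepts (word, pos).

```python
from functools import lru_cache


def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet pos code, defaulting to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb (coarse: any tag starting with R)
    return "n"


def make_cached_lemmatizer(lemmatize, maxsize=50_000):
    """Wrap lemmatize(word, pos) so repeated (word, pos) pairs skip the lookup."""

    @lru_cache(maxsize=maxsize)
    def cached(word, pos):
        return lemmatize(word, pos)

    return cached


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (token, penn_tag) pairs using the pos implied by each tag."""
    return [
        lemmatize(token.lower(), penn_to_wordnet_pos(tag))
        for token, tag in tagged_tokens
    ]
```

With NLTK, the (token, tag) pairs could come from nltk.pos_tag, and lemmatize could be WordNetLemmatizer().lemmatize wrapped by make_cached_lemmatizer. Whether caching helps enough for a real-time setting depends on the workload and should be measured.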