NLP | Likely Word Tags

23 July 2024

nltk.probability.FreqDist is used to find the most common words by counting word frequencies in the treebank corpus. A ConditionalFreqDist is then created for the tagged words, counting the frequency of every tag for every word. These counts are used to construct a model with the most frequent words as keys and the most frequent tag for each word as the value.

Code #1 : Creating the function

Python3

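# Loading libraries
from nltk.probability import FreqDist, ConditionalFreqDist

# Build a {word: most likely tag} model for the `limit` most frequent words
def word_tag_model(words, tagged_words, limit = 200):

    # Count how often each word occurs
    fd = FreqDist(words)
    # For each word, count how often each tag occurs
    cfd = ConditionalFreqDist(tagged_words)
    # Keep only the `limit` most frequent words
    most_freq = (word for word, count in fd.most_common(limit))

    # Map each frequent word to its most frequent tag
    return dict((word, cfd[word].max())
                for word in most_freq)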
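
To see what the resulting model looks like, the same counting logic can be sketched on a tiny hand-made sample. This is a dependency-free illustration that swaps NLTK's FreqDist and ConditionalFreqDist for collections.Counter; the name toy_word_tag_model and the sample data are made up for this example and are not part of the original code.

```python
from collections import Counter, defaultdict

def toy_word_tag_model(words, tagged_words, limit=200):
    # Word frequencies (the role FreqDist plays above)
    word_counts = Counter(words)
    # Per-word tag frequencies (the role ConditionalFreqDist plays above)
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
    # Most frequent tag for each of the `limit` most frequent words
    return {word: tag_counts[word].most_common(1)[0][0]
            for word, _ in word_counts.most_common(limit)}

# Hypothetical sample data, not taken from the treebank corpus
tagged = [("the", "DT"), ("run", "NN"), ("the", "DT"),
          ("run", "VB"), ("run", "VB"), ("dog", "NN")]
words = [word for word, _ in tagged]

toy_model = toy_word_tag_model(words, tagged, limit=2)
print(toy_model)
```

With limit=2 only the two most frequent words are kept, so "dog" is left out of the model, and "run" is mapped to "VB" because that tag occurs more often for it than "NN".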

Code #2 : Using the function with UnigramTagger

Python3

# Loading libraries
# (word_tag_model is the function from Code #1, assumed to be saved in tag_util.py)
from tag_util import word_tag_model
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
 
# initializing training and testing set    
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]
 
# Initializing the model
model = word_tag_model(treebank.words(), 
                       treebank.tagged_words())
 
# Initializing the UnigramTagger from the model alone (no training sentences)
tag = UnigramTagger(model = model)
 
print ("Accuracy : ", tag.evaluate(test_data))

Output :

Accuracy : 0.559680552557738
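
A likely reason for the modest score is coverage: a model-only tagger has no tag for any word outside its 200 entries, and every such token counts as wrong. A minimal plain-Python sketch of that effect follows. The model, the sentence, and the helpers lookup_tag and token_accuracy are all hypothetical and are not NLTK functions.

```python
def lookup_tag(model, words):
    # Tag each word from the model; words not in the model get None
    return [(word, model.get(word)) for word in words]

def token_accuracy(predicted, gold):
    # Fraction of tokens whose predicted (word, tag) pair matches the gold pair
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

# Hypothetical model and gold-tagged sentence
small_model = {"the": "DT", "is": "VBZ"}
gold = [("the", "DT"), ("cat", "NN"), ("is", "VBZ"), ("asleep", "JJ")]

predicted = lookup_tag(small_model, [word for word, _ in gold])
score = token_accuracy(predicted, gold)
print(predicted)
print("Accuracy : ", score)
```

Here two of the four words are missing from the model and receive None, so the accuracy is 0.5 even though every tag the model does give is correct.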

Code #3 : Let’s try a backoff chain

Python3

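# Loading libraries
# (backoff_tagger is assumed to be a helper in tag_util.py, like word_tag_model)
from tag_util import backoff_tagger
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
from nltk.tag import DefaultTagger

default_tagger = DefaultTagger('NN')

# Likely-tag model, falling back to 'NN' for unknown words
likely_tagger = UnigramTagger(
        model = model, backoff = default_tagger)

# Trained taggers come first; likely_tagger is the last resort.
# train_data and test_data are the splits from Code #2
tag = backoff_tagger(train_data, [
        UnigramTagger, BigramTagger,
        TrigramTagger], backoff = likely_tagger)

print ("Accuracy : ", tag.evaluate(test_data))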

Output :

Accuracy : 0.8806820634578028

Note : The backoff chain increases the accuracy. We can improve this result further by placing the model-based UnigramTagger at the front of the chain, so that its tags override those of the trained taggers.

Code #4 : Manual Override of Trained Taggers

Python3

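# Loading libraries
# (backoff_tagger is assumed to be a helper in tag_util.py, like word_tag_model)
from tag_util import backoff_tagger
from nltk.tag import UnigramTagger, BigramTagger, TrigramTagger
from nltk.tag import DefaultTagger

default_tagger = DefaultTagger('NN')

# Trained chain, using train_data from Code #2
tagger = backoff_tagger(train_data, [
        UnigramTagger, BigramTagger,
        TrigramTagger], backoff = default_tagger)

# The model is consulted first and overrides the trained taggers
likely_tag = UnigramTagger(model = model, backoff = tagger)

print ("Accuracy : ", likely_tag.evaluate(test_data))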

Output :

Accuracy : 0.8824088063889488
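
Code #3 and Code #4 differ only in where the model sits in the chain. A tagger in a backoff chain answers first and hands over to its backoff only when it has no tag for a word, so the order decides which tagger wins a disagreement. A rough plain-Python sketch of that behaviour follows, with hypothetical lookup tables standing in for the trained taggers and the model; chain_tag is an illustrative helper, not an NLTK function.

```python
def chain_tag(word, lookups, default="NN"):
    # Ask each lookup in order; the first one that knows the word wins
    for lookup in lookups:
        tag = lookup.get(word)
        if tag is not None:
            return tag
    # Nothing knew the word: behave like DefaultTagger('NN')
    return default

# Hypothetical tag tables standing in for real taggers
trained = {"run": "NN", "dog": "NN"}   # what a trained tagger learned
likely = {"run": "VB", "the": "DT"}    # the likely-tag model

sample = ("run", "the", "cat")
# Code #3 order: trained taggers first, model as fallback
fallback_order = [chain_tag(w, [trained, likely]) for w in sample]
# Code #4 order: model first, so it overrides the trained taggers
override_order = [chain_tag(w, [likely, trained]) for w in sample]

print(fallback_order)
print(override_order)
```

Only "run", the one word both tables know and disagree on, changes tag between the two orders. That is consistent with the small gap between the two accuracy figures reported above.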
