Regular expression matching can be used to tag words. For example, numbers can be matched with \d to assign the tag CD (cardinal number), and known word patterns, such as the suffix "ing", can be matched in the same way.
Understanding the concept –
- RegexpTagger is a subclass of SequentialBackoffTagger. It can be placed before a DefaultTagger in a backoff chain, where it tags words that the n-gram tagger(s) missed.
- At initialization, the patterns are stored in the RegexpTagger instance. When choose_tag() is called, it iterates over the patterns and returns the tag of the first expression that matches the current word using re.match().
- So, if two expressions both match a word, the tag of the first one is returned without the second expression ever being tried.
- If the last pattern is a catch-all such as (r'.*', 'NN'), RegexpTagger can replace the DefaultTagger class, because every word that reaches that pattern is tagged NN.
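The first-match behaviour described in the list above can be illustrated with a minimal sketch that uses only the standard re module. The function name first_matching_tag is hypothetical and is not part of NLTK; it only mimics the lookup logic, assuming patterns are tried in list order.

```python
import re


def first_matching_tag(word, patterns):
    """Return the tag of the first pattern that matches `word`, else None.

    Hypothetical helper that imitates the pattern lookup described above;
    it is not NLTK code.
    """
    for regexp, tag in patterns:
        # re.match only anchors at the start of the string,
        # so suffix patterns need a leading '.*'
        if re.match(regexp, word):
            return tag
    return None


patterns = [
    (r'.*ing$', 'VBG'),  # gerunds
    (r'.*', 'NN'),       # catch-all, plays the role of a default tag
]

# 'running' matches both patterns, but only the first one is used
print(first_matching_tag('running', patterns))
# 'table' falls through to the catch-all
print(first_matching_tag('table', patterns))
```

Because the order decides the result, more specific patterns should come before more general ones, with any catch-all placed last.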
Code #1 : Python regular expression module and re syntax
Python3
patterns = [
    # numbers, e.g. 42
    (r'^\d+$', 'CD'),
    # gerunds, e.g. wondering
    (r'.*ing$', 'VBG'),
    # e.g. wonderment
    (r'.*ment$', 'NN'),
    # e.g. wonderful
    (r'.*ful$', 'JJ'),
]
The RegexpTagger class expects a list of 2-tuples:
-> the first element of each tuple is a regular expression
-> the second element is the tag
Code #2 : Using RegexpTagger
Python3
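# Loading libraries
# tag_util is assumed to be a local module holding the patterns list from Code #1
from tag_util import patterns
from nltk.tag import RegexpTagger
from nltk.corpus import treebank

# Sentences from index 3000 onward serve as the test set
test_data = treebank.tagged_sents()[3000:]

tagger = RegexpTagger(patterns)
print("Accuracy : ", tagger.evaluate(test_data))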
Output :
Accuracy : 0.037470321605870924
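A score this low is what one would expect when most words match none of the four patterns: such words are tagged None, which never equals the gold tag. As a rough sketch of the backoff idea mentioned earlier, the example below passes a DefaultTagger as the backoff of a RegexpTagger. The word list and the choice of 'NN' as the default tag are illustrative assumptions, and no corpus download is needed to run it.

```python
from nltk.tag import DefaultTagger, RegexpTagger

patterns = [
    (r'^\d+$', 'CD'),    # numbers
    (r'.*ing$', 'VBG'),  # gerunds
]

# Without a backoff, unmatched words are tagged None
regexp_only = RegexpTagger(patterns)

# With a backoff, unmatched words are handed to the DefaultTagger
with_backoff = RegexpTagger(patterns, backoff=DefaultTagger('NN'))

words = ['42', 'running', 'cat']
print(regexp_only.tag(words))
print(with_backoff.tag(words))
```

In the first result 'cat' is left with the tag None, while in the second it receives 'NN' from the backoff tagger.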
What is Affix tagging?
AffixTagger is a subclass of ContextTagger. For the AffixTagger class, the context is either the prefix or the suffix of a word, so the class learns tags from fixed-length substrings at the beginning or end of a word.
By default, it uses three-character suffixes (affix_length=-3). Words must be at least 5 characters long (the three-character affix plus a default min_stem_length of 2), and None is returned as the tag for any word shorter than five characters.
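To make the length rule concrete, here is a minimal pure-Python sketch of how such a context could be derived from a word. The function affix_context is a hypothetical stand-in, not the NLTK implementation; its parameters borrow the names affix_length and min_stem_length from AffixTagger, and it assumes a negative length means a suffix and a positive length means a prefix.

```python
def affix_context(word, affix_length=-3, min_stem_length=2):
    """Return the affix used as context, or None if the word is too short.

    Hypothetical helper for illustration only.
    Negative affix_length -> suffix, positive -> prefix.
    """
    min_word_length = abs(affix_length) + min_stem_length
    if len(word) < min_word_length:
        return None
    if affix_length > 0:
        return word[:affix_length]   # prefix
    return word[affix_length:]       # suffix


print(affix_context('publishing'))                  # suffix of a long word
print(affix_context('the'))                         # too short, no context
print(affix_context('publishing', affix_length=3))  # prefix instead
```

With the default settings, 'publishing' yields the suffix 'ing', while 'the' is shorter than five characters and yields None, so a tagger built on this context has nothing to look up for it.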
Code #3 : Understanding AffixTagger.
Python3
# Loading libraries
from nltk.corpus import treebank
from nltk.tag import AffixTagger

# Initializing training and testing sets
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

print("Train data : \n", train_data[1])

# Initializing tagger (default: three-character suffixes)
tag = AffixTagger(train_data)

# Testing
print("\nAccuracy : ", tag.evaluate(test_data))
Output :
Train data :
[('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')]
Accuracy : 0.27558817181092166
Code #4 : AffixTagger by specifying 3 character prefixes.
Python3
# Specifying 3 character prefixes
prefix_tag = AffixTagger(train_data, affix_length=3)

# Testing
accuracy = prefix_tag.evaluate(test_data)
print("Accuracy : ", accuracy)
Output :
Accuracy : 0.23587308439456076
Code #5 : AffixTagger by specifying 2-character suffixes
Python3
# Specifying 2 character suffixes
suffix_tag = AffixTagger(train_data, affix_length=-2)

# Testing
accuracy = suffix_tag.evaluate(test_data)
print("Accuracy : ", accuracy)
Output :
Accuracy : 0.31940427368875457