The ClassifierBasedTagger class learns from features, unlike most part-of-speech taggers. A ClassifierChunker class can therefore be created that learns from both the words and the part-of-speech tags, instead of only the part-of-speech tags as the TagChunker class does.
The (word, pos, iob) 3-tuples from tree2conlltags() are converted into ((word, pos), iob) 2-tuples using chunk_trees2train_chunks(). This keeps the data compatible with the 2-tuple format required for training a ClassifierBasedTagger class: the (word, pos) pair takes the place of the token, and the IOB tag takes the place of the tag to be learned.
Code #1 : Converting chunk trees into training data
# Loading Libraries
from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import ClassifierBasedTagger

def chunk_trees2train_chunks(chunk_sents):
    # Using tree2conlltags to get (word, pos, iob) 3-tuples
    tag_sents = [tree2conlltags(sent) for sent in chunk_sents]
    # Each 3-tuple is converted to a ((word, pos), iob) 2-tuple
    return [[((w, t), c) for (w, t, c) in sent] for sent in tag_sents]
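To make the reshaping concrete, here is a minimal sketch that applies the same list comprehension to a hand-written sentence. The sample tuples are an assumption: they only mimic the (word, pos, iob) shape that tree2conlltags() returns, so the snippet runs without NLTK or a corpus.

```python
# Hand-written sample in the (word, pos, iob) shape that tree2conlltags() returns.
# The sentence itself is illustrative, not taken from a corpus.
tag_sents = [[
    ("the", "DT", "B-NP"),
    ("cat", "NN", "I-NP"),
    ("sat", "VBD", "O"),
]]

# Same comprehension as in chunk_trees2train_chunks(): pair (word, pos) with iob.
train_chunks = [[((w, t), c) for (w, t, c) in sent] for sent in tag_sents]

print(train_chunks[0][0])  # (('the', 'DT'), 'B-NP')
```

Each training item now looks like a (token, tag) pair, where the "token" happens to be a (word, pos) tuple.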
Now, a feature detector function is needed to pass into ClassifierBasedTagger. Any feature detector function used with the ClassifierChunker class (defined next) should recognize that tokens are a list of (word, pos) tuples, and have the same function signature as prev_next_pos_iob(). To give the classifier as much information as we can, this feature set contains the current, previous, and next word and part-of-speech tag, along with the previous IOB tag.
Code #2 : detector function
def prev_next_pos_iob(tokens, index, history):
    word, pos = tokens[index]

    # Previous word, tag and IOB tag, or a <START> marker at the sentence start
    if index == 0:
        prevword, prevpos, previob = ('<START>',) * 3
    else:
        prevword, prevpos = tokens[index - 1]
        previob = history[index - 1]

    # Next word and tag, or an <END> marker at the sentence end
    if index == len(tokens) - 1:
        nextword, nextpos = ('<END>',) * 2
    else:
        nextword, nextpos = tokens[index + 1]

    feats = {
        'word': word,
        'pos': pos,
        'nextword': nextword,
        'nextpos': nextpos,
        'prevword': prevword,
        'prevpos': prevpos,
        'previob': previob
    }
    return feats
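As a quick check of what the classifier actually sees, the sketch below calls the detector on a three-word sentence. The detector is repeated in compact form so the snippet runs on its own; the sentence and the history lists are made-up illustrations of the IOB tags the tagger would already have assigned.

```python
def prev_next_pos_iob(tokens, index, history):
    # Same logic as the detector above, repeated so this snippet runs standalone.
    word, pos = tokens[index]
    if index == 0:
        prevword, prevpos, previob = ("<START>",) * 3
    else:
        prevword, prevpos = tokens[index - 1]
        previob = history[index - 1]
    if index == len(tokens) - 1:
        nextword, nextpos = ("<END>",) * 2
    else:
        nextword, nextpos = tokens[index + 1]
    return {"word": word, "pos": pos,
            "nextword": nextword, "nextpos": nextpos,
            "prevword": prevword, "prevpos": prevpos,
            "previob": previob}

tokens = [("the", "DT"), ("cat", "NN"), ("sat", "VBD")]  # illustrative sentence

first = prev_next_pos_iob(tokens, 0, [])                 # nothing tagged yet
middle = prev_next_pos_iob(tokens, 1, ["B-NP"])          # one IOB tag so far
last = prev_next_pos_iob(tokens, 2, ["B-NP", "I-NP"])    # two IOB tags so far

print(first["prevword"], middle["previob"], last["nextword"])
# <START> B-NP <END>
```

Note that history only ever holds the IOB tags for positions before index, which is why the detector can look back at the previous IOB tag but not ahead.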
Now, a ClassifierChunker class is needed. It uses an internal ClassifierBasedTagger, trained on sentences from chunk_trees2train_chunks() with features extracted by prev_next_pos_iob(). As a subclass of ChunkParserI, ClassifierChunker implements the parse() method, which converts the ((w, t), c) tuples produced by the internal tagger into Trees using conlltags2tree().
Code #3 :
class ClassifierChunker(ChunkParserI):
    def __init__(self, train_sents, feature_detector=prev_next_pos_iob, **kwargs):
        if not feature_detector:
            # Fall back to a feature_detector method, which a subclass would have to define
            feature_detector = self.feature_detector

        # Convert chunk trees into ((word, pos), iob) training sentences
        train_chunks = chunk_trees2train_chunks(train_sents)
        self.tagger = ClassifierBasedTagger(train=train_chunks,
                                            feature_detector=feature_detector, **kwargs)

    def parse(self, tagged_sent):
        if not tagged_sent:
            return None
        # Tag each (word, pos) token with an IOB tag
        chunks = self.tagger.tag(tagged_sent)
        # Flatten back to (word, pos, iob) 3-tuples and build a Tree
        return conlltags2tree([(w, t, c) for ((w, t), c) in chunks])
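The last step of parse() can be pictured with a small sketch that needs neither NLTK nor a trained model. The helper iob_to_chunks() is hypothetical and only a simplified stand-in for conlltags2tree(): it returns plain (label, tokens) tuples rather than a Tree, and it drops tokens tagged O. The tagged list is hand-written in the ((word, pos), iob) shape that the internal tagger returns.

```python
def iob_to_chunks(chunks):
    """Group ((word, pos), iob) pairs into (label, [(word, pos), ...]) spans.

    Hypothetical, simplified stand-in for conlltags2tree(): plain tuples
    instead of a Tree, and tokens tagged 'O' are left out.
    """
    spans = []
    current = None  # the chunk currently being extended, if any
    for (word, pos), iob in chunks:
        prefix, _, label = iob.partition("-")
        if prefix == "I" and current is not None and current[0] == label:
            current[1].append((word, pos))    # continue the open chunk
        elif prefix in ("B", "I"):
            current = (label, [(word, pos)])  # start a new chunk
            spans.append(current)
        else:
            current = None                    # 'O' closes any open chunk
    return spans

# Hand-written sample standing in for the output of self.tagger.tag(...)
tagged = [
    (("the", "DT"), "B-NP"),
    (("cat", "NN"), "I-NP"),
    (("sat", "VBD"), "O"),
    (("on", "IN"), "B-PP"),
    (("mats", "NNS"), "B-NP"),
]

result = iob_to_chunks(tagged)
print(result)
# [('NP', [('the', 'DT'), ('cat', 'NN')]), ('PP', [('on', 'IN')]), ('NP', [('mats', 'NNS')])]
```

A B- tag always opens a new chunk, a matching I- tag extends it, and O ends it, which is the grouping the Tree returned by parse() encodes as subtrees.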