Training a chunker is an alternative to manually specifying regular expression (regex) chunk patterns. Writing those patterns by hand is tedious, because finding exactly the right ones is a matter of trial and error. Existing corpus data can be used to train a chunker instead.
In the code below, we use the treebank_chunk corpus, which provides chunked sentences in the form of trees.
-> To train a tagger-based chunker, the TagChunker class uses the chunked sentences returned by the corpus's chunked_sents() method.
-> To extract a list of (pos, iob) tuples from a list of Trees, the TagChunker class uses a helper function, conll_tag_chunks().
These tuples are then used to train a tagger, which learns to assign IOB tags to part-of-speech tags.
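The conversion from a chunk tree to (pos, iob) tuples can be illustrated without NLTK. The sketch below is only a stand-in for what tree2conlltags() and conll_tag_chunks() do: the nested-list sentence format and the function names to_iob_triples and to_pos_iob_pairs are hypothetical and chosen for illustration.

```python
# Hypothetical stand-in for a chunk tree: each item is either a
# (word, pos) pair outside any chunk, or a (label, [(word, pos), ...]) chunk.
sentence = [
    ("NP", [("the", "DT"), ("book", "NN")]),
    ("has", "VBZ"),
    ("NP", [("many", "JJ"), ("chapters", "NNS")]),
]


def to_iob_triples(sentence):
    """Flatten the sentence into (word, pos, iob) triples."""
    triples = []
    for first, second in sentence:
        if isinstance(second, list):
            # A chunk: the first token gets B-<label>, the rest get I-<label>
            for i, (word, pos) in enumerate(second):
                prefix = "B-" if i == 0 else "I-"
                triples.append((word, pos, prefix + first))
        else:
            # A token outside any chunk gets the O tag
            triples.append((first, second, "O"))
    return triples


def to_pos_iob_pairs(triples):
    """Drop the words, keeping the (pos, iob) pairs a tagger can train on."""
    return [(pos, iob) for (word, pos, iob) in triples]


triples = to_iob_triples(sentence)
pairs = to_pos_iob_pairs(triples)
print(triples)
print(pairs)
```

In the pairs, the part-of-speech tag plays the role of the "word" and the IOB tag plays the role of the "tag", which is why an ordinary tagger can be trained on them.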
Code #1 : Let’s understand the TagChunker class used for training.
from nltk.chunk import ChunkParserI
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tag import UnigramTagger, BigramTagger
from tag_util import backoff_tagger


def conll_tag_chunks(chunk_data):
    # Convert each Tree into (word, pos, iob) triples, then drop the words
    tagged_data = [tree2conlltags(tree) for tree in chunk_data]
    return [[(t, c) for (w, t, c) in sent] for sent in tagged_data]


class TagChunker(ChunkParserI):
    def __init__(self, train_chunks,
                 tagger_classes=[UnigramTagger, BigramTagger]):
        train_data = conll_tag_chunks(train_chunks)
        self.tagger = backoff_tagger(train_data, tagger_classes)

    def parse(self, tagged_sent):
        if not tagged_sent:
            return None
        (words, tags) = zip(*tagged_sent)
        # The tagger assigns an IOB tag to each part-of-speech tag
        chunks = self.tagger.tag(tags)
        wtc = zip(words, chunks)
        return conlltags2tree([(w, t, c) for (w, (t, c)) in wtc])
Code #1 only defines the helper function and the TagChunker class, so it prints nothing when run.
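Code #1 imports backoff_tagger from a local tag_util module that is not shown here. A minimal sketch of what such a helper might look like follows, assuming it trains each tagger class in order and passes the previously trained tagger as the backoff. The signature is inferred from how Code #1 calls it, not taken from the original module.

```python
def backoff_tagger(train_sents, tagger_classes, backoff=None):
    """Train a chain of taggers; each one falls back to the one before it.

    Assumption: every class in tagger_classes accepts the training
    sentences as its first argument and a `backoff` keyword argument.
    """
    for cls in tagger_classes:
        # The tagger trained in the previous iteration becomes the backoff
        backoff = cls(train_sents, backoff=backoff)
    # The last tagger trained is the entry point of the chain
    return backoff
```

With tagger_classes set to [UnigramTagger, BigramTagger], this would return a BigramTagger that backs off to a UnigramTagger whenever it has not seen the current context.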
Code #2 : Using the Tag Chunker.
# loading libraries
from chunkers import TagChunker
from nltk.corpus import treebank_chunk

# data from treebank_chunk corpus
train_data = treebank_chunk.chunked_sents()[:3000]
test_data = treebank_chunk.chunked_sents()[3000:]

# initializing
chunker = TagChunker(train_data)
Code #3 : Evaluating the TagChunker
# testing
score = chunker.evaluate(test_data)

a = score.accuracy()
p = score.precision()
r = score.recall()

print("Accuracy of TagChunker : ", a)
print("\nPrecision of TagChunker : ", p)
print("\nRecall of TagChunker : ", r)
Output :
Accuracy of TagChunker : 0.9732039335251428

Precision of TagChunker : 0.9166534370535006

Recall of TagChunker : 0.9465573770491803
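To see what precision and recall mean for chunks, the sketch below computes them in the conventional way, treating each chunk as a (label, start, end) span. It is an illustration of the definitions only, not NLTK's own scoring code, and the function name chunk_precision_recall and the sample spans are made up for the example.

```python
def chunk_precision_recall(gold_chunks, guessed_chunks):
    """Return (precision, recall) for two collections of chunk spans."""
    gold = set(gold_chunks)
    guessed = set(guessed_chunks)
    # A guessed chunk is correct only if label and boundaries both match
    correct = gold & guessed
    precision = len(correct) / len(guessed) if guessed else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Chunks as (label, start, end) token spans; the values are illustrative
gold = [("NP", 0, 2), ("NP", 3, 5), ("NP", 6, 7)]
guessed = [("NP", 0, 2), ("NP", 3, 4)]

p, r = chunk_precision_recall(gold, guessed)
print("Precision :", p)
print("Recall :", r)
```

Here one of the two guessed chunks is correct, and one of the three gold chunks was found, so precision is higher than recall.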