The conll2000 corpus defines chunks using IOB tags:
- These tags specify where a chunk begins and ends, along with its type.
- A part-of-speech tagger can be trained on these IOB tags to power a ChunkerI subclass.
- First, a tree is obtained using the chunked_sents() method of the corpus, which is then transformed into a format usable by a part-of-speech tagger.
- conll_tag_chunks() uses tree2conlltags() to convert a sentence Tree into a list of 3-tuples of the form (word, pos, iob).
- pos: part-of-speech tag
- iob: IOB tag, for example B-NP or I-NP, indicating that a word is at the beginning of, or inside, a noun phrase, respectively.
- conlltags2tree() is the reverse of tree2conlltags().
- The 3-tuples are then converted into the 2-tuples that the tagger can recognize.
- The RegexpParser class uses part-of-speech tags for chunk patterns, so part-of-speech tags are used as if they were words to tag.
- The conll_tag_chunks() function takes a list of 3-tuples (word, pos, iob) and returns a list of 2-tuples of the form (pos, iob).
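The 3-tuple to 2-tuple step described above can be sketched in plain Python (a minimal illustration of the behavior, not NLTK's implementation; the function name to_tag_chunks is hypothetical):

```python
# Each sentence is a list of (word, pos, iob) 3-tuples, as produced by
# tree2conlltags(); the tagger only needs the (pos, iob) pairs.
def to_tag_chunks(conll_sents):
    return [[(pos, iob) for (word, pos, iob) in sent] for sent in conll_sents]

sent = [('the', 'DT', 'B-NP'), ('book', 'NN', 'I-NP')]
print(to_tag_chunks([sent]))
# [[('DT', 'B-NP'), ('NN', 'I-NP')]]
```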
Code #1: Converting a Tree to IOB tags and back
Python3
from nltk.chunk.util import tree2conlltags, conlltags2tree
from nltk.tree import Tree

# Convert each (word, pos, iob) 3-tuple into a (pos, iob) 2-tuple.
def conll_tag_chunks(chunk_sents):
    tagged_sents = [tree2conlltags(tree) for tree in chunk_sents]
    return [[(t, c) for (w, t, c) in sent] for sent in tagged_sents]

t = Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')])])
print("tree2conlltags : \n", tree2conlltags(t))

c = conlltags2tree([('the', 'DT', 'B-NP'), ('book', 'NN', 'I-NP')])
print("\nconlltags2tree : \n", c)

# Converting 3-tuples to 2-tuples.
print("\nconll_tag_chunks for tree : \n", conll_tag_chunks([t]))
Output :
tree2conlltags : 
 [('the', 'DT', 'B-NP'), ('book', 'NN', 'I-NP')]

conlltags2tree : 
 Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')])])

conll_tag_chunks for tree : 
 [[('DT', 'B-NP'), ('NN', 'I-NP')]]
Code #2: TagChunker class using the conll2000 corpus
Python3
from chunkers import TagChunker
from nltk.corpus import conll2000

# data
conll_train = conll2000.chunked_sents('train.txt')
conll_test = conll2000.chunked_sents('test.txt')

# initializing the chunker
chunker = TagChunker(conll_train)

# testing
score = chunker.evaluate(conll_test)
a = score.accuracy()
p = score.precision()
r = score.recall()

print("Accuracy of TagChunker : ", a)
print("\nPrecision of TagChunker : ", p)
print("\nRecall of TagChunker : ", r)
Output :
Accuracy of TagChunker :  0.8950545623403762

Precision of TagChunker :  0.8114841974355675

Recall of TagChunker :  0.8644191676944863
Note: The scores on conll2000 are not as good as those on treebank_chunk, but conll2000 is a much larger corpus.
Code #3: TagChunker using the UnigramTagger class
Python3
# loading libraries
from chunkers import TagChunker
from nltk.tag import UnigramTagger

# train_chunks and test_chunks are the chunked training and testing
# sentences prepared earlier (e.g. from the treebank_chunk corpus)
uni_chunker = TagChunker(train_chunks, tagger_classes=[UnigramTagger])
score = uni_chunker.evaluate(test_chunks)
a = score.accuracy()
print("Accuracy of TagChunker : ", a)
Output :
Accuracy of TagChunker :  0.9674925924335466
The tagger_classes argument is passed directly to the backoff_tagger() function, which means they must be subclasses of SequentialBackoffTagger. In testing, the default of tagger_classes=[UnigramTagger, BigramTagger] generally produces the best results, but this can vary across corpora.
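The backoff idea behind tagger_classes can be illustrated with a small pure-Python sketch (the LookupTagger class and the tables below are hypothetical, not NLTK's API): each tagger labels the tags it knows and defers to its backoff tagger for the rest.

```python
# Minimal sketch of sequential backoff tagging: each tagger consults its
# own lookup table and falls back to the next tagger for unknown inputs.
class LookupTagger:
    def __init__(self, table, backoff=None):
        self.table = table      # maps a POS tag to an IOB tag
        self.backoff = backoff  # tagger to try when the POS tag is unknown

    def tag_one(self, pos):
        if pos in self.table:
            return self.table[pos]
        if self.backoff is not None:
            return self.backoff.tag_one(pos)
        return 'O'  # outside any chunk when no tagger knows the tag

    def tag(self, pos_seq):
        return [(pos, self.tag_one(pos)) for pos in pos_seq]

# The backoff tagger handles determiners; the primary tagger adds nouns.
default = LookupTagger({'DT': 'B-NP'})
chunk_tagger = LookupTagger({'NN': 'I-NP'}, backoff=default)

print(chunk_tagger.tag(['DT', 'NN', 'VB']))
# [('DT', 'B-NP'), ('NN', 'I-NP'), ('VB', 'O')]
```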