Saturday, September 6, 2025

NLP | IOB tags

What are Chunks? 
Chunks are groups of words, and the kinds of words a chunk may contain are defined using part-of-speech tags. One can also define patterns for words that must not be part of a chunk; such words are known as chinks.
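To make the chunk/chink distinction concrete, here is a minimal sketch using NLTK's RegexpParser, which the rest of this article does not use. The grammar and the sample sentence are illustrative assumptions: the first rule chunks everything into one noun phrase, and the second rule chinks (removes) verbs and prepositions from it.

```python
from nltk import RegexpParser

# Illustrative grammar (an assumption, not taken from the article):
# first put every token into an NP chunk, then chink VBD and IN tokens.
grammar = r"""
NP:
    {<.*>+}        # chunk every tag sequence
    }<VBD|IN>+{    # chink runs of VBD and IN
"""

# Hand-tagged sample sentence, so no tagger model is required.
sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]

tree = RegexpParser(grammar).parse(sentence)
print(tree)
```

The chinked words "barked" and "at" end up as plain tokens between two NP subtrees.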
 

What are IOB tags? 
IOB is a tagging format for chunks. The tags are similar to part-of-speech tags, but they mark whether a word is inside, outside, or at the beginning of a chunk. The format is not limited to noun phrases; several chunk types can be tagged in the same file. 
Example: an excerpt from the conll2000 corpus. Each word appears on its own line, followed by its part-of-speech tag and then its IOB tag:
 

Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
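
The layout of the excerpt above (one token per line, three whitespace-separated columns) can be read with a few lines of plain Python. This is only a sketch to show the structure of the format; parse_iob is a hypothetical helper, not part of NLTK, and it assumes that sentences are separated by blank lines.

```python
from typing import List, Tuple

Token = Tuple[str, str, str]  # (word, pos, iob)

SAMPLE = """\
Mr. NNP B-NP
Meador NNP I-NP
had VBD B-VP
been VBN I-VP
executive JJ B-NP
vice NN I-NP
president NN I-NP
of IN B-PP
Balcor NNP B-NP
"""


def parse_iob(text: str) -> List[List[Token]]:
    """Split IOB text into sentences of (word, pos, iob) triples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        word, pos, iob = line.split()
        current.append((word, pos, iob))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_iob(SAMPLE)
print(sentences[0][:2])
```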

What does it mean? 
B-NP: the beginning of a noun phrase. 
I-NP: the word is inside the current noun phrase. 
O: the word is outside of any chunk, as with sentence-final punctuation. 
B-VP and I-VP: the beginning and inside of a verb phrase; likewise, B-PP marks the beginning of a prepositional phrase.
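
The tag meanings above can be illustrated by grouping tagged tokens back into chunks. The following is a minimal sketch in plain Python; group_chunks is a hypothetical helper written for this illustration, and it labels tokens outside any chunk with "O".

```python
def group_chunks(tokens):
    """Group (word, pos, iob) triples into (chunk_type, [words]) pairs."""
    chunks = []
    for word, _pos, iob in tokens:
        if iob == "O":
            # Outside any chunk: keep the token on its own.
            chunks.append(("O", [word]))
            continue
        prefix, chunk_type = iob.split("-", 1)
        # An I- tag continues the previous chunk only if the types match.
        continues = prefix == "I" and chunks and chunks[-1][0] == chunk_type
        if continues:
            chunks[-1][1].append(word)
        else:
            # A B- tag (or a stray I- tag) opens a new chunk.
            chunks.append((chunk_type, [word]))
    return chunks


tokens = [
    ("Mr.", "NNP", "B-NP"), ("Meador", "NNP", "I-NP"),
    ("had", "VBD", "B-VP"), ("been", "VBN", "I-VP"),
    ("executive", "JJ", "B-NP"), ("vice", "NN", "I-NP"),
    ("president", "NN", "I-NP"), ("of", "IN", "B-PP"),
    ("Balcor", "NNP", "B-NP"), (".", ".", "O"),
]

chunks = group_chunks(tokens)
for chunk_type, words in chunks:
    print(chunk_type, words)
```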
Code #1: How it works – chunking words with IOB tags. 
 

Python3




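# Loading the libraries
from nltk.corpus.reader import ConllChunkCorpusReader

# Initializing: read every .iob file in the current directory
# and keep the NP, VP and PP chunk types
reader = ConllChunkCorpusReader(
        '.', r'.*\.iob', ('NP', 'VP', 'PP'))

# Chunks as Tree objects
print(reader.chunked_words())

# (word, pos, iob) triples
print(reader.iob_words())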


Output : 
 

[Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]), Tree('VP', [('had', 'VBD'), 
('been', 'VBN')]), ...]

[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ...]
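
The reader only finds data if a file matching the .iob pattern exists in the corpus directory. As a self-contained sketch, the excerpt can be written to a temporary directory first. The file name sample.iob and the use of a temporary directory are assumptions made for this illustration.

```python
import os
import tempfile

from nltk.corpus.reader import ConllChunkCorpusReader

# The conll2000 excerpt, plus the final period that is tagged O.
SAMPLE = (
    "Mr. NNP B-NP\n"
    "Meador NNP I-NP\n"
    "had VBD B-VP\n"
    "been VBN I-VP\n"
    "executive JJ B-NP\n"
    "vice NN I-NP\n"
    "president NN I-NP\n"
    "of IN B-PP\n"
    "Balcor NNP B-NP\n"
    ". . O\n"
)

# Hypothetical location: a throwaway directory with one .iob file.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, "sample.iob"), "w", encoding="utf-8") as fh:
    fh.write(SAMPLE)

reader = ConllChunkCorpusReader(corpus_dir, r".*\.iob", ("NP", "VP", "PP"))

# Materialize the lazy sequences so they can be inspected directly.
iob_words = list(reader.iob_words())
chunked_words = list(reader.chunked_words())

print(iob_words[:2])
print(chunked_words[0])
```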

Code #2: How it works – chunking sentence with IOB tags. 
 

Python3




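# Loading the libraries
from nltk.corpus.reader import ConllChunkCorpusReader

# Initializing: read every .iob file in the current directory
# and keep the NP, VP and PP chunk types
reader = ConllChunkCorpusReader(
        '.', r'.*\.iob', ('NP', 'VP', 'PP'))

# One Tree per sentence
print(reader.chunked_sents())

# One list of (word, pos, iob) triples per sentence
print(reader.iob_sents())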


Output : 
 

[Tree('S', [Tree('NP', [('Mr.', 'NNP'), ('Meador', 'NNP')]),
Tree('VP', [('had', 'VBD'), ('been', 'VBN')]), 
Tree('NP', [('executive', 'JJ'), ('vice', 'NN'), ('president', 'NN')]),
Tree('PP', [('of', 'IN')]), Tree('NP', [('Balcor', 'NNP')]), ('.', '.')])]

[[('Mr.', 'NNP', 'B-NP'), ('Meador', 'NNP', 'I-NP'), ('had', 'VBD', 'B-VP'), 
('been', 'VBN', 'I-VP'), ('executive', 'JJ', 'B-NP'), ('vice', 'NN', 'I-NP'), 
('president', 'NN', 'I-NP'), ('of', 'IN', 'B-PP'), ('Balcor', 'NNP', 'B-NP'), 
('.', '.', 'O')]]

Let’s understand the code above : 
 

  • The ConllChunkCorpusReader class reads a corpus in IOB format.
  • The format has no paragraph separation; sentences are separated by a blank line, so the para_* methods are not available.
  • The third argument to ConllChunkCorpusReader is a tuple or list of the chunk types in the file, such as ('NP', 'VP', 'PP').
  • The iob_words() and iob_sents() methods return lists of 3-tuples of the form (word, pos, iob); iob_sents() groups them into one list per sentence.
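
The tree view and the IOB view carry the same information. As a sketch of that equivalence, NLTK's conlltags2tree and tree2conlltags helpers from nltk.chunk convert between the two; they are not used elsewhere in this article, and the sentence below is typed in by hand rather than read from a corpus file.

```python
from nltk.chunk import conlltags2tree, tree2conlltags

# One sentence as (word, pos, iob) triples, in the shape of iob_sents()[0].
iob_sent = [
    ("Mr.", "NNP", "B-NP"), ("Meador", "NNP", "I-NP"),
    ("had", "VBD", "B-VP"), ("been", "VBN", "I-VP"),
    ("executive", "JJ", "B-NP"), ("vice", "NN", "I-NP"),
    ("president", "NN", "I-NP"), ("of", "IN", "B-PP"),
    ("Balcor", "NNP", "B-NP"), (".", ".", "O"),
]

# IOB triples -> chunk tree rooted at S.
tree = conlltags2tree(iob_sent)
print(tree)

# Chunk tree -> IOB triples again.
round_trip = tree2conlltags(tree)
print(round_trip == iob_sent)
```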

Code #3: Tree Leaves – i.e. the tagged tokens 
 

Python3




# Loading the libraries
from nltk.corpus.reader import ConllChunkCorpusReader

# Initializing
reader = ConllChunkCorpusReader(
        '.', r'.*\.iob', ('NP', 'VP', 'PP'))

# Leaves of the first chunk
print(reader.chunked_words()[0].leaves())

# Leaves of the first sentence
print(reader.chunked_sents()[0].leaves())

# chunked_paras() is not defined on ConllChunkCorpusReader, because IOB
# files have no paragraph breaks; it belongs to ChunkedCorpusReader.
# reader.chunked_paras()[0][0].leaves()


Output (for a corpus whose first sentence is "Earlier staff-reduction moves have trimmed about 300 jobs, the spokesman said."): 
 

[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS')]

[('Earlier', 'JJR'), ('staff-reduction', 'NN'), ('moves', 'NNS'),
('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), ('300', 'CD'),
('jobs', 'NNS'), (',', ','), ('the', 'DT'), ('spokesman', 'NN'),
('said', 'VBD'), ('.', '.')]

 
