Sunday, May 17, 2026
HomeLanguagesNLP | Chunking using Corpus Reader

NLP | Chunking using Corpus Reader

What are Chunks? 
These are made up of words and the kinds of words are defined using the part-of-speech tags. One can even define a pattern or words that can’t be a part of chuck and such words are known as chinks. A ChunkRule class specifies what words or patterns to include and exclude in a chunk.
How it works : 
 

  • The ChunkedCorpusReader class works similar to the TaggedCorpusReader for getting tagged tokens, plus it also provides three new methods for getting chunks.
  • An instance of nltk.tree.Tree represents each chunk.
  • Noun phrase trees look like Tree(‘NP’, […]) where as Sentence level trees look like Tree(‘S’, […]).
  • A list of sentence trees, with each noun phrase as a subtree of the sentence is obtained in n chunked_sents()
  • A list of noun phrase trees alongside tagged tokens of words that were not in a chunk is obtained in chunked_words().

Diagram listing the major methods: 
 

Code #1 : Creating a ChunkedCorpusReader for words 

Python3




# Using ChunkedCorpusReader
from nltk.corpus.reader import ChunkedCorpusReader
 
# initializing
x = ChunkedCorpusReader('.', r'.*\.chunk')
 
words = x.chunked_words()
print ("Words : \n", words)


Output : 

Words : 
[Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), 
('moves', 'NNS')]), ('have', 'VBP'), ...]

Code #2 : For sentence 

Python3




Chunked Sentence = x.chunked_sents()
print ("Chunked Sentence : \n", tagged_sent)


Output : 

Chunked Sentence : 
[Tree('S', [Tree('NP', [('Earlier', 'JJR'), ('staff-reduction', 'NN'), 
('moves', 'NNS')]), ('have', 'VBP'), ('trimmed', 'VBN'), ('about', 'IN'), 
Tree('NP', [('300', 'CD'), ('jobs', 'NNS')]), (', ', ', '),
Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]), ('said', 'VBD'), ('.', '.')])]

Code #3 : For paragraphs 

Python3




para = x.chunked_paras()()
print ("para : \n", para)


Output : 

[[Tree('S', [Tree('NP', [('Earlier', 'JJR'), ('staff-reduction',
'NN'), ('moves', 'NNS')]), ('have', 'VBP'), ('trimmed', 'VBN'),
('about', 'IN'), 
Tree('NP', [('300', 'CD'), ('jobs', 'NNS')]), (', ', ', '), 
Tree('NP', [('the', 'DT'), ('spokesman', 'NN')]), ('said', 'VBD'), ('.', '.')])]] 
Dominic
Dominichttp://wardslaus.com
infosec,malicious & dos attacks generator, boot rom exploit philanthropist , wild hacker , game developer,
RELATED ARTICLES

1 COMMENT

Most Popular

Dominic
32514 POSTS0 COMMENTS
Milvus
131 POSTS0 COMMENTS
Nango Kala
6892 POSTS0 COMMENTS
Nicole Veronica
12012 POSTS0 COMMENTS
Nokonwaba Nkukhwana
12107 POSTS0 COMMENTS
Shaida Kate Naidoo
7016 POSTS0 COMMENTS
Ted Musemwa
7262 POSTS0 COMMENTS
Thapelo Manthata
6975 POSTS0 COMMENTS
Umr Jansen
6963 POSTS0 COMMENTS