Chunk extraction, or partial parsing, is the process of extracting short, meaningful phrases from a sentence that has been tagged with part-of-speech (POS) tags.
Chunks are made up of words, and the kinds of words are defined using the part-of-speech tags. One can also define patterns for words that must not be part of a chunk; such words are known as chinks. A ChunkRule class specifies which words or patterns to include in a chunk, and a ChinkRule class specifies which to exclude.
Defining Chunk patterns :
Chunk patterns are ordinary regular expressions that have been modified to match sequences of part-of-speech tags. Angle brackets specify an individual tag; for example, <NN> matches a noun tag. Multiple tags can be defined in the same way, one after another.
Code #1 : Converting chunks to RegEx Pattern.
Python3
# Loading the library
from nltk.chunk.regexp import tag_pattern2re_pattern

# Convert a chunk pattern to a RegEx pattern
print("Chunk Pattern : ", tag_pattern2re_pattern('<DT>?<NN.*>+'))
Output :
Chunk Pattern :  (<(DT)>)?(<(NN[^\{\}<>]*)>)+
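To see what the translated pattern accepts, one can try it against tag sequences written in the same angle-bracket form. The sketch below is only an illustration using Python's standard `re` module; the helper `matches` is hypothetical and not part of NLTK, and the chunker itself applies the pattern through its own internal machinery.

```python
import re
from nltk.chunk.regexp import tag_pattern2re_pattern

# Translate the chunk pattern into an ordinary regular expression.
regex = re.compile(tag_pattern2re_pattern("<DT>?<NN.*>+"))

def matches(tags):
    """Hypothetical helper: does the whole tag sequence fit the pattern?"""
    tag_string = "".join(f"<{tag}>" for tag in tags)
    return regex.fullmatch(tag_string) is not None

for tags in (["DT", "NN"], ["NN", "NNS"], ["DT"], ["JJ", "NN"]):
    print(tags, matches(tags))
```

An optional determiner followed by one or more noun tags fits the pattern; a lone determiner or a sequence starting with an adjective does not.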
Curly braces specify a chunk pattern, as in {<DT><NN>}, and flipping the braces specifies a chink pattern, as in }<VB>{. For a particular phrase type, these rules (a chunk pattern and a chink pattern) can be combined into a grammar.
Code #2 : Parsing the sentence with RegexpParser.
Note: To obtain a tree representation of the parsed chunks and chinks, install the third-party `svgling` helper library.
Python3
from nltk.chunk import RegexpParser

# Define the grammar: one chunk rule and one chink rule for NP
chunker = RegexpParser(r'''
NP:
    {<DT><NN.*><.*>*<NN.*>}
    }<VB.*>{
''')

chunker.parse([('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'),
               ('many', 'JJ'), ('chapters', 'NNS')])
Output :
Tree('S', [Tree('NP', [('the', 'DT'), ('book', 'NN')]), ('has', 'VBZ'), Tree('NP', [('many', 'JJ'), ('chapters', 'NNS')])])
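Once the tree is built, the chunks usually need to be pulled out as plain phrases. The following is a minimal sketch of one way to do that, reusing the grammar and sentence from Code #2; the variable names are illustrative.

```python
from nltk.chunk import RegexpParser

# Same grammar as above: chunk from a determiner to the last noun,
# then chink any verb out of the chunk.
grammar = r'''
NP:
    {<DT><NN.*><.*>*<NN.*>}
    }<VB.*>{
'''
chunker = RegexpParser(grammar)

tree = chunker.parse([('the', 'DT'), ('book', 'NN'), ('has', 'VBZ'),
                      ('many', 'JJ'), ('chapters', 'NNS')])

# Walk the tree, keep only NP subtrees, and join their words into phrases.
phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')
]
print(phrases)  # ['the book', 'many chapters']
```

Filtering on the subtree label skips the root S node, so only the noun-phrase chunks are returned.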