
Text Preprocessing in Python | Set 2

Prerequisite: Introduction to NLP, Text Preprocessing in Python | Set 1

In the previous post, we saw the basic preprocessing steps when working with textual data. In this article, we will look at some more advanced text preprocessing techniques. We can use these techniques to gain more insights into the data that we have.

Let’s import the necessary libraries.




# import the necessary libraries
import nltk
import string
import re
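
The code examples that follow also rely on a few NLTK data packages (the tokenizer models, the default POS tagger and the named entity chunker). If they are not already present, a one-time download like the sketch below should make the snippets runnable; the resource names assume a standard NLTK installation.


# one-time download of the NLTK data used in this article
# (resource names assume a standard NLTK installation)
nltk.download('punkt')                       # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # default tagger used by pos_tag
nltk.download('maxent_ne_chunker')           # named entity chunker used by ne_chunk
nltk.download('words')                       # word list required by the NE chunker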


Part of Speech Tagging:

The part of speech explains how a word is used in a sentence. In a sentence, a word can have different contexts and semantic meanings. Basic natural language processing models like bag-of-words fail to capture these relations between words. Hence, we use part of speech tagging to label each word with its part of speech tag based on its context in the data. It is also used to extract relationships between words.




from nltk.tokenize import word_tokenize
from nltk import pos_tag
  
# convert text into word_tokens with their tags
def pos_tagging(text):
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)
  
pos_tagging('You just gave me a scare')


Example:

Input: ‘You just gave me a scare’
Output: [(‘You’, ‘PRP’), (‘just’, ‘RB’), (‘gave’, ‘VBD’), (‘me’, ‘PRP’),
(‘a’, ‘DT’), (‘scare’, ‘NN’)]

In the given example, PRP stands for personal pronoun, RB for adverb, VBD for verb past tense, DT for determiner and NN for noun. We can get the details of all the part of speech tags using the Penn Treebank tagset.




# download the tagset 
nltk.download('tagsets')
  
# extract information about the tag
nltk.help.upenn_tagset('NN')


Example:

Input: ‘NN’
Output: NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
machinist …

Chunking:

Chunking is the process of extracting phrases from unstructured text and giving more structure to it. It is also known as shallow parsing. It is done on top of part of speech tagging. It groups words into “chunks”, mainly noun phrases. Chunking is done using regular expressions.




from nltk.tokenize import word_tokenize 
from nltk import pos_tag
  
# define chunking function with text and regular
# expression representing grammar as parameter
def chunking(text, grammar):
    word_tokens = word_tokenize(text)
  
    # label words with part of speech
    word_pos = pos_tag(word_tokens)
  
    # create a chunk parser using grammar
    chunkParser = nltk.RegexpParser(grammar)
  
    # test it on the list of word tokens with tagged pos
    tree = chunkParser.parse(word_pos)
      
    for subtree in tree.subtrees():
        print(subtree)
    tree.draw()
      
sentence = 'the little yellow bird is flying in the sky'
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunking(sentence, grammar)


In the given example, the grammar is defined using a simple regular expression rule. This rule says that an NP (Noun Phrase) chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN).

Libraries like spaCy and TextBlob are better suited for chunking; a minimal spaCy sketch is shown after the example below.

Example:

Input: ‘the little yellow bird is flying in the sky’
Output:
(S
(NP the/DT little/JJ yellow/JJ bird/NN)
is/VBZ
flying/VBG
in/IN
(NP the/DT sky/NN))
(NP the/DT little/JJ yellow/JJ bird/NN)
(NP the/DT sky/NN)
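
As a rough sketch of the spaCy approach (assuming spaCy and its small English model en_core_web_sm are installed), noun phrase chunks can be read directly from a parsed document:


# a minimal noun phrase chunking sketch with spaCy
# (assumes: pip install spacy, then python -m spacy download en_core_web_sm)
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('the little yellow bird is flying in the sky')

# spaCy exposes noun phrase chunks directly on the parsed document
for chunk in doc.noun_chunks:
    print(chunk.text)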

Named Entity Recognition:

Named Entity Recognition is used to extract information from unstructured text. It is used to classify the entities present in a text into categories like person, organization, event, place, etc. It gives us detailed knowledge about the text and the relationships between the different entities.




from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
  
def named_entity_recognition(text):
    # tokenize the text
    word_tokens = word_tokenize(text)
  
    # part of speech tagging of words
    word_pos = pos_tag(word_tokens)
  
    # tree of word entities
    print(ne_chunk(word_pos))
  
text = 'Bill works for Lazyroar so he went to Delhi for a meetup.'
named_entity_recognition(text)


Example:

Input: ‘Bill works for Lazyroar so he went to Delhi for a meetup.’
Output:
(S
(PERSON Bill/NNP)
works/VBZ
for/IN
(ORGANIZATION Lazyroar/NNP)
so/RB
he/PRP
went/VBD
to/TO
(GPE Delhi/NNP)
for/IN
a/DT
meetup/NN
./.)
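
If we only need the recognized entities rather than the full tree, one possible sketch (the helper extract_entities below is our own illustration, not part of NLTK) is to walk the labelled subtrees returned by ne_chunk:


from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
from nltk.tree import Tree

# illustrative helper (not part of NLTK): collect (entity, label) pairs
def extract_entities(text):
    tree = ne_chunk(pos_tag(word_tokenize(text)))
    entities = []
    for subtree in tree:
        # named entities appear as labelled subtrees;
        # ordinary tokens remain plain (word, tag) tuples
        if isinstance(subtree, Tree):
            entity = ' '.join(token for token, tag in subtree.leaves())
            entities.append((entity, subtree.label()))
    return entities

extract_entities('Bill works for Lazyroar so he went to Delhi for a meetup.')
# expected, based on the tree above:
# [('Bill', 'PERSON'), ('Lazyroar', 'ORGANIZATION'), ('Delhi', 'GPE')]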

Last Updated: 29 May, 2019