Natural Language Processing (NLP) is a subfield of computer science, artificial intelligence, information engineering, and human-computer interaction. The field focuses on programming computers to process and analyze large amounts of natural language data. This is a difficult task because reading and understanding language is far more complex than it seems at first glance. Tokenization is the process of splitting a string or text into a list of tokens. One can think of a token as a part of a larger whole: a word is a token in a sentence, and a sentence is a token in a paragraph. Key points of the article –
- Tokenizing text into sentences
- Tokenizing sentences into words
- Tokenizing sentences using regular expressions
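All the snippets below assume that NLTK is installed and that the pre-trained Punkt models used by the sentence tokenizer have been downloaded. A minimal setup sketch (the install command is shown only as a comment and assumes a standard pip installation):
Python3
# One-time setup, assuming NLTK has been installed (e.g. pip install nltk)
import nltk

# Download the pre-trained Punkt sentence tokenizer models
nltk.download('punkt')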
Code #1: Sentence Tokenization – splitting a paragraph into sentences
Python3
from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to Lazyroar. You are studying NLP article"
sent_tokenize(text)
Output :
['Hello everyone.', 'Welcome to Lazyroar.', 'You are studying NLP article']
How sent_tokenize works? The sent_tokenize function uses an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module, which has already been trained and therefore knows which characters and punctuation mark the beginning and end of a sentence. Code #2: PunktSentenceTokenizer – when we have huge chunks of data, it is efficient to load the tokenizer once and reuse it (a small reuse sketch follows this example's output).
Python3
import nltk.data

# Loading PunktSentenceTokenizer using the English pickle file
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')

# Same text as in Code #1
text = "Hello everyone. Welcome to Lazyroar. You are studying NLP article"
tokenizer.tokenize(text)
Output :
['Hello everyone.', 'Welcome to Lazyroar.', 'You are studying NLP article']
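To see why pre-loading pays off on large amounts of text, here is a small sketch that loads the pickle once and reuses the same tokenizer object across several documents (the documents list is only a hypothetical placeholder):
Python3
import nltk.data

# Load the tokenizer once ...
tokenizer = nltk.data.load('tokenizers/punkt/PY3/english.pickle')

# ... and reuse it for every document instead of reloading it each time
documents = [
    "Hello everyone. Welcome to Lazyroar.",
    "You are studying NLP article. Tokenization is the first step.",
]

for doc in documents:
    print(tokenizer.tokenize(doc))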
Code #3: Tokenizing sentences in a different language – one can also tokenize sentences in other languages by loading a pickle file other than the English one.
Python3
import nltk.data

spanish_tokenizer = nltk.data.load('tokenizers/punkt/PY3/spanish.pickle')

text = 'Hola amigo. Estoy bien.'
spanish_tokenizer.tokenize(text)
Output :
['Hola amigo.', 'Estoy bien.']
Code #4: Word Tokenization – splitting a sentence into words.
Python3
from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to Lazyroar."
word_tokenize(text)
Output :
['Hello', 'everyone', '.', 'Welcome', 'to', 'Lazyroar', '.']
How word_tokenize works? The word_tokenize() function is a wrapper that calls tokenize() on an instance of the TreebankWordTokenizer class. Code #5: Using TreebankWordTokenizer
Python3
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

# Same text as in Code #4
text = "Hello everyone. Welcome to Lazyroar."
tokenizer.tokenize(text)
Output :
['Hello', 'everyone.', 'Welcome', 'to', 'Lazyroar', '.']
These tokenizers work by separating words using punctuation and spaces. As the outputs above show, they do not discard the punctuation, which lets the user decide what to do with it during pre-processing (a small filtering sketch appears after Code #7's output below). Code #6: PunktWordTokenizer – it does not separate the punctuation from the words (note that this tokenizer is no longer exposed in recent NLTK releases).
Python3
# Note: PunktWordTokenizer is not available in recent NLTK releases, so this
# snippet may only run on older versions of the library.
from nltk.tokenize import PunktWordTokenizer

tokenizer = PunktWordTokenizer()
tokenizer.tokenize("Let's see how it's working.")
Output :
['Let', "'s", 'see', 'how', 'it', "'s", 'working', '.']
Code #7: WordPunctTokenizer – it separates the punctuation from the words.
Python3
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokenizer.tokenize("Let's see how it's working.")
Output :
['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']
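Since the word tokenizers above keep punctuation as separate tokens, a common pre-processing choice is to drop those tokens afterwards. A minimal sketch of one possible filter (dropping tokens that consist only of punctuation characters; the rule is an illustrative choice, not part of NLTK):
Python3
import string

from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to Lazyroar."
tokens = word_tokenize(text)

# Keep tokens that contain at least one non-punctuation character
words_only = [tok for tok in tokens
              if any(ch not in string.punctuation for ch in tok)]
print(words_only)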
Code #8: Using a Regular Expression – RegexpTokenizer
Python3
from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"[\w']+")

text = "Let's see how it's working."
tokenizer.tokenize(text)
Output :
["Let's", 'see', 'how', "it's", 'working']
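RegexpTokenizer can also work the other way around: instead of describing what a token looks like, the pattern can describe the gaps between tokens by passing gaps=True. A brief sketch using the same sentence:
Python3
from nltk.tokenize import RegexpTokenizer

text = "Let's see how it's working."

# Here the pattern matches the separators (whitespace), not the tokens themselves
gap_tokenizer = RegexpTokenizer(r'\s+', gaps=True)
gap_tokenizer.tokenize(text)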
Code #9: Using a Regular Expression – regexp_tokenize
Python3
from nltk.tokenize import regexp_tokenize

text = "Let's see how it's working."
regexp_tokenize(text, r"[\w']+")
Output :
["Let's", 'see', 'how', "it's", 'working']