Python – Tokenize text using Enchant

20 June 2025

Enchant is a Python module used to check the spelling of words and to suggest corrections for misspelled ones. It checks whether a word exists in a given dictionary.

Enchant also provides the enchant.tokenize module to tokenize text. Tokenizing involves splitting words from the body of the text.

Some terms that will be used frequently:

Corpus – A body of text (singular). The plural is corpora.
Lexicon – Words and their meanings.
Token – Each “entity” produced when text is split up according to some rules. For example, each word is a token when a sentence is “tokenized” into words.
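The three terms can be illustrated with a few lines of plain Python. This is only a sketch: the sample sentence, the whitespace splitting rule, and the glosses in `lexicon` are assumptions made up for illustration and have nothing to do with how Enchant splits text.

```python
# A tiny corpus: one body of text (hypothetical sample sentence).
corpus = "the cat sat on the mat"

# Tokens: every entity produced by the splitting rule (here, whitespace).
tokens = corpus.split()

# The distinct tokens that occur in the corpus.
vocabulary = sorted(set(tokens))

# Lexicon: words paired with their meanings (hypothetical glosses).
lexicon = {
    "cat": "a small domesticated feline",
    "mat": "a piece of floor covering",
}

print(len(tokens))
print(vocabulary)
```

The corpus contains repeated tokens ("the" occurs twice), which is why the list of distinct tokens is shorter than the token list.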

We will use get_tokenizer() to tokenize the text. It takes a language code as input and returns the appropriate tokenizer class. Instantiating this class with some text returns an iterator that yields the words contained in that text.
Each item produced by the tokenizer is a tuple of the form (WORD, POS), where WORD is the tokenized word and POS is the position (character offset) in the string at which that word starts. Note that POS here means position, not part of speech.
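To make the (WORD, POS) shape concrete, here is a minimal pure-Python sketch that mimics it with a regular expression. It is not Enchant's implementation: the function name `simple_tokenize` and the letters-only pattern are assumptions chosen for illustration, and the real tokenizer applies language-specific rules.

```python
import re


def simple_tokenize(text):
    """Yield (word, pos) tuples; pos is the character offset of word in text."""
    # Assumption: a "word" is a run of ASCII letters. Enchant's rules are richer.
    for match in re.finditer(r"[A-Za-z]+", text):
        yield match.group(), match.start()


sample = "Natural language processing (NLP) is a field"
tokens = list(simple_tokenize(sample))
print(tokens)

# Each position indexes back into the original string.
for word, pos in tokens:
    assert sample[pos:pos + len(word)] == word
```

Because the offset points at the first character of each word, slicing the original string from that offset recovers the word, which is handy for highlighting or replacing tokens in place.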

# import the tokenizer factory
from enchant.tokenize import get_tokenizer

# the text to be tokenized (adjacent string literals are joined automatically)
text = ("Natural language processing (NLP) is a field "
        "of computer science, artificial intelligence "
        "and computational linguistics concerned with "
        "the interactions between computers and human "
        "(natural) languages, and, in particular, "
        "concerned with programming computers to "
        "fruitfully process large natural language "
        "corpora. Challenges in natural language "
        "processing frequently involve natural "
        "language understanding, natural language "
        "generation frequently from formal, machine"
        "-readable logical forms), connecting language "
        "and machine perception, managing human-"
        "computer dialog systems, or some combination "
        "thereof.")

# get the tokenizer class for US English
tokenizer = get_tokenizer("en_US")

# collect the (word, position) tuples yielded by the tokenizer
token_list = list(tokenizer(text))

# print the words with their positions
print(token_list)

Output:

[('Natural', 0), ('language', 8), ('processing', 17), ('NLP', 29), ('is', 34), ('a', 37), ('field', 39), ('of', 45), ('computer', 48), ('science', 57), ('artificial', 66), ('intelligence', 77), ('and', 90), ('computational', 94), ('linguistics', 108), ('concerned', 120), ('with', 130), ('the', 135), ('interactions', 139), ('between', 152), ('computers', 160), ('and', 170), ('human', 174), ('natural', 181), ('languages', 190), ('and', 201), ('in', 206), ('particular', 209), ('concerned', 221), ('with', 231), ('programming', 236), ('computers', 248), ('to', 258), ('fruitfully', 261), ('process', 272), ('large', 280), ('natural', 286), ('language', 294), ('corpora', 303), ('Challenges', 312), ('in', 323), ('natural', 326), ('language', 334), ('processing', 343), ('frequently', 354), ('involve', 365), ('natural', 373), ('language', 381), ('understanding', 390), ('natural', 405), ('language', 413), ('generation', 422), ('frequently', 433), ('from', 444), ('formal', 449), ('machine', 457), ('readable', 465), ('logical', 474), ('forms', 482), ('connecting', 490), ('language', 501), ('and', 510), ('machine', 514), ('perception', 522), ('managing', 534), ('human', 543), ('computer', 549), ('dialog', 558), ('systems', 565), ('or', 574), ('some', 577), ('combination', 582), ('thereof', 594)]

To print only the words, without their positions:

# keep only the word from each (word, position) tuple
word_list = [word for word, pos in token_list]
print(word_list)

Output:

['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'field', 'of', 'computer', 'science', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'natural', 'languages', 'and', 'in', 'particular', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', 'natural', 'language', 'generation', 'frequently', 'from', 'formal', 'machine', 'readable', 'logical', 'forms', 'connecting', 'language', 'and', 'machine', 'perception', 'managing', 'human', 'computer', 'dialog', 'systems', 'or', 'some', 'combination', 'thereof']
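Once the tuples are in a list, ordinary Python is enough to reshape them. The sketch below hard-codes the first seven tuples of the output so that it runs without Enchant installed; the name `spans` and the idea of deriving end offsets are illustrative additions, not part of the Enchant API.

```python
# The first few (word, position) tuples from the tokenizer output.
token_list = [("Natural", 0), ("language", 8), ("processing", 17),
              ("NLP", 29), ("is", 34), ("a", 37), ("field", 39)]

# Separate the words from their positions in one pass.
words, positions = zip(*token_list)

# Derive (start, end) character spans, e.g. for highlighting words in the text.
spans = [(pos, pos + len(word)) for word, pos in token_list]

print(list(words))
print(spans)
```

The end offset is simply the start offset plus the word length, so no second pass over the original text is needed.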

Last Updated :
26 May, 2020
