Enchant is a Python module used to check the spelling of a word and to suggest corrections for misspelled words. It checks whether a word exists in a dictionary or not.
Enchant also provides the enchant.tokenize module to tokenize text. Tokenizing involves splitting the body of the text into words.
Some terms that will be used frequently are:
- Corpus – Body of text, singular. Corpora is the plural of this.
- Lexicon – Words and their meanings.
- Token – Each “entity” that is a part of whatever was split up based on rules. For example, each word is a token when a sentence is “tokenized” into words.
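To make the term "token" concrete, the following is a minimal sketch that splits a sentence into word tokens using only the standard library. The regular expression and the helper name simple_word_tokens are assumptions made for illustration; Enchant applies its own language-specific rules, which may differ.

```python
import re


def simple_word_tokens(corpus):
    """Split a corpus into word tokens (runs of ASCII letters).

    Illustrative only: this is not how Enchant tokenizes text.
    """
    return re.findall(r"[A-Za-z]+", corpus)


sentence = "Each word is a token."
print(simple_word_tokens(sentence))
# ['Each', 'word', 'is', 'a', 'token']
```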
We will be using get_tokenizer() to tokenize the text. It takes a language code as input and returns the appropriate tokenization class. Instantiating this class with some text returns an iterator that yields the words contained in that text.
The items produced by the tokenizer are tuples of the form (WORD, POS), where WORD is the tokenized word and POS is the position in the string at which that word starts.
    # import the module
    from enchant.tokenize import get_tokenizer

    # the text to be tokenized
    # note: "natural language" is followed directly by "generation"
    # (no space), so the output contains the token "languagegeneration"
    text = ("Natural language processing (NLP) is a field " +
            "of computer science, artificial intelligence " +
            "and computational linguistics concerned with " +
            "the interactions between computers and human " +
            "(natural) languages, and, in particular, " +
            "concerned with programming computers to " +
            "fruitfully process large natural language " +
            "corpora. Challenges in natural language " +
            "processing frequently involve natural " +
            "language understanding, natural language" +
            "generation frequently from formal, machine" +
            "-readable logical forms), connecting language " +
            "and machine perception, managing human-" +
            "computer dialog systems, or some combination " +
            "thereof.")

    # getting tokenizer class
    tokenizer = get_tokenizer("en_US")

    # collect the (word, position) tuples
    token_list = []
    for words in tokenizer(text):
        token_list.append(words)

    # print the words with POS
    print(token_list)
Output:
    [('Natural', 0), ('language', 8), ('processing', 17), ('NLP', 29), ('is', 34), ('a', 37), ('field', 39), ('of', 45), ('computer', 48), ('science', 57), ('artificial', 66), ('intelligence', 77), ('and', 90), ('computational', 94), ('linguistics', 108), ('concerned', 120), ('with', 130), ('the', 135), ('interactions', 139), ('between', 152), ('computers', 160), ('and', 170), ('human', 174), ('natural', 181), ('languages', 190), ('and', 201), ('in', 206), ('particular', 209), ('concerned', 221), ('with', 231), ('programming', 236), ('computers', 248), ('to', 258), ('fruitfully', 261), ('process', 272), ('large', 280), ('natural', 286), ('language', 294), ('corpora', 303), ('Challenges', 312), ('in', 323), ('natural', 326), ('language', 334), ('processing', 343), ('frequently', 354), ('involve', 365), ('natural', 373), ('language', 381), ('understanding', 390), ('natural', 405), ('languagegeneration', 413), ('frequently', 432), ('from', 443), ('formal', 448), ('machine', 456), ('readable', 464), ('logical', 473), ('forms', 481), ('connecting', 489), ('language', 500), ('and', 509), ('machine', 513), ('perception', 521), ('managing', 533), ('human', 542), ('computer', 548), ('dialog', 557), ('systems', 564), ('or', 573), ('some', 576), ('combination', 581), ('thereof', 593)]
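The POS values in the output above are character offsets into the original string, so slicing the text at an offset should give the word back. The sketch below checks this on the first few tuples, which are copied by hand from the output rather than produced by calling Enchant; the helper name check_positions is hypothetical.

```python
# Start of the sample text and the first tuples from the output,
# copied by hand (Enchant is not called here).
text = "Natural language processing (NLP) is a field "
token_list = [
    ("Natural", 0), ("language", 8), ("processing", 17),
    ("NLP", 29), ("is", 34), ("a", 37), ("field", 39),
]


def check_positions(text, tokens):
    """Return True if every word is found at its reported offset."""
    return all(text[pos:pos + len(word)] == word for word, pos in tokens)


print(check_positions(text, token_list))  # True
```

A check like this can be handy when the offsets are later used to highlight or replace words in the original text.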
To print only the words, without the positions:
    # print only the words
    word_list = []
    for tokens in token_list:
        word_list.append(tokens[0])
    print(word_list)
Output:
    ['Natural', 'language', 'processing', 'NLP', 'is', 'a', 'field', 'of', 'computer', 'science', 'artificial', 'intelligence', 'and', 'computational', 'linguistics', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', 'natural', 'languages', 'and', 'in', 'particular', 'concerned', 'with', 'programming', 'computers', 'to', 'fruitfully', 'process', 'large', 'natural', 'language', 'corpora', 'Challenges', 'in', 'natural', 'language', 'processing', 'frequently', 'involve', 'natural', 'language', 'understanding', 'natural', 'languagegeneration', 'frequently', 'from', 'formal', 'machine', 'readable', 'logical', 'forms', 'connecting', 'language', 'and', 'machine', 'perception', 'managing', 'human', 'computer', 'dialog', 'systems', 'or', 'some', 'combination', 'thereof']
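The same result can be written more compactly with a list comprehension that unpacks each tuple. This is only a stylistic alternative to the loop; the short token_list here is copied by hand from the earlier output so the snippet runs without Enchant.

```python
# A few (word, position) tuples copied by hand from the earlier output.
token_list = [("Natural", 0), ("language", 8), ("processing", 17), ("NLP", 29)]

# Unpack each tuple and keep only the word, discarding the position.
word_list = [word for word, _pos in token_list]
print(word_list)
# ['Natural', 'language', 'processing', 'NLP']
```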