What is a corpus?
A corpus can be defined as a collection of text documents. It can be thought of as just a bunch of text files in a directory, often alongside many other directories of text files.
How to create a wordlist corpus?
- WordListCorpusReader is one of the simplest CorpusReader classes.
- This class provides access to files that contain a list of words, one word per line.
- The wordlist file can be a CSV file or a txt file with one word on each line. Our wordlist file contains the words: Lazyroar, for, Lazyroar, welcomes, you, to, nlp, articles.
- It takes two arguments:
  - the directory path containing the files
  - a list of filenames
Code #1 : Creating a wordlist corpus
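    from nltk.corpus.reader import WordListCorpusReader

    # First argument: the corpus root directory; second: the list of file names
    x = WordListCorpusReader('.', ['C:\\Users\\dell\\Desktop\\wordlist.txt'])

    # In an interactive session these display the words and the file ids
    x.words()
    x.fileids()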
Output :
    ['Lazyroar', 'for', 'Lazyroar', 'welcomes', 'you', 'to', 'nlp', 'articles']
    ['C:\\Users\\dell\\Desktop\\wordlist.txt']
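The example above depends on a file that exists only on the author's machine. A more portable variant might look like the sketch below, which assumes a temporary directory as the corpus root and a file name given relative to that root. The file name `wordlist.txt` and the variable names are illustrative, not part of the original example.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Sample data mirroring the article's wordlist: one word per line.
sample_words = ["Lazyroar", "for", "Lazyroar", "welcomes",
                "you", "to", "nlp", "articles"]

with tempfile.TemporaryDirectory() as corpus_root:
    # Write the wordlist file into the (assumed) corpus root directory.
    wordlist_path = os.path.join(corpus_root, "wordlist.txt")
    with open(wordlist_path, "w", encoding="utf-8") as f:
        f.write("\n".join(sample_words))

    # Root directory first, then file names relative to that root.
    reader = WordListCorpusReader(corpus_root, ["wordlist.txt"])
    loaded_words = reader.words()
    loaded_fileids = reader.fileids()

print(loaded_words)
print(loaded_fileids)
```

Keeping the file names relative to the root means `fileids()` returns short identifiers rather than full paths, so the same code works on any machine.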
Code #2 : Accessing raw.
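    # In an interactive session this displays the raw, untokenized file contents
    x.raw()

    from nltk.tokenize import line_tokenize

    # Split the raw text into one token per line
    print("Wordlist : ", line_tokenize(x.raw()))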
Output :
    'Lazyroar\r\nfor\r\nLazyroar\r\nwelcomes\r\nyou\r\nto\r\nnlp\r\narticles'
    Wordlist : ['Lazyroar', 'for', 'Lazyroar', 'welcomes', 'you', 'to', 'nlp', 'articles']
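The raw text and the wordlist hold the same content: the raw string simply keeps the line breaks that separate the words. The sketch below illustrates that relationship in plain Python using `str.splitlines`. It is an approximation of the idea under that assumption, not NLTK's actual implementation of `line_tokenize`.

```python
# Raw text copied from the output above (Windows-style \r\n line endings).
raw_text = "Lazyroar\r\nfor\r\nLazyroar\r\nwelcomes\r\nyou\r\nto\r\nnlp\r\narticles"

# Split on line boundaries and drop any blank lines.
wordlist = [line for line in raw_text.splitlines() if line.strip()]

print("Wordlist : ", wordlist)
```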
Code #3 : Accessing Name Wordlist corpus
    # Accessing pre-defined wordlist
    from nltk.corpus import names

    print("Path : ", names.fileids())
    print("\nNo. of female names : ", len(names.words('female.txt')))
    print("\nNo. of male names : ", len(names.words('male.txt')))
Output :
    Path : ['female.txt', 'male.txt']
    No. of female names : 5001
    No. of male names : 2943
Code #4 : Accessing English Wordlist corpus
    # Accessing pre-defined wordlist
    from nltk.corpus import words

    print("File : ", words.fileids())
    print("\nNo. of words in en-basic : ", len(words.words('en-basic')))
    print("\nNo. of words in en : ", len(words.words('en')))
Output :
    File : ['en', 'en-basic']
    No. of words in en-basic : 850
    No. of words in en : 235886