NLP | Custom corpus

27 July 2024

What is a corpus?
A corpus is a collection of text documents. It can be thought of as a set of text files in a directory, often alongside many other directories of text files.

How is it done?
NLTK defines a list of data paths (directories) in nltk.data.path. A custom corpus must be placed inside one of these paths so that NLTK can find it.
We can create a custom nltk_data directory in our home directory and then verify that it is in the list of known paths given by nltk.data.path.

Code #1 : Creating a custom directory and verifying it.

# importing libraries
import os

# path of the custom data directory in the home directory
path = os.path.expanduser('~/nltk_data')

# create the directory if it does not exist yet
if not os.path.exists(path):
    os.mkdir(path)

print("Does path exist :", os.path.exists(path))

import nltk.data

# check that NLTK knows about this directory
print("\nDoes path exist in nltk :", path in nltk.data.path)

Output :

Does path exist : True

Does path exist in nltk : True

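Code #2 : Loading a wordlist file.

This assumes that a file word_file.txt containing the single word nltk exists in the corpora/cookbook subdirectory of one of the known data paths.

# loading libraries
import nltk.data

# 'raw' returns the file contents as bytes
nltk.data.load('corpora/cookbook/word_file.txt', format='raw')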

Output :

b'nltk\n'
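
The wordlist file loaded above has to exist before it can be loaded. The following is a minimal sketch of creating it with the standard library only. The helper name create_wordlist and its parameters are hypothetical, chosen for illustration. The demo writes into a temporary directory; passing os.path.expanduser('~/nltk_data') as root would place the file where Code #2 looks for it.

```python
import os
import tempfile


def create_wordlist(words, root, corpus="cookbook", filename="word_file.txt"):
    """Write one word per line to <root>/corpora/<corpus>/<filename>.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    corpus_dir = os.path.join(root, "corpora", corpus)
    os.makedirs(corpus_dir, exist_ok=True)  # create intermediate directories
    file_path = os.path.join(corpus_dir, filename)
    # newline="\n" keeps line endings identical on every platform
    with open(file_path, "w", encoding="utf-8", newline="\n") as f:
        for word in words:
            f.write(word + "\n")
    return file_path


# Demo in a throwaway directory (assumption: in practice root would be
# os.path.expanduser('~/nltk_data') so that NLTK can find the file).
demo_root = tempfile.mkdtemp()
demo_path = create_wordlist(["nltk"], demo_root)

with open(demo_path, "rb") as f:
    demo_bytes = f.read()

print(demo_bytes)  # b'nltk\n'
```

The bytes printed here match the output of Code #2, since the 'raw' format returns the file contents unchanged.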

How does all this work?

nltk.data.load() recognizes several formats, including ‘raw’, ‘pickle’ and ‘yaml’.
If no format is given, it guesses the format from the file’s extension.
In the code above, the ‘raw’ format is specified explicitly, so the file contents are returned as bytes.
If the file name ends in ‘.yaml’, there is no need to specify the format.
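
The idea of guessing a format from the file extension can be sketched as follows. This is a simplified illustration, not NLTK's actual implementation: the function guess_format and the table KNOWN_EXTENSIONS are hypothetical, and the table is deliberately limited to formats named above, while real NLTK versions may recognize more extensions.

```python
import os

# Hypothetical, simplified extension-to-format table for illustration only.
KNOWN_EXTENSIONS = {"pickle": "pickle", "yaml": "yaml"}


def guess_format(resource_name, format=None):
    """Return the explicit format if given, else infer it from the extension."""
    if format is not None:
        return format
    extension = os.path.splitext(resource_name)[1].lstrip(".").lower()
    if extension not in KNOWN_EXTENSIONS:
        raise ValueError(
            "Could not determine format for %r; pass format explicitly"
            % resource_name
        )
    return KNOWN_EXTENSIONS[extension]


print(guess_format("corpora/cookbook/synonyms.yaml"))                # yaml
print(guess_format("corpora/cookbook/word_file.txt", format="raw"))  # raw
```

An explicit format always takes precedence in this sketch, which mirrors why the wordlist example passes the format while the YAML example does not.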

Code #3 : How to load a YAML file

import nltk.data 
  
# loading file using the path 
nltk.data.load('corpora/cookbook/synonyms.yaml') 

Output :

{'bday': 'birthday'}
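
The YAML file loaded above must also exist beforehand. A minimal sketch of writing it without any third-party library is shown below. The helper write_synonyms_yaml is hypothetical, and it only handles plain string keys and values that need no YAML quoting. The demo again uses a temporary directory as a stand-in for a real data path such as ~/nltk_data.

```python
import os
import tempfile


def write_synonyms_yaml(synonyms, root, corpus="cookbook",
                        filename="synonyms.yaml"):
    """Write a flat mapping as 'key: value' lines under <root>/corpora/<corpus>.

    Hypothetical helper; assumes keys and values are simple strings
    that are valid YAML scalars without quoting.
    """
    corpus_dir = os.path.join(root, "corpora", corpus)
    os.makedirs(corpus_dir, exist_ok=True)
    file_path = os.path.join(corpus_dir, filename)
    with open(file_path, "w", encoding="utf-8", newline="\n") as f:
        for short, full in synonyms.items():
            f.write(f"{short}: {full}\n")
    return file_path


# Demo in a throwaway directory (assumption: a real setup would use a
# directory listed in nltk.data.path instead).
yaml_root = tempfile.mkdtemp()
yaml_path = write_synonyms_yaml({"bday": "birthday"}, yaml_root)

with open(yaml_path, encoding="utf-8") as f:
    yaml_text = f.read()

print(yaml_text)  # bday: birthday
```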
