Hello all, and welcome to the second article of the series – NLP with NLTK. The first article of the series can be found here, in case you have missed it.
In this article we will talk about basic NLP concepts and use NLTK to implement the concepts.
Contents:
- Corpus
- Tokenization/Segmentation
- Frequency Distribution
- Conditional Frequency Distribution
- Normalization
- Zipf’s law
- Stemming
- Edit Distance/Text Similarity
- Micro Project – Fuzzy matching of two sentences
1. CORPUS:
A corpus is a collection of machine-readable text that is sampled to represent a natural language or language variety; "corpus" is the technical term for large bodies of digital text. The plural of corpus is corpora. Corpora play an essential role in NLP research as well as in other investigations related to linguistics.
Beyond the basic preparatory work of converting written text to machine-readable form or transcribing audio/video speech, there is another step known as corpus markup. Corpus markup is a system of standard codes inserted into an electronically stored document to provide information about the text itself (i.e., text metadata) and to govern formatting, printing or other processing (i.e., structural organization). Metadata markup can be embedded in the same document or stored in a separate but linked document, whereas structural markup has to be embedded in the text.
Metadata and structural markup are important for a corpus because –
- Metadata is required to restore the contextual information of the sampled texts. This helps us relate each specimen to its original linguistic habitat.
- Filenames – even though they provide very little extra-textual information and no textual information, they are of great importance to NLP researchers and should be encoded separately from the corpus. This markup adds value to the corpus and helps answer a broad range of research questions.
- Preprocessing of the data itself – when written data contains foreign-language quotations, when images and tables are removed from a document and placeholders must be inserted to indicate their locations and the type of omission, or when transcribed audio/video data contains pauses and paralinguistic features – all of these things need to be marked up. Corpus markup therefore has a clear connection with existing linguistic transcription practice.
In order to extract linguistic information from a corpus, that information must first be encoded in the corpus by a process technically known as "corpus annotation."
Text corpora contain rich metadata, which is useful for deriving valuable insights. Corpora can be annotated in various forms and at different levels, such as –
- Phonological annotation: syllable boundaries (phonetic/phonemic annotation) and prosodic features (prosodic annotation).
- Morphological annotation: corpora can be annotated in terms of prefixes, suffixes and stems.
- Lexical annotation: corpora can be annotated in terms of parts-of-speech tags, lemmas and semantics (see the POS-tagging sketch after this list).
- Syntactic annotation: corpora can be annotated in terms of syntactic analysis – parsing, treebanking or bracketing.
- Discourse annotation: at this level, corpora can be annotated to show anaphoric relations (coreference annotation), pragmatic information such as speech acts (pragmatic annotation), and stylistic features such as speech and thought presentation (stylistic annotation).
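As a concrete illustration of the lexical level, NLTK can add a simple parts-of-speech annotation layer to raw text. The snippet below is only a minimal sketch; it assumes the 'punkt' and 'averaged_perceptron_tagger' resources have already been fetched with nltk.download().

# a minimal sketch of lexical (POS) annotation with NLTK
# assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger') have been run
import nltk

sentence = "Corpora are annotated at several different levels."
tokens = nltk.word_tokenize(sentence)   # split the sentence into tokens
nltk.pos_tag(tokens)                    # returns a list of (token, POS-tag) tuples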
NLTK ships with several famous corpora. These can be found in the nltk_data folder that we downloaded in the previous article. In order to access the corpora provided by NLTK, we need to use the nltk.corpus module.
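For example, a built-in corpus can be loaded and inspected like this (a minimal sketch, assuming the Gutenberg sample data from the previous article is already present in nltk_data):

# accessing a built-in NLTK corpus via the nltk.corpus module
from nltk.corpus import gutenberg

gutenberg.fileids()                         # list the files in the corpus
emma = gutenberg.words('austen-emma.txt')   # the text as a list of word tokens
len(emma)                                   # number of tokens in the text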
2. TOKENIZATION / SEGMENTATION:
Text segmentation is the process of converting a well-defined text corpus into its component words and sentences. Word segmentation breaks up the sequence of characters in a text by locating the word boundaries – the points where one word ends and another begins. For computational linguistics purposes, the words identified as a result of segmentation are referred to as tokens, and word segmentation is also known as tokenization. Hence, tokenization divides strings into lists of substrings.
Sentence segmentation is the process of determining the longer processing units (sentences) consisting of one or more words. This task involves identifying sentence boundaries between words in different sentences. Since most written languages have punctuation marks that occur at sentence boundaries, sentence segmentation is frequently referred to as sentence boundary detection, sentence boundary disambiguation or sentence boundary recognition. All these terms refer to the same task: determining how a text should be divided into sentences for further processing.
Demo:
# Example 1: word_tokenize()
text = "Hi there! I'm going out for shopping. Would you like to come?"

from nltk.tokenize import word_tokenize
word_token = word_tokenize(text)
word_token

Out[1]: ['Hi', 'there', '!', 'I', "'m", 'going', 'out', 'for', 'shopping', '.', 'Would', 'you', 'like', 'to', 'come', '?']
Individual sentences are tokenized into words. Word tokenization is performed using the word_tokenize() function, which uses an instance of TreebankWordTokenizer and is NLTK's default word tokenizer. Notice the output – the full stop after the word 'shopping' is treated as a separate token.
# Example 2: TreebankWordTokenizer()
from nltk.tokenize import TreebankWordTokenizer
treebank_word = TreebankWordTokenizer()
treebank_word.tokenize(text)

Out[2]: ['Hi', 'there', '!', 'I', "'m", 'going', 'out', 'for', 'shopping.', 'Would', 'you', 'like', 'to', 'come', '?']
Word tokenization can also be performed by loading TreebankWordTokenizer and then calling its tokenize() method. This tokenizer follows the Penn Treebank conventions and splits text into words based on spaces and punctuation. Notice the output – when TreebankWordTokenizer is called directly on this text, the full stop after the word 'shopping' stays attached and the two are tokenized together.
# Example 3: WordPunctTokenizer()
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenize = WordPunctTokenizer()
word_punct_tokenize.tokenize(text)

Out[3]: ['Hi', 'there', '!', 'I', "'", 'm', 'going', 'out', 'for', 'shopping', '.', 'Would', 'you', 'like', 'to', 'come', '?']
This is another word tokenizer available in NLTK. It works by splitting punctuation off as separate tokens – notice how "I'm" becomes 'I', "'", 'm' in the output.
Sentence Tokenizer:
# loading an external corpus in nltk
paragraph = open('C:/Users/Sukanya/Desktop/paragraph.txt')

# reading from the external corpus
p = paragraph.read()

# the contents of the corpus
p

Out[4]: "Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed. "
By default NLTK uses sent_tokenize() to tokenize sentences.
# Example 1: using sent_tokenize()
from nltk.tokenize import sent_tokenize
sentence_tokenize = sent_tokenize(p)
sentence_tokenize

Out[5]: ['Python is an interpreted, object-oriented, high-level programming language with dynamic semantics.', 'Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.', "Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance.", 'Python supports modules and packages, which encourages program modularity and code reuse.', 'The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed. ']
Paragraphs can be tokenized into sentences. For this, sent_tokenize() uses an instance of PunktSentenceTokenizer, which has been pre-trained on several European languages to recognize the punctuation that starts and ends a sentence.
# Example 2: using PunktSentenceTokenizer
import nltk
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer.tokenize(p)

Out[108]: ['Python is an interpreted, object-oriented, high-level programming language with dynamic semantics.', 'Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together.', "Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance.", 'Python supports modules and packages, which encourages program modularity and code reuse.', 'The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed. ']
PunktSentenceTokenizer implements a sentence boundary detection algorithm based on Kiss and Strunk, "Unsupervised Multilingual Sentence Boundary Detection". It is the class behind NLTK's default sentence tokenizer, and because the algorithm is unsupervised it can also be trained on a large body of your own text, as sketched below.
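The sketch below trains a PunktSentenceTokenizer on raw text from the Gutenberg corpus and then applies it. This is only one possible way to train it and assumes the Gutenberg sample data is available locally.

# a minimal sketch: training PunktSentenceTokenizer on your own (large) body of text
from nltk.corpus import gutenberg
from nltk.tokenize import PunktSentenceTokenizer

train_text = gutenberg.raw('austen-emma.txt')           # raw training text
custom_tokenizer = PunktSentenceTokenizer(train_text)   # learns abbreviation/boundary statistics

custom_tokenizer.tokenize("Mr. Smith went out at 5 p.m. He came back an hour later.")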
3. FREQUENCY DISTRIBUTION:
A frequency distribution counts each vocabulary item in a given text. It tells us the frequency, i.e., the number of times a particular word appears in the text.
Demo:
from nltk import FreqDist
from nltk.corpus import brown

verbs = ["should", "may", "can"]
genres = ["news", "government", "romance"]

for g in genres:
    words = brown.words(categories=g)
    freq = FreqDist([w.lower() for w in words if w.lower() in verbs])
    print g, freq

news <FreqDist with 3 samples and 248 outcomes>
government <FreqDist with 3 samples and 411 outcomes>
romance <FreqDist with 3 samples and 123 outcomes>

freq
Out[7]: FreqDist({u'can': 79, u'may': 11, u'should': 33})

freq.plot()
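Since FreqDist behaves like a counter, individual counts and the most frequent items can be read off directly. The sketch below continues from the demo above, where freq holds the distribution for the last genre processed (romance):

# inspecting the FreqDist left over from the loop above (the 'romance' genre)
freq['can']           # 79 - the count of 'can', matching the output shown above
freq.most_common(2)   # the two most frequent modal verbs in this genre
freq.N()              # 123 - the total number of outcomes counted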
4. CONDITIONAL FREQUENCY DISTRIBUTION:
A conditional frequency distribution is a collection of frequency distributions, one for each condition, where a condition can be an entity such as an event (say, war) or a place (say, America). Each observation is recorded as a (condition, word) pair; the toy example after the demos below makes this pair structure explicit.
Demo:
# using NLTK's inaugural corpus - a collection of 56 texts, one for each American
# presidential inaugural address
import nltk
from nltk.corpus import inaugural

# fetching all file names from the inaugural corpus
inaugural.fileids()

Out[5]: [u'1789-Washington.txt', u'1793-Washington.txt', u'1797-Adams.txt', u'1801-Jefferson.txt', u'1805-Jefferson.txt', u'1809-Madison.txt', u'1813-Madison.txt', u'1817-Monroe.txt', u'1821-Monroe.txt', u'1825-Adams.txt', u'1829-Jackson.txt', u'1833-Jackson.txt', u'1837-VanBuren.txt', u'1841-Harrison.txt', u'1845-Polk.txt', u'1849-Taylor.txt', u'1853-Pierce.txt', u'1857-Buchanan.txt', u'1861-Lincoln.txt', u'1865-Lincoln.txt', u'1869-Grant.txt', u'1873-Grant.txt', u'1877-Hayes.txt', u'1881-Garfield.txt', u'1885-Cleveland.txt', u'1889-Harrison.txt', u'1893-Cleveland.txt', u'1897-McKinley.txt', u'1901-McKinley.txt', u'1905-Roosevelt.txt', u'1909-Taft.txt', u'1913-Wilson.txt', u'1917-Wilson.txt', u'1921-Harding.txt', u'1925-Coolidge.txt', u'1929-Hoover.txt', u'1933-Roosevelt.txt', u'1937-Roosevelt.txt', u'1941-Roosevelt.txt', u'1945-Roosevelt.txt', u'1949-Truman.txt', u'1953-Eisenhower.txt', u'1957-Eisenhower.txt', u'1961-Kennedy.txt', u'1965-Johnson.txt', u'1969-Nixon.txt', u'1973-Nixon.txt', u'1977-Carter.txt', u'1981-Reagan.txt', u'1985-Reagan.txt', u'1989-Bush.txt', u'1993-Clinton.txt', u'1997-Clinton.txt', u'2001-Bush.txt', u'2005-Bush.txt', u'2009-Obama.txt']

cfd = nltk.ConditionalFreqDist(
          (target, fileid)
          for fileid in inaugural.fileids()
          for w in inaugural.words(fileid)
          for target in ['america', 'citizen']
          if w.lower().startswith(target))

print cfd
<ConditionalFreqDist with 2 conditions>

cfd.plot()
# a different condition -
cfd = nltk.ConditionalFreqDist(
          (target, fileid)
          for fileid in inaugural.fileids()
          for w in inaugural.words(fileid)
          for target in ['america', 'war']
          if w.lower().startswith(target))

cfd.plot()
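To make the pair structure explicit, here is a minimal toy example (hypothetical word/genre pairs, not drawn from any corpus) showing that a ConditionalFreqDist is simply built from (condition, word) pairs and keeps one FreqDist per condition:

# a minimal sketch of ConditionalFreqDist built from (condition, word) pairs
import nltk

pairs = [('news', 'war'), ('news', 'election'), ('news', 'war'),
         ('romance', 'love'), ('romance', 'love'), ('romance', 'heart')]
cfd = nltk.ConditionalFreqDist(pairs)

cfd.conditions()    # the two conditions: 'news' and 'romance'
cfd['news']         # FreqDist({'war': 2, 'election': 1})
cfd['news']['war']  # 2 - 'war' occurred twice under the 'news' condition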
5. NORMALIZATION:
Text normalization is a step that involves merging different written forms of a token into a canonical normalized form; for example, a document may contain the equivalent tokens “Mr.”, “Mr”, “mister”, and “Mister” that would all be normalized to a single form.
In other words, normalization is a process that converts a list of words into a more uniform sequence. This is helpful for preparing text for later processing: converting words into a standard format ensures that downstream operations can work with the corpus without having to deal with issues that might compromise the results.
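As a minimal sketch of the idea (the mapping below is hypothetical and hand-made, not an NLTK API), normalization can be as simple as lower-casing tokens and folding known variants into one canonical form:

# a minimal sketch of normalization: case folding plus a hypothetical canonical mapping
tokens = ["Mr.", "Mr", "mister", "Mister", "Shopping", "SHOPPING"]

canonical = {"mr.": "mr", "mister": "mr"}   # hand-made variant table (illustrative only)
normalized = [canonical.get(t.lower(), t.lower()) for t in tokens]
print normalized   # ['mr', 'mr', 'mr', 'mr', 'shopping', 'shopping']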
# list of NLTK's stopword collection in english
from nltk.corpus import stopwords
print stopwords.words("english")

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should', u'now', u'd', u'll', u'm', u'o', u're', u've', u'y', u'ain', u'aren', u'couldn', u'didn', u'doesn', u'hadn', u'hasn', u'haven', u'isn', u'ma', u'mightn', u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn']
This is NLTK's collection of English stopwords – I, me, my, etc.
Demo:
# removing stopwords
text = "Hi there! I'm going out for shopping. Would you like to come?"

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize(text)
tokens

Out[7]: ['Hi', 'there', '!', 'I', "'m", 'going', 'out', 'for', 'shopping', '.', 'Would', 'you', 'like', 'to', 'come', '?']

filtered_sentence = [w for w in tokens if w not in stop_words]
filtered_sentence

Out[9]: ['Hi', '!', 'I', "'m", 'going', 'shopping', '.', 'Would', 'like', 'come', '?']
You can observe from the output that the stopwords have been removed from the sentence (capitalized tokens such as 'I' and 'Would' survive because the stopword list is lowercase and the comparison is case-sensitive).
# remove punctuation
text = "Hi there! I'm going out for shopping. Would you like to come?"

from nltk.tokenize import RegexpTokenizer
token = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True)
token.tokenize(text)

Out[13]: ['Hi', ' ', ' ', 'there', ' ', ' ', 'I', "'", "'", 'm', ' ', ' ', 'going', ' ', ' ', 'out', ' ', ' ', 'for', ' ', ' ', 'shopping', ' ', ' ', 'Would', ' ', ' ', 'you', ' ', ' ', 'like', ' ', ' ', 'to', ' ', ' ', 'come', '?', '?']
You can see that the punctuation has been split off separately, based on the regular expression specified in the code above (the extra space and quote tokens appear because the pattern's capturing groups are included in the split).
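If you only want to keep word characters and drop punctuation entirely, a much simpler pattern works as well (a minimal sketch using the same RegexpTokenizer class):

# keeping only runs of word characters; punctuation is dropped
from nltk.tokenize import RegexpTokenizer

simple_token = RegexpTokenizer(r'\w+')
simple_token.tokenize(text)
# ['Hi', 'there', 'I', 'm', 'going', 'out', 'for', 'shopping', 'Would', 'you', 'like', 'to', 'come']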
6. ZIPF’S LAW:
According to this law – for any given word we can establish the following relation:
f × r = k,
Where,
f is the frequency of that word,
r is the rank or the position of the word in the sorted list, and
k is a constant.
If we take a large corpus, count the number of times each word occurs in it, and then list those words in decreasing order of frequency, we would observe that there is a relationship between the frequency of a word and its position in the list.
This mathematical relationship may not hold exactly, but it is still useful for describing how words are distributed in human language.
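For a quick feel of what the law predicts, suppose the most frequent word in a corpus occurs 60,000 times (a purely hypothetical figure), so k ≈ 60,000; the expected frequency at rank r then falls off as k / r:

# illustrative only: expected frequencies under Zipf's law for a hypothetical constant k
k = 60000
for r in [1, 2, 3, 10, 100]:
    print r, k / r   # rank 1 -> 60000, rank 2 -> 30000, rank 10 -> 6000, rank 100 -> 600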
Demo:
# for plotting within the notebook
%matplotlib inline

# for plotting
import matplotlib
# import pyplot
import matplotlib.pyplot as plt

# importing nltk into the jupyter notebook
import nltk
# import gutenberg from the nltk corpus inside nltk_data (sample data downloaded earlier)
from nltk.corpus import gutenberg
# import the FreqDist class
from nltk import FreqDist

# create a frequency distribution object
fd = FreqDist()

# count each token in each text of the Gutenberg collection
for text in gutenberg.fileids():
    for word in gutenberg.words(text):
        fd[word] += 1

# initialize two empty lists which will hold ranks and frequencies
ranks = []
freqs = []

# generate a (rank, frequency) point for each counted token and append it to the
# respective lists; fd.most_common() returns the tokens sorted by decreasing frequency
for rank, (word, count) in enumerate(fd.most_common()):
    ranks.append(rank + 1)
    freqs.append(count)

# plot rank vs frequency on a log-log plot and show it
# (rank goes on the x-axis, frequency on the y-axis)
plt.loglog(ranks, freqs)
plt.xlabel('Rank (r)', fontsize=20, fontweight='bold', fontstyle='italic')
plt.ylabel('Frequency (f)', fontsize=20, fontweight='bold', fontstyle='italic')
plt.show()
7. STEMMING:
Stemming is the process of obtaining the root word (stem) from a given word by eliminating its affixes. Stemming is part of the normalization process and is used extensively in information retrieval. Searching with stems has the advantage of increasing recall by retrieving terms that have the same root but different endings. At the same time, it can retrieve many irrelevant terms that share a root but are not related to the topic of the search. For accurate information retrieval, the search term should be as long as necessary to achieve precision but short enough to increase recall.
Word stem example – the words playing, plays, played and player all share the stem play.
Demo:
# porter stemmer
import nltk
from nltk.stem import PorterStemmer

porter_stem = PorterStemmer()
stem = porter_stem.stem('I am writing an article on natural language processing')
stem
Out[5]: u'I am writing an article on natural language process'

stem = porter_stem.stem('writing')
stem
Out[7]: u'write'

stem = porter_stem.stem('am writing')
stem
Out[9]: u'am writ'

stem = porter_stem.stem('I am writing')
stem
Out[11]: u'I am writ'
The Porter Stemming algorithm is used as a normalization process while setting up Information Retrieval systems. This works by eliminating the common morphological and inflexional endings from English words. You can read more about the Porter Stemmer here.
# LancasterStemmer, also known as the Paice-Husk stemmer
from nltk.stem import LancasterStemmer

lan_stem = LancasterStemmer()
stem = lan_stem.stem('I am writing an article on natural language processing')
stem
Out[16]: 'i am writing an article on natural language processing'

stem = lan_stem.stem('writing')
stem
Out[20]: 'writ'

stem = lan_stem.stem('processing')
stem
Out[22]: 'process'

stem = lan_stem.stem('ODSC rocks')
stem
Out[24]: 'odsc rocks'
The Paice-Husk stemmer was developed at Lancaster University and uses a rule-execution mechanism with externally stored rules, whereas the Porter Stemmer encodes its rules algorithmically. This flexibility – allowing a new set of rules to be specified without any major change to the program – makes the Lancaster Stemmer attractive compared with the Porter Stemmer.
# building your own stemmer using RegexpStemmer
from nltk.stem import RegexpStemmer

regexp_stem = RegexpStemmer('ing$|s$')
stem = regexp_stem.stem('writing')
stem
Out[40]: u'writ'

stem = regexp_stem.stem('writes')
stem
Out[42]: u'write'
You can build your own stemmer using regular expressions. The affixes to strip are specified as a regular expression, and when stemming is applied, any substring matching the expression is removed, leaving the stem (as seen in the example above). RegexpStemmer is useful where the Porter or Lancaster stemmers do not yield appropriate results.
# Snowball stemmer
from nltk.stem import SnowballStemmer

# languages supported by the snowball stemmer
SnowballStemmer.languages
Out[44]: (u'danish', u'dutch', u'english', u'finnish', u'french', u'german', u'hungarian', u'italian', u'norwegian', u'porter', u'portuguese', u'romanian', u'russian', u'spanish', u'swedish')

# selecting the italian language
italian_stem = SnowballStemmer('italian')
# stemming an italian word
stem = italian_stem.stem('cenerentola')
stem
Out[48]: u'cenerentol'

# selecting the english language
english_stem = SnowballStemmer('english')
# stemming english words
stem = english_stem.stem('written')
stem
Out[53]: u'written'

stem = english_stem.stem('writing')
stem
Out[55]: u'writ'
The Snowball stemmer supports 13 languages besides English (the 'porter' entry in the list above refers to the original Porter algorithm, not a language), which makes it more versatile. To use it, we first create an instance for the language we want and then perform the stemming operations. The Snowball stemmer is widely used because it generally gives more accurate results than the Porter Stemmer; it was developed to address some of the Porter Stemmer's flaws. You can read more about the Snowball stemmer here.
Below is a code sample showing the difference in results by Snowball Stemmer and Porter Stemmer.
# the Snowball english stemmer improves on the original porter stemmer
stem = english_stem.stem('generously')
stem
Out[57]: u'generous'

stem = porter_stem.stem('generously')
stem
Out[60]: u'gener'
8. EDIT DISTANCE/TEXT SIMILARITY:
Edit distance is a way of quantifying how dissimilar two texts (strings) are to one another by counting the minimum number of edit operations (insert, delete, substitution) required to transform one string to the other.
8.1 LEVENSHTEIN EDIT DISTANCE:
This measures the distance between two strings (sequences of characters) and belongs to the family of edit-distance metrics. It is defined as the minimum number of edits – insertions, deletions and substitutions – required to convert one word into another. NLTK uses this as its default edit distance.
Demo:
# Levenshtein edit distance (NLTK's default edit distance)
import nltk
from nltk.metrics import *

edit_distance("writing", "coding")
Out[3]: 4
In order to measure the similarity between two strings, we will refer to them as the source string (s) and the target string (t). The distance is the number of deletions, insertions or substitutions required to transform s into t. For example:
If s is “test” and t is “test”, then LD(s,t) = 0, because no transformations are needed. The strings are already identical.
If s is “test” and t is “tent”, then LD(s,t) = 1, because one substitution (change “s” to “n”) is sufficient to transform s into t.
The greater the Levenshtein distance, the more different the strings are.
Let's look at how the algorithm works:
Step 1:
Set n to be the length of s.
Set m to be the length of t.
If n = 0, return m and exit.
If m = 0, return n and exit.
Construct a matrix containing 0…m rows and 0…n columns.
Step 2:
Initialize the first row to 0…n.
Initialize the first column to 0…m.
Step 3:
Examine each character of s (i from 1 to n).
Step 4:
Examine each character of t (j from 1 to m).
Step 5:
If s[i] equals t[j], the cost is 0.
If s[i] doesn’t equal t[j], the cost is 1.
Step 6:
Set cell d[i,j] of the matrix equal to the minimum of:
a. The cell immediately above plus 1: d[i-1,j] + 1.
b. The cell immediately to the left plus 1: d[i,j-1] + 1.
c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
Step 7:
After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].
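The steps above translate directly into code. The function below is a minimal sketch of that matrix algorithm, written for illustration only; it is not NLTK's own implementation.

# a minimal sketch of the matrix-based Levenshtein algorithm described above
def levenshtein(s, t):
    n, m = len(s), len(t)
    # Step 1: handle the trivial cases
    if n == 0:
        return m
    if m == 0:
        return n
    # Steps 1-2: build the matrix and initialize the first row and column
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    # Steps 3-6: fill the matrix cell by cell
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    # Step 7: the distance is in the bottom-right cell
    return d[n][m]

print levenshtein("writing", "coding")   # 4, matching nltk's edit_distance above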
8.2 JACCARD’S SIMILARITY COEFFICIENT:
Also known as Intersection over Union, this is a statistical method for comparing the similarity and diversity of texts. It works by comparing the members of two sets to distinguish shared members from distinct ones. It measures the similarity of the two sets of data on a range of 0% to 100%, with a higher percentage indicating greater similarity.
The formula to find the index is:
Jaccard index = (the number in both sets)/(the number in either set) * 100
Its formula notation is:
J(X,Y) = |X ⋂ Y| / |X ⋃ Y|
Demo:
# Jaccard distance
X = set([15, 16, 17, 18])
Y = set([17, 19, 20])
jaccard_distance(X, Y)
Out[6]: 0.8333333333333334
How the algorithm works:
- Count the number of members which are shared between both the sets.
- Count the total number of members in both sets (shared and unshared).
- Divide the number of shared members (1) by the total number of members (2).
- Multiply the number you found in (3) by 100
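The sketch below walks through those steps for the same two sets used in the demo. Note that NLTK's jaccard_distance() returns 1 minus the Jaccard index, which is why the demo above printed 0.8333 (= 1 − 1/6) rather than the similarity itself.

# a minimal sketch of the Jaccard index computed by hand for the demo sets
X = set([15, 16, 17, 18])
Y = set([17, 19, 20])

shared = len(X & Y)                            # step 1: members in both sets -> 1
total = len(X | Y)                             # step 2: members in either set -> 6
jaccard_index = float(shared) / total * 100    # steps 3 and 4 -> 16.67 (percent)

print jaccard_index                            # 16.666...
print 1 - float(shared) / total                # 0.8333..., the value jaccard_distance returned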
9. MICRO PROJECT
NATURAL LANGUAGE PROCESSING – FUZZY MATCHING OF TWO SENTENCES
To wrap up, we will go through a micro project. In this project we will measure the similarity of two sentences that convey the same meaning but are expressed in different voices. Let's get to the code.
# NATURAL LANGUAGE PROCESSING - FUZZY MATCHING OF TWO SENTENCES
# In this micro project we will see the similarity of two sentences, both intending
# the same meaning but expressed in different voices.

# assigning a variable 'a' to the first sentence
a = "If I don't buy some new music every month, I get bored with my collection."
# assigning a variable 'b' to the second sentence
b = "I get bored with my collection so I buy some new music every month."

# importing NLTK and the modules needed to perform basic text operations
import nltk
from nltk.tokenize import WordPunctTokenizer   # tokenizer
from nltk.stem import SnowballStemmer          # stemmer
from nltk.corpus import stopwords              # stopwords

stop_words = set(stopwords.words("english"))   # setting and selecting stopwords to be in english
tokenizer = WordPunctTokenizer()               # assigning a WordPunctTokenizer instance to a variable
# This is required because we can't pass the text directly inside WordPunctTokenizer() -
# like this --> WordPunctTokenizer(text). Doing this would give an error.
# Check the code set below to understand the error -
# import nltk
# text = "Hi there! I'm going out for shopping. Would