According to McKinsey, NLP accelerates the synthesis of various non-automated processes by 60%, with an especially large impact in healthcare. Natural Language Processing (NLP) has become an important part of modern systems and is used extensively in search engines, conversational interfaces, document processors, and more. Machines handle structured data well, but they struggle with free-form text. The goal of NLP is to develop algorithms that enable computers to understand free-form text and, ultimately, to understand language.
This article is an excerpt from the book Artificial Intelligence with Python, Second Edition by Alberto Artasanchez and Prateek Joshi, a completely updated and revised edition of the bestselling guide to artificial intelligence, updated to Python 3.8 and TensorFlow 2, with seven new chapters that cover RNNs, AI and Big Data, fundamental use cases, machine learning data pipelines, chatbots, and more. This article explains the key concepts of the bag-of-words model.
Dividing text data into chunks
Text data usually needs to be divided into pieces for further analysis. This process is known as chunking, and it is used frequently in text analysis. The conditions used to divide the text into chunks can vary based on the problem at hand. Chunking is not the same as tokenization, where the text is also divided into pieces; during chunking, we do not adhere to any constraints other than that the output chunks need to be meaningful.
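To make the difference concrete, here is a small illustrative sketch (not part of the book's code; the sample sentence is made up, and NLTK's tokenizer data must be downloaded) that contrasts NLTK's word_tokenize, which splits text into word and punctuation tokens, with the simple word-count-based chunking used in this section:

from nltk.tokenize import word_tokenize

text = 'Chunking splits text into pieces. Tokenization splits it into tokens.'

# Tokenization: each word and punctuation mark becomes a separate token
print(word_tokenize(text))

# Chunking as used in this section: group consecutive words into fixed-size pieces
words = text.split(' ')
print([' '.join(words[i:i+5]) for i in range(0, len(words), 5)])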
When we deal with large text documents, it becomes important to divide the text into chunks to extract meaningful information. In this section, we will see how to divide input text into several pieces.
Create a new Python file and import the following packages:
import numpy as np
from nltk.corpus import brown
Define a function to divide the input text into chunks. The first parameter is the text, and the second parameter is the number of words in each chunk:
# Split the input text into chunks, where
# each chunk contains N words
def chunker(input_data, N):
    input_words = input_data.split(' ')
    output = []
Iterate through the words and divide them into chunks using the input parameter. The function returns a list:
    cur_chunk = []
    count = 0
    for word in input_words:
        cur_chunk.append(word)
        count += 1
        if count == N:
            output.append(' '.join(cur_chunk))
            count, cur_chunk = 0, []

    output.append(' '.join(cur_chunk))

    return output
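As a quick sanity check, here is an illustrative snippet (not part of text_chunker.py; the sample string is made up) showing how chunker behaves:

# Seven words split into chunks of three words each
sample = 'one two three four five six seven'
for piece in chunker(sample, 3):
    print(piece)
# Prints: 'one two three', 'four five six', 'seven'

One quirk of this implementation: when the number of words is an exact multiple of N, the final append adds an empty chunk at the end. With the 12000-word input and the chunk size of 700 used below, this does not occur.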
Define the main function and read the input data using the Brown corpus. We will read 12000 words in this case. You are free to read as many words as you want:
if __name__=='__main__':
    # Read the first 12000 words from the Brown corpus
    input_data = ' '.join(brown.words()[:12000])
Define the number of words in each chunk:
    # Define the number of words in each chunk
    chunk_size = 700
Divide the input text into chunks and display the output:
    chunks = chunker(input_data, chunk_size)

    print('\nNumber of text chunks =', len(chunks), '\n')
    for i, chunk in enumerate(chunks):
        print('Chunk', i+1, '==>', chunk[:50])
The full code is given in the file text_chunker.py. If you run the code, the output shows the number of text chunks followed by the first 50 characters of each chunk.
Now that we have explored techniques to divide the text into chunks, let’s look at methods for performing text analysis.
Extracting the frequency of terms using the Bag of Words model
One of the main goals of text analysis is to convert text into a numerical form so that we can use machine learning on it. Let’s consider text documents that contain many millions of words. To analyze these documents, we need to extract the text and convert it into a form of numerical representation.
Machine learning algorithms need numerical data to work with so that they can analyze it and extract meaningful information. This is where the Bag of Words model comes in. This model extracts a vocabulary from all the words in the documents and builds a model using a document-term matrix. This allows us to represent every document as a bag of words: we just keep track of word counts and disregard the grammatical details and the word order.
Let’s see what a document-term matrix is all about. A document-term matrix is a table with one row per document and one column per word that gives us the count of each word in each document. So, a text document can be represented as a weighted combination of various words. We can set thresholds and choose the more meaningful words. In a way, we are building a histogram of all the words in the document that will be used as a feature vector. This feature vector is used for text classification.
Consider the following sentences:
- Sentence 1: The children are playing in the hall
- Sentence 2: The hall has a lot of space
- Sentence 3: Lots of children like playing in an open space
If we consider all three sentences, ignoring case and treating Lots as lot, we have the following 14 unique words:
- the
- children
- are
- playing
- in
- hall
- has
- a
- lot
- of
- space
- like
- an
- open
Let’s construct a histogram for each sentence by counting how many times each of these words appears in it. Each feature vector will be 14-dimensional because we have 14 unique words:
- Sentence 1: [2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
- Sentence 2: [1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
- Sentence 3: [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
Now that we have extracted these feature vectors, we can use machine learning algorithms to analyze this data.
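Before moving on to a larger corpus, here is an optional sketch (not from the book) that builds the same kind of feature vectors for the three sentences above using scikit-learn's CountVectorizer (get_feature_names_out requires scikit-learn 1.0 or later). Note that CountVectorizer lowercases the text and, by default, drops single-character tokens such as "a", so its vocabulary and counts will differ slightly from the hand-built 14-word example:

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    'The children are playing in the hall',
    'The hall has a lot of space',
    'Lots of children like playing in an open space'
]

# Learn the vocabulary and build the document-term matrix
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(sentences)

# One row per sentence, one column per vocabulary word
print(vectorizer.get_feature_names_out())
print(matrix.toarray())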
Let’s see how to build a Bag of Words model using the Brown corpus from NLTK and the CountVectorizer class from scikit-learn. Create a new Python file and import the following packages:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import brown
from text_chunker import chunker
Read the input data from the Brown corpus. We will use 5,400 words. Feel free to try it with as many words as you want:
# Read the data from the Brown corpus
input_data = ' '.join(brown.words()[:5400])
Define the number of words in each chunk:
# Number of words in each chunk
chunk_size = 800
Divide the input text into chunks:
text_chunks = chunker(input_data, chunk_size)
Convert the chunks into dictionary items:
# Convert to dict items
chunks = []
for count, chunk in enumerate(text_chunks):
    d = {'index': count, 'text': chunk}
    chunks.append(d)
Extract the document-term matrix, where we get the count of each word in each chunk. We will achieve this using the CountVectorizer class, which takes two relevant parameters: min_df, the minimum document frequency, and max_df, the maximum document frequency. Here, document frequency refers to the number of chunks in which a word occurs; words whose document frequency falls outside this range are excluded from the vocabulary:
# Extract the document term matrix
count_vectorizer = CountVectorizer(min_df=7, max_df=20)
document_term_matrix = count_vectorizer.fit_transform(
        [chunk['text'] for chunk in chunks])
Extract the vocabulary and display it. The vocabulary refers to the list of distinct words that were extracted in the previous step:
# Extract the vocabulary and display it
# (use get_feature_names() on scikit-learn versions older than 1.0)
vocabulary = np.array(count_vectorizer.get_feature_names_out())
print("\nVocabulary:\n", vocabulary)
Generate the names for display:
# Generate names for chunks
chunk_names = []
for i in range(len(text_chunks)):
    chunk_names.append('Chunk-' + str(i+1))
Print the document-term matrix:
# Print the document term matrix
print("\nDocument term matrix:")
formatted_text = '{:>12}' * (len(chunk_names) + 1)
print('\n', formatted_text.format('Word', *chunk_names), '\n')
for word, item in zip(vocabulary, document_term_matrix.T):
    # 'item' is a 'csr_matrix' data structure
    output = [word] + [str(freq) for freq in item.data]
    print(formatted_text.format(*output))
The full code is given in the file bag_of_words.py. If you run the code, it prints the document-term matrix, showing each vocabulary word along with its count in every chunk.
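As an optional alternative to the manual string formatting above (not part of bag_of_words.py, and assuming pandas is installed), the same matrix can be displayed as a labeled table:

import pandas as pd

# Rows are chunks, columns are vocabulary words
df = pd.DataFrame(document_term_matrix.toarray(),
                  index=chunk_names, columns=vocabulary)

# Transpose so that words become rows, matching the table printed above
print(df.T)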
In this article, we explored the various underlying concepts in natural language processing. We also discussed the Bag of Words model, which turns arbitrary text into fixed-length vectors by counting the number of appearances of each word. NLP is just one part of the vast body of AI knowledge that Artificial Intelligence with Python, Second Edition equips you with. The new edition goes further by giving you the tools you need to explore the world of intelligent apps and create your own applications.
About the authors
Alberto Artasanchez is a data scientist with over 25 years of consulting experience with Fortune 500 companies as well as start-ups. He has an extensive background in artificial intelligence and advanced algorithms. Mr. Artasanchez holds 8 AWS certifications, including the Big Data Specialty and the Machine Learning Specialty certifications. He is an AWS Ambassador and publishes frequently in a variety of data science blogs. He is often tapped as a speaker on topics ranging from data science, big data, and analytics to underwriting optimization and fraud detection. He has a strong and extensive track record of designing and building end-to-end machine learning platforms at scale. He graduated with a Master of Science degree from Wayne State University and a Bachelor of Arts degree from Kalamazoo College. He is particularly interested in using artificial intelligence to build data lakes at scale. He is married to his lovely wife Karen and is addicted to CrossFit.
Prateek Joshi is the founder of Plutoshift and a published author of 9 books on Artificial Intelligence. He has been featured on Forbes 30 under 30, NBC, Bloomberg, CNBC, TechCrunch, and The Business Journals. He has been an invited speaker at conferences such as TEDx, Global Big Data Conference, Machine Learning Developers Conference, and Silicon Valley Deep Learning. Apart from Artificial Intelligence, some of the topics that excite him are number theory, cryptography, and quantum computing. His greater goal is to make Artificial Intelligence accessible to everyone so that it can impact billions of people around the world.