The AI field of Natural Language Processing, or NLP, with its gigantic language models (yes, GPT-3, I'm looking at you), is delivering what is perceived as a revolution in machines' ability to perform the most diverse language tasks.
Because of that, public perception is split: some believe these new language models will pave the way to a Skynet type of technology, while others dismiss them as hype-fueled tech that will end up on dusty shelves, or forgotten hard drives, in little to no time.
Invitation to Learn NLP
Motivated by this, I’m creating this series of stories that will help you learn NLP from scratch in a friendly way.
I’m also inviting you to join me in this series to learn NLP and be well-versed in an AI language-model-shaped future.
To join me, you'll only need a little experience with Python and Jupyter Notebooks, and for the most part, you won't even need to have anything installed on your machine.
This series will differ dramatically from the Stanford course in the depth at which we'll approach statistics and calculus. I'll try my best to avoid getting into the specifics since, for the most part, we will be using Python libraries that already implement most of the structures we will need. Still, if you want to learn more about those topics, I strongly advise you to study the course notes.
We will use Deepnote to create our Python notebooks and develop the whole course using the cloud.
Why Deepnote? Deepnote extends the Jupyter notebooks experience with real-time collaboration and offers free compute with no pre-install. You can duplicate my Deepnote notebook here and follow me as I walk through this project for the best experience.
For the course, we will use Stanford's Winter 2020 CS224N material as a guideline, since it has a comprehensive approach, a YouTube playlist containing the lectures, and other resources made available by Stanford's students. If you wish to know more about the course, you can access its website.
We will start with NLP's basics and work our way up to its key methods: RNNs, attention, transformers, and more. At the end of the course, we will be able to create applications covering some of the following:
- Word meaning
- Dependency parsing
- Machine translation
- Question answering
A brief introduction to NLP
“Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.” — Wikipedia
Besides showing that NLP is a vast multidisciplinary field, this definition leads us to a question: how can we make computer programs analyze natural language data?
The first step is learning how we can represent words and their meanings in a computational environment.
Meaning of Words
For a long time, NLP work was largely based on modeling word synonyms and hypernyms. One way to build those sets was by looking up a word's definition in a dictionary.
We can do this by using Python, with a library called NLTK.
Trying out NLTK to learn NLP
“NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.” — Nltk.org
With NLTK, we can search for a word meaning by using a built-in lexical database called WordNet. WordNet presents nouns, verbs, adjectives, and adverbs grouped in sets of cognitive synonyms — synsets — with each synset representing a distinct concept.
To start, let's log in to Deepnote and create a "New Project." With the notebook open, let's install the NLTK library by typing the command below in a cell and running it with Shift+Enter (for those of you who use a different Python notebook platform, the shortcuts you already know should work just fine).
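In my notebook, that install cell looks something like this:

```python
# Install the NLTK library from inside the notebook
!pip install nltk
```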
After this, we need to import the NLTK library and download the WordNet database.
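A minimal cell for that step could be:

```python
import nltk

# Download the WordNet lexical database used in the examples below
nltk.download('wordnet')
```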
And with this, we are all set. To get the synset objects for a word like 'language,' we import the WordNet corpus and call its synsets() method. The resulting objects don't give us all the information we need about the word at a glance, just a terse identifier for each synset. For better viewing, we can loop over the results and format each synset using its pos() and lemmas() methods, with the help of a small mapping that "pretty prints" the part-of-speech tags.
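Assuming the setup above, a sketch of those cells might look like this (the poses dictionary is a helper of my own to make the tags readable):

```python
from nltk.corpus import wordnet as wn

# All synsets that contain the word 'language'
synsets = wn.synsets('language')
print(synsets)

# Pretty-print each synset as "pos: lemma, lemma, ..."
poses = {'n': 'noun', 'v': 'verb', 'a': 'adj', 's': 'adj (sat)', 'r': 'adv'}
for synset in synsets:
    lemmas = ', '.join(lemma.name() for lemma in synset.lemmas())
    print(f"{poses[synset.pos()]}: {lemmas}")
```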
For more information about the WordNet package in NLTK, you can check the NLTK documentation.
NLTK Pitfalls
You can see it works well as a dictionary, but it has a few problems when it comes to developing NLP applications.
It's very subjective; it requires massive amounts of human labor, which makes the corpus virtually impossible to maintain. It also can't calculate word similarity effectively, which is really important for our applications. That can lead us to write unreliable or easily outdated AI software.
Discrete Representations
Another approach was to use discrete vectors (one-hot vectors of 0's and 1's) to represent different words, but this method also has several pitfalls. Every pair of these vectors is orthogonal, so there is no natural notion of word similarity; working around that mainly meant falling back on WordNet's lists of synonyms, which brings back the problems we just saw.
Because of that, the field moved on to another approach that uses word vectors to represent words.
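To make that concrete, here is a tiny illustration with a toy vocabulary of my own choosing: the dot product between any two distinct one-hot vectors is zero, so this representation carries no notion of similarity.

```python
import numpy as np

# One-hot vectors for two words in a toy five-word vocabulary
motel = np.array([0, 0, 0, 1, 0])
hotel = np.array([0, 1, 0, 0, 0])

# Orthogonal vectors: the similarity between 'motel' and 'hotel' comes out as 0
print(np.dot(motel, hotel))  # 0
```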
Representing words by their context
"You shall know a word by the company it keeps" (Firth, J. R. 1957:11).
The concept of word vectors enables us to work with a word and its context (the words that appear near it). That allows us to measure the similarity between words across different contexts.
Word Vectors (also called embeddings or representations)
Word vectors are n-dimensional vectors of (mostly non-zero) real values that represent a word by its relationship to other words. For each word, a dense vector like the following is built:
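Purely as an illustration (these values are made up, not taken from a trained model), a word vector might look like this:

```python
# A dense word vector: every dimension holds a real value learned from data
banking = [0.286, 0.792, -0.177, -0.107, 0.109, -0.542, 0.349, 0.271]
```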
If you wish to expand your knowledge on the word vectors topic, I recommend this awesome notebook written by Allison Parrish.
Word vectors can be created with different methods. In Stanford’s CS224N course, the Word2Vec (Mikolov et al. 2013)[1][2] framework is presented:
Overview of Word2Vec:
- Assemble a large corpus of text
- Represent every word in a fixed vocabulary by an n-dimensional vector
- For each position in the text, define a center word and context words.
- Use the similarity of the vectors to calculate the probability of a context word given a center word (see the sketch after this list)
- Repeat and adjust the word vectors to maximize this probability
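As a rough sketch of steps 4 and 5, the probability of a context word o given a center word c is usually computed as a softmax over the dot products of their vectors. In the hypothetical snippet below, U holds one "outside" vector per vocabulary word and v_c is the center word's vector:

```python
import numpy as np

def context_probability(o, v_c, U):
    """P(o | c): softmax of the dot products between the center vector v_c
    and every outside word vector in U; o is the context word's index."""
    scores = U @ v_c          # one dot product per vocabulary word
    scores -= scores.max()    # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()
```

Training then adjusts the vectors so that this probability is high for word pairs that actually appear together in the corpus.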
This process is mainly implemented by using neural networks to learn the association between words. We won’t be implementing the Word2Vec framework to train a model; instead, we will use the Word2Vec model from the Python library gensim.
“Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Target audience is the natural language processing (NLP) and information retrieval (IR) community.” — Gensim Website
Exploring word vector relationships with the gensim library
This example will use gensim's built-in downloader (the api module) and its Word2Vec class to download a corpus of text and train a Word2Vec model, so we can visualize some interesting word vector features. First, we need to install the gensim package.
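In a notebook cell, that install can look like this:

```python
# Install gensim from inside the notebook
!pip install gensim
```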
Now we need to obtain the corpus that we will use to create our Word2Vec model. To do that, we can use the api module to download the text8 corpus, which was built from text extracted from Wikipedia. After that, we use the corpus to create our Word2Vec model: we import the Word2Vec class and instantiate it, passing the corpus as a constructor parameter.
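Those two steps might look like the sketch below; downloading text8 and training the model can take a few minutes:

```python
import gensim.downloader as api
from gensim.models import Word2Vec

# Download the text8 corpus (text extracted from Wikipedia)
corpus = api.load('text8')

# Train a Word2Vec model on the corpus with gensim's default parameters
model = Word2Vec(corpus)
```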
We can already work with the word vectors by performing tasks like finding similar words and selecting words that don’t fit in a group. You can read gensim’s docs to find out more about the available word vector operations.
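For example, assuming the model trained above, two of those operations look like this (the exact output depends on training):

```python
# Words most similar to 'tree' according to the learned vectors
print(model.wv.most_similar('tree'))

# The word that fits the group least
print(model.wv.doesnt_match(['fire', 'water', 'land', 'sea', 'air', 'car']))
```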
Remember that perhaps the biggest pitfall of NLTK was not being able to calculate the similarity between two words? With word vectors, we can compute that similarity directly and use it to perform even more complex tasks. We can, for instance, find analogies between sets of words.
We can perform mathematical operations with words and model expressions like: "King - Man + Woman = ?"
To evaluate that expression using gensim, we can use the most_similar() method, passing 'woman' and 'king' as positive values and 'man' as a negative value.
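With the model above, that call is a one-liner:

```python
# king - man + woman ≈ ?
print(model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=3))
```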
We can also create an analogy function to make it easier to perform this operation:
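A minimal version of that helper, using the model we just trained, could be:

```python
def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' with the trained word vectors."""
    result = model.wv.most_similar(positive=[b, c], negative=[a])
    return result[0][0]

# Hopefully returns something close to 'queen'
print(analogy('man', 'king', 'woman'))
```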
Word vectors laid the foundation for modern distributed word representations and, consequently, paved the way for NLP advancements.
Conclusion on learning NLP
In the next post, we will discuss word vectors and word senses, topics of the Stanford course’s second lecture. I hope you enjoyed reading this post. If you have any questions, feel free to leave a comment.
Thank you for your time.
Take care, and keep coding!
References to learn NLP
– CS 224N Lecture 1 Slides
– CS 224N Lecture 1 Video
– [1] — Efficient Estimation of Word Representations in Vector Space
– [2] — Distributed Representations of Words and Phrases and their Compositionality
Software and Libraries
Article by Thiago Candido