Editor’s note: Swagata Ashwani is a speaker for ODSC East 2023. Be sure to check out her talk, “Creating a Custom Vocabulary for NLP tasks using exBERT and spaCY,” there!
Natural Language Processing (NLP) tasks involve analyzing, understanding, and generating human language. The first step in any NLP task is to pre-process the text for training. This involves tokenization, the process of breaking text down into individual tokens such as words, punctuation marks, and other meaningful units.
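To make the idea of tokenization concrete, here is a minimal sketch that uses only Python’s standard library. The regular expression and the `simple_tokenize` helper are illustrative assumptions, not the tokenizer that exBERT or spaCy actually uses; real tokenizers handle many more cases (contractions, URLs, subwords).

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration: runs of word characters
    become one token, and each non-space, non-word character
    (such as a comma or period) becomes its own token.
    """
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Tokenization splits text into words, punctuation, and more."
tokens = simple_tokenize(sample)
print(tokens)
```

Whitespace is discarded, while each comma and the final period are kept as separate tokens.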
A pre-trained model usually ships with a large vocabulary. An English language model, for example, may already include over 1 million vocabulary items, many classes of entity recognition, and a lot of compound noun recognition. But what happens when you need to add new terms and customize the vocabulary? In this tutorial, we’ll explore how to create a custom vocabulary that can then be used for any NLP task.
Introduction to Language Models
Before we dive into creating custom vocabularies, it’s essential to understand the terminology related to language models. A language model is a statistical model that assigns probabilities to a sequence of words. The probability of a particular sequence of words is calculated based on the probability of each word in the sequence given the previous words. Vocabulary refers to the set of words used in a language model. The size of the vocabulary determines the number of unique tokens that the model can recognize.
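The definitions above can be sketched with a toy model. The example below assumes a bigram simplification, where each word is conditioned only on the single previous word rather than on all previous words, and it uses a made-up three-sentence corpus. The names `corpus`, `START`, and `sequence_probability` are hypothetical and exist only for this illustration.

```python
from collections import Counter

# Made-up training corpus (an assumption for illustration only).
corpus = [
    "the patient has a fever",
    "the patient has a cough",
    "the doctor has a plan",
]

START = "<s>"  # hypothetical start-of-sentence marker

context_counts = Counter()  # how often each word appears as a context
bigram_counts = Counter()   # how often each (previous, word) pair appears
for sentence in corpus:
    tokens = [START] + sentence.split()
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

# The vocabulary is the set of unique tokens the model has seen.
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})

def sequence_probability(sentence):
    """Multiply P(word | previous word) over the whole sentence."""
    tokens = [START] + sentence.split()
    probability = 1.0
    for previous, word in zip(tokens, tokens[1:]):
        seen = context_counts[previous]
        if seen == 0:
            return 0.0  # the context itself was never observed
        probability *= bigram_counts[(previous, word)] / seen
    return probability

print(len(vocabulary))
print(sequence_probability("the patient has a fever"))
print(sequence_probability("the patient has a rash"))
```

The last sentence contains "rash", which is outside the vocabulary, so this unsmoothed toy model assigns it a probability of zero. That is the same gap a custom vocabulary is meant to close for domain-specific terms.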
Why do we need a custom vocabulary?
In some scenarios, a custom vocabulary is necessary. For example, if you’re working on a domain-specific NLP task, such as medical records or legal documents, the model may need to recognize terms that aren’t included in the standard vocabulary. Additionally, if you’re working with social media text, you may need to include slang or informal language.
How to add custom terms to a vocabulary?
There are several approaches to adding custom terms to a vocabulary, but in this tutorial we’ll focus on exBERT and the spaCy tokenizer. exBERT is an approach for extending the existing vocabulary of a pre-trained BERT model with additional terms. spaCy is an open-source NLP library whose tokenizer can be customized to handle new terms.
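On the spaCy side, one way to customize tokenization is to register a special case, so that a chosen string is always kept as a single token. The sketch below assumes spaCy is installed and uses a blank English pipeline, which needs no model download. The term "x-ray" and the sample sentence are assumptions chosen for illustration, and this is only one of several customization hooks that spaCy offers.

```python
import spacy
from spacy.symbols import ORTH

# A blank English pipeline contains only the rule-based tokenizer.
nlp = spacy.blank("en")

text = "The patient needs an x-ray today"

# Default rules usually split on a hyphen between letters.
tokens_before = [token.text for token in nlp(text)]

# Register the custom term as a special case. The ORTH values
# must join back to exactly the original string.
nlp.tokenizer.add_special_case("x-ray", [{ORTH: "x-ray"}])

tokens_after = [token.text for token in nlp(text)]

print(tokens_before)
print(tokens_after)
```

Special cases match whole whitespace-delimited chunks (after surrounding punctuation is stripped), so they suit fixed terms such as abbreviations or product names rather than open-ended patterns.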
A step-by-step approach to creating a custom vocabulary in Python
1. Install the required libraries: To use exBERT and the spaCy tokenizer, you need to install the transformers and spaCy libraries (for example, with pip install transformers spacy).
2. Load the pre-trained language model: Load the pre-trained language model that you want to customize. For example, you can use the BERT model from the Hugging Face transformers library.
3. Tokenize the text: Tokenize the text using the pre-trained tokenizer. This produces a list of tokens and shows how the existing vocabulary splits your text, which helps you spot terms worth adding to a custom vocabulary.
4. Create a list of custom terms: Create a list of custom terms that you want to add to the vocabulary.
5. Add the custom terms to the tokenizer: Use the exBERT approach to add the custom terms to the tokenizer.
6. Test the tokenizer: Test the tokenizer by tokenizing some sample text that includes the custom terms. If the tokenizer recognizes the custom terms, then you’ve successfully created a custom vocabulary.
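Steps 3 to 6 can be sketched with the Hugging Face transformers library. Two assumptions are worth stating loudly. First, the sketch uses the tokenizer’s `add_tokens` method as a simpler stand-in for extending the vocabulary; it is not the exBERT training procedure itself. Second, to stay runnable without a download, it builds a BERT tokenizer from a tiny made-up WordPiece vocabulary; in practice you would load a real one, for example with `BertTokenizer.from_pretrained("bert-base-uncased")`. The term "angioedema" and the sample sentence are also illustrative.

```python
import tempfile
from pathlib import Path

from transformers import BertTokenizer

# Made-up toy WordPiece vocabulary (assumption for an offline demo).
toy_vocab = [
    "[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
    "the", "patient", "has", "ang", "##io", "##ede", "##ma",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = Path(tmp_dir) / "vocab.txt"
    vocab_path.write_text("\n".join(toy_vocab) + "\n", encoding="utf-8")
    # The vocabulary file is read when the tokenizer is constructed.
    tokenizer = BertTokenizer(str(vocab_path), do_lower_case=True)

text = "The patient has angioedema"

# Step 3: tokenize with the existing vocabulary.
tokens_before = tokenizer.tokenize(text)
size_before = len(tokenizer)

# Step 4: list the custom terms to add.
custom_terms = ["angioedema"]

# Step 5: add them; add_tokens returns how many were actually new.
num_added = tokenizer.add_tokens(custom_terms)

# Step 6: tokenize again and check that the term stays whole.
tokens_after = tokenizer.tokenize(text)

print(tokens_before)
print(tokens_after)
print(num_added, size_before, len(tokenizer))
```

When the tokenizer is paired with a model, the model’s embedding matrix also has to grow to match, typically with `model.resize_token_embeddings(len(tokenizer))`. The embeddings for the new tokens start out untrained, so the model needs further training on domain text before those tokens carry useful meaning.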
Swagata Ashwani is a Data Professional with over 6 years of experience in the Healthcare, Retail, and Platform Integration industries. She is an avid blogger and writes about state-of-the-art developments in the AI space. She is particularly interested in Natural Language Processing, and focuses on researching how to make NLP models work in a practical setting. In her spare time, she loves to play her guitar, sip masala chai, and find new spots for doing Yoga. Connect with her on LinkedIn.