Introduction
In the rapidly evolving landscape of artificial intelligence, especially in NLP, large language models (LLMs) have swiftly transformed how we interact with technology. Since the ‘Attention Is All You Need’ paper in 2017, the Transformer architecture, which underlies models such as BERT, GPT-3, and ChatGPT, has become pivotal. GPT-3, for example, excels at generating coherent text. This article explores how to adapt LLMs, with BERT as the running example, to specific tasks through pre-training, fine-tuning, and prompting.
Prerequisites: Knowledge of Transformers, BERT, and Large Language Models.
What are LLMs?
LLM stands for Large Language Model. LLMs are deep learning models designed to understand and generate human-like text and to perform various tasks such as sentiment analysis, language modeling (next-word prediction), text generation, text summarization, and much more. They are trained on a huge amount of text data.
We use applications based on these LLMs daily without even realizing it. Google uses BERT (Bidirectional Encoder Representations from Transformers) for various applications such as query completion, understanding the context of queries, outputting more relevant and accurate search results, language translation, and more.
Deep learning techniques, specifically deep neural networks and advanced methods like self-attention, underpin the construction of these models. They learn the language’s patterns, structures, and semantics by training on extensive text data. Given their reliance on enormous datasets, training them from scratch consumes substantial time and resources, rendering it impractical.
There are techniques by which we can directly use these models for a specific task. So let’s discuss them in detail!
Ways to Train Large Language Models
While we can train these models for a specific task with conventional fine-tuning, simpler approaches are now available as well. Before we get to them, let’s discuss how an LLM is pre-trained.
Pretraining
In pretraining, a vast amount of unlabeled text serves as the training data for a large language model. The question is, ‘How can we train a model on unlabeled data and then expect it to predict accurately?’ This is where ‘self-supervised learning’ comes in. In self-supervised learning, the labels come from the text itself: part of the text is hidden, and the model tries to predict the hidden part from the rest, for example the next word from the preceding words.
E.g. Suppose we have a sentence: ‘I am a data scientist’.
The model can create its own labeled data from this sentence like:
| Text | Label |
| --- | --- |
| I | am |
| I am | a |
| I am a | data |
| I am a data | scientist |
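The rows of this table can be generated mechanically from the sentence. The following is a minimal sketch in plain Python; the helper name `next_word_pairs` is made up for illustration, and splitting on whitespace is a simplifying assumption (real models work on subword tokens).

```python
def next_word_pairs(sentence):
    """Turn one sentence into (context, next word) training pairs."""
    words = sentence.split()  # assumption: whitespace tokenization
    pairs = []
    for i in range(1, len(words)):
        context = " ".join(words[:i])  # everything the model has seen so far
        label = words[i]               # the word the model must predict
        pairs.append((context, label))
    return pairs


pairs = next_word_pairs("I am a data scientist")
for context, label in pairs:
    print(f"{context!r} -> {label!r}")
```

No human labeling is involved: every label is simply the word that follows its context in the original text.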
This is next-word prediction, and models trained this way are called auto-regressive. A closely related objective is masked language modeling (MLM): instead of predicting the word that comes next, the model predicts a word that has been hidden (masked) somewhere in the sentence. BERT is a masked language model. We can think of MLM as a ‘fill in the blank’ exercise, in which the model predicts which word fits in the blank.
There are different self-supervised objectives, but in this article we focus on BERT and MLM. Because BERT is bidirectional, it can look at both the preceding and the succeeding words to understand the context of the sentence and predict the masked word.
So, as a high-level overview, pre-training is a technique in which the model learns to predict missing words in text: the next word for auto-regressive models, or a masked word for models like BERT.
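To illustrate the ‘fill in the blank’ idea, here is a simplified sketch of how masked training examples could be built from unlabeled tokens. It is not BERT’s exact procedure (BERT also replaces some selected tokens with random words or leaves them unchanged); the function name `mask_tokens`, the `mask_prob` default, and the fixed seed are assumptions made for this example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide some tokens; return the masked sequence and the labels to predict."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)   # the model sees a blank here
            labels.append(tok)    # and must recover the original word
        else:
            masked.append(tok)
            labels.append(None)   # no prediction needed at this position
    return masked, labels


tokens = "I am a data scientist".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3)
print(masked)
print(labels)
```

Unlike the next-word setup, the words on both sides of each blank stay visible, which is what allows a bidirectional model to use the full context.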
Finetuning
Finetuning is tweaking the model’s parameters to make it suitable for a specific task. After pretraining, the model is trained further on a smaller, task-specific dataset for tasks like sentiment analysis, text generation, and finding document similarity, to name a few. We don’t have to train the model again on a large text corpus. Instead, we start from the pretrained model and adapt it to the task we want to perform. We will discuss how to finetune a Large Language Model in detail later in this article.
Prompting
Prompting is the easiest of the three techniques, but it can be a bit tricky. It involves giving the model a context (the prompt) based on which the model performs the task.
Think of it as teaching a child a chapter from their book in detail, being very thorough in the explanation, and then asking them to solve a problem related to that chapter.
In the context of LLMs, take ChatGPT, for example. We set a context and ask the model to follow the instructions to solve the given problem.
Suppose I want ChatGPT to ask me interview questions on Transformers only.
For a better experience and accurate output, you need to set a proper context and give a detailed task description.
Example:
I am a Data Scientist with 2 years of experience, preparing for a job interview at XYZ company. I love problem-solving and am currently working with state-of-the-art NLP models. I am up to date with the latest trends and technologies. Ask me very tough questions on the Transformer model that the interviewer of this company can ask based on the company’s previous experience. Ask me 10 questions and also give the answers to the questions.
The more detailed and specific your prompt, the better the results. The most fun part is that you can generate the prompt with the model itself and then add a personal touch or any missing information.
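If you send similar prompts often, it can help to assemble them from their parts (who you are, the context, and the task). The sketch below is one possible way to do that in plain Python; the helper `build_prompt` and its parameters are hypothetical and not part of any library.

```python
def build_prompt(role, background, task, n_questions):
    """Assemble a prompt from a role, some background, and a task description."""
    return (
        f"I am {role}. {background} "
        f"{task} "
        f"Ask me {n_questions} questions and also give the answers to the questions."
    )


prompt = build_prompt(
    role="a Data Scientist with 2 years of experience",
    background="I am preparing for a job interview at XYZ company.",
    task="Ask me very tough questions on the Transformer model.",
    n_questions=10,
)
print(prompt)
```

Keeping the parts separate makes it easy to change one piece of context, such as the topic or the number of questions, without rewriting the whole prompt.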
Finetuning Technique
There are different ways to finetune a model conventionally, and the different approaches depend on the specific problem you want to solve. Let’s discuss the techniques to fine-tune a model.
There are 3 ways of conventionally finetuning an LLM.
- Feature Extraction: This technique extracts features (embeddings) from a given text. Why would we want embeddings? Since computers do not understand raw text, we need a numerical representation of the text that can be used to perform different tasks. Once the embeddings are extracted, they can be used to analyze sentiment, find document similarity, and so on. In feature extraction, the backbone layers of the model are frozen, i.e., the parameters of those layers are not updated, and only the parameters of the classifier layers are updated. The classifier layers are the fully connected layers added on top of the backbone.
- Full Model Finetuning: As the name suggests, this technique trains each model layer on the custom dataset for several epochs. The parameters of all the layers in the model are adjusted according to the new custom dataset. This can improve the model’s accuracy on the data and the specific task we want to perform. It is computationally expensive and takes a lot of time for the model to train, considering there are billions of parameters in the LLM.
- Adapter-Based Finetuning: Adapter-based finetuning is a comparatively new concept in which an additional, randomly initialized layer or module is added to the network and then trained for a specific task. In this technique, the original parameters of the model are left unchanged, and only the adapter parameters are trained. This makes it possible to tune the model in a computationally efficient manner.
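To make the difference between the three approaches concrete, the sketch below counts the trainable parameters under each one. It is a minimal illustration rather than the BERT setup: the tiny `backbone`, `head`, and `adapter` modules, their sizes, and the `count_trainable` helper are hypothetical stand-ins, and training the task head together with the adapter is an assumption of this example.

```python
import torch.nn as nn


def make_model():
    """Tiny stand-in for a pretrained backbone plus a task-specific head."""
    backbone = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
    head = nn.Linear(16, 2)
    return backbone, head


def count_trainable(*modules):
    """Number of parameters that would receive gradient updates."""
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)


# 1. Feature extraction: freeze the backbone, train only the head
backbone, head = make_model()
for p in backbone.parameters():
    p.requires_grad = False
feature_extraction = count_trainable(backbone, head)

# 2. Full model finetuning: every parameter stays trainable
backbone, head = make_model()
full = count_trainable(backbone, head)

# 3. Adapter-based: freeze the backbone, add a small new module and train it
backbone, head = make_model()
for p in backbone.parameters():
    p.requires_grad = False
adapter = nn.Sequential(nn.Linear(16, 4), nn.ReLU(), nn.Linear(4, 16))
adapter_based = count_trainable(backbone, adapter, head)

print("feature extraction:", feature_extraction)
print("full finetuning:   ", full)
print("adapter-based:     ", adapter_based)
```

Even in this toy setting, full finetuning updates every weight, while the other two approaches update only a small fraction. With a model the size of an LLM, that gap is what drives the difference in compute cost.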
Finetuning BERT
Now that we know the finetuning techniques, let’s perform sentiment analysis on the IMDB movie reviews using BERT. BERT is an encoder-only large language model built from stacked Transformer layers. Google developed it, and it has proven to perform very well on various tasks. BERT comes in different sizes and has inspired many variants, such as BERT-base-uncased, BERT Large, RoBERTa, LegalBERT, and many more.
For free GPU availability, it is recommended to use Google Colab. Since BERT (Bidirectional Encoder Representations from Transformers) is based on Transformers, the first step is to install the transformers library in our environment.
!pip install transformers
Let’s load some libraries that will help us to load the data as required by the BERT model, tokenize the loaded data, load the model we will use for classification, perform train-test-split, load our CSV file, and some more functions.
import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel
We will use the GPU for faster computation when one is available and fall back to the CPU otherwise.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
The next step would be to load our dataset and look at the first 5 records in the dataset.
df = pd.read_csv('/content/drive/MyDrive/movie.csv')
df.head()
Training and Validation Sets
We will split our dataset into training and validation sets. You can also split the data into train, validation, and test sets, but for the sake of simplicity, I am just splitting the dataset into training and validation.
x_train, x_val, y_train, y_val = train_test_split(df.text, df.label, random_state = 42, test_size = 0.2, stratify = df.label)
Let us import and load the BERT model and tokenizer.
from transformers.models.bert.modeling_bert import BertForSequenceClassification
# import BERT-base pre-trained model
BERT = BertModel.from_pretrained('bert-base-uncased')
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
We will use the tokenizer to convert the text into tokens, padding or truncating every review to a length of 250 tokens.
# padding='max_length' replaces the deprecated pad_to_max_length=True argument
train_tokens = tokenizer(x_train.tolist(), max_length = 250, padding = 'max_length', truncation = True)
val_tokens = tokenizer(x_val.tolist(), max_length = 250, padding = 'max_length', truncation = True)
The tokenizer returns a dictionary-like object with three entries: input_ids, the integer IDs of the tokens in each text; token_type_ids, a list of integers that distinguishes between different segments or parts of the input; and attention_mask, which marks real tokens with 1 and padding with 0 so the model knows which tokens to attend to.
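The interplay of padding, truncation, and the attention mask is easier to see on a toy example. The sketch below is not the BERT tokenizer (which uses WordPiece subwords and adds special [CLS] and [SEP] tokens); the `encode` function, the three-word `vocab`, and the `pad_id`/`unk_id` values are made up purely for illustration.

```python
def encode(text, vocab, max_length, pad_id=0, unk_id=1):
    """Toy whitespace tokenizer that pads or truncates to max_length."""
    ids = [vocab.get(word, unk_id) for word in text.lower().split()]
    ids = ids[:max_length]             # truncation: drop tokens past the limit
    mask = [1] * len(ids)              # 1 marks a real token
    n_pad = max_length - len(ids)      # padding: fill up to the limit
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,  # 0 marks padding
    }


vocab = {"great": 2, "movie": 3, "boring": 4}
short = encode("Great movie", vocab, max_length=4)
long = encode("great great boring movie movie", vocab, max_length=4)
print(short)
print(long)
```

Every encoded text ends up with the same length, which is what allows the reviews to be stacked into a single tensor in the next step.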
Converting these values into tensors
train_ids = torch.tensor(train_tokens['input_ids'])
train_masks = torch.tensor(train_tokens['attention_mask'])
train_label = torch.tensor(y_train.tolist())
val_ids = torch.tensor(val_tokens['input_ids'])
val_masks = torch.tensor(val_tokens['attention_mask'])
val_label = torch.tensor(y_val.tolist())
Loading TensorDataset and DataLoaders to preprocess the data further and make it suitable for the model.
from torch.utils.data import TensorDataset, DataLoader
train_data = TensorDataset(train_ids, train_masks, train_label)
val_data = TensorDataset(val_ids, val_masks, val_label)
train_loader = DataLoader(train_data, batch_size = 32, shuffle = True)
# shuffling is not needed for validation
val_loader = DataLoader(val_data, batch_size = 32, shuffle = False)
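To see what a DataLoader yields, the following sketch runs the same pattern on small random tensors instead of the tokenized reviews. The sizes (10 samples, sequence length 8, batch size 4) are arbitrary values chosen for this example.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Dummy stand-ins for input_ids, attention_mask, and labels
ids = torch.randint(0, 100, (10, 8))
masks = torch.ones(10, 8, dtype=torch.long)
labels = torch.randint(0, 2, (10,))

loader = DataLoader(TensorDataset(ids, masks, labels), batch_size=4, shuffle=False)

shapes = []
for batch_ids, batch_masks, batch_labels in loader:
    # Each batch is a list of tensors in the order they were given to TensorDataset
    shapes.append(tuple(batch_ids.shape))
    print(batch_ids.shape, batch_masks.shape, batch_labels.shape)
```

The last batch is smaller than the others whenever the dataset size is not a multiple of the batch size, which is why averaging the loss over batches is only an approximation of the per-sample average.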
Our task is to freeze the parameters of BERT and train only the classifier layers that we add on top of it on our custom dataset. So, let’s freeze the parameters of the model.
for param in BERT.parameters():
    param.requires_grad = False
Now, we have to define the forward pass for the layers that we have added. The BERT model acts as a feature extractor, and its output for the [CLS] token is passed through our classifier layers. PyTorch’s autograd derives the backward pass automatically.
class Model(nn.Module):
    def __init__(self, bert):
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        # classifier layers: 768 is the hidden size of bert-base
        self.fc1 = nn.Linear(768, 512)
        self.fc2 = nn.Linear(512, 2)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, sent_id, mask):
        # Pass the inputs to the model
        outputs = self.bert(sent_id, mask)
        # Hidden state of the first ([CLS]) token of every sequence
        cls_hs = outputs.last_hidden_state[:, 0, :]
        x = self.fc1(cls_hs)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        # Log-probabilities over the two classes
        x = self.softmax(x)
        return x
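As a quick sanity check of the shapes involved, the classifier layers can be run on random numbers in place of real BERT output. This is only a sketch: the random `hidden` tensor stands in for `last_hidden_state`, the batch size of 4 is arbitrary, and the `head` variable is a standalone copy of the layers above rather than the model itself.

```python
import torch
import torch.nn as nn

# Same layer sizes as the classifier above, without the BERT backbone
head = nn.Sequential(
    nn.Linear(768, 512),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(512, 2),
    nn.LogSoftmax(dim=1),
)
head.eval()  # disable dropout for a deterministic forward pass

# Fake hidden states: (batch, sequence length, hidden size)
hidden = torch.randn(4, 250, 768)
cls_hs = hidden[:, 0, :]      # keep only the first token of each sequence
log_probs = head(cls_hs)

print(cls_hs.shape)
print(log_probs.shape)
print(log_probs.exp().sum(dim=1))  # each row of probabilities sums to 1
```

The output is a log-probability for each of the two classes per review, so the predicted class is the index of the larger value in each row.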
Let’s move the model to GPU.
model = Model(BERT)
# push the model to GPU
model = model.to(device)
Defining the optimizer
# AdamW optimizer from PyTorch (the AdamW class in transformers is deprecated)
from torch.optim import AdamW
# define the optimizer; frozen parameters receive no gradients and are not updated
optimizer = AdamW(model.parameters(), lr = 1e-5)
We have preprocessed the dataset and defined our model. Now it is time to train the model. We have to write code to train and evaluate the model.
The train function:
def train():
    model.train()
    total_loss = 0
    total_preds = []
    # The model outputs log-probabilities (LogSoftmax), so NLLLoss is the matching loss
    loss_function = nn.NLLLoss()
    for step, batch in enumerate(train_loader):
        # Move the batch to the device the model is on
        batch = [item.to(device) for item in batch]
        sent_id, mask, labels = batch
        # Clear previously calculated gradients
        optimizer.zero_grad()
        # Get model predictions for the current batch
        preds = model(sent_id, mask)
        # Calculate the loss between predictions and labels
        loss = loss_function(preds, labels)
        # Add to the total loss
        total_loss += loss.item()
        # Backward pass and gradient update
        loss.backward()
        optimizer.step()
        # Move predictions to CPU and convert to numpy array
        preds = preds.detach().cpu().numpy()
        # Append the model predictions
        total_preds.append(preds)
    # Compute the average loss over all batches
    avg_loss = total_loss / len(train_loader)
    # Concatenate the predictions
    total_preds = np.concatenate(total_preds, axis=0)
    # Return the average loss and predictions
    return avg_loss, total_preds
The evaluation function:
def evaluate():
    model.eval()
    total_loss = 0
    total_preds = []
    # The model outputs log-probabilities (LogSoftmax), so NLLLoss is the matching loss
    loss_function = nn.NLLLoss()
    # Evaluation must not update the weights, so no gradients are computed
    # and there is no backward pass or optimizer step
    with torch.no_grad():
        for step, batch in enumerate(val_loader):
            # Move the batch to the device the model is on
            batch = [item.to(device) for item in batch]
            sent_id, mask, labels = batch
            # Get model predictions for the current batch
            preds = model(sent_id, mask)
            # Calculate the loss between predictions and labels
            loss = loss_function(preds, labels)
            # Add to the total loss
            total_loss += loss.item()
            # Move predictions to CPU and convert to numpy array
            preds = preds.cpu().numpy()
            # Append the model predictions
            total_preds.append(preds)
    # Compute the average loss over all batches
    avg_loss = total_loss / len(val_loader)
    # Concatenate the predictions
    total_preds = np.concatenate(total_preds, axis=0)
    # Return the average loss and predictions
    return avg_loss, total_preds
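Both functions return the raw predictions, which can be turned into an accuracy score. The sketch below shows one way to do that with NumPy; the `accuracy` helper and the hand-written example values are hypothetical, and the labels must be in the same order as the predictions (with a shuffled loader, that means collecting the labels inside the same loop).

```python
import numpy as np


def accuracy(log_probs, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    predicted = np.argmax(log_probs, axis=1)  # most likely class per row
    return float((predicted == np.asarray(labels)).mean())


# Hand-made log-probabilities for four reviews and two classes
log_probs = np.log(np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.3, 0.7],
]))
labels = [0, 1, 1, 1]

score = accuracy(log_probs, labels)
print(score)
```

Tracking accuracy next to the loss gives a more interpretable picture of how well the classifier separates positive from negative reviews.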
Train the Model
We will now use these functions to train the model:
# set initial loss to infinite
best_valid_loss = float('inf')
# define the number of epochs
epochs = 5
# empty lists to store training and validation loss of each epoch
train_losses = []
valid_losses = []
# for each epoch
for epoch in range(epochs):
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))
    # train model
    train_loss, _ = train()
    # evaluate model
    valid_loss, _ = evaluate()
    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')
And there you have it. You can now use your trained model to run inference on any text you choose.
Conclusion
This article explored the world of LLMs and BERT and their significant impact on natural language processing (NLP). We discussed the pretraining process, where LLMs are trained on large amounts of unlabeled text using self-supervised learning. We also delved into finetuning, which involves adapting a pre-trained model for specific tasks, and prompting, where models are provided with context to generate relevant outputs. Additionally, we examined different finetuning techniques, such as feature extraction, full model finetuning, and adapter-based finetuning. LLMs have revolutionized NLP and continue to drive advancements in various applications.
Key Takeaways
- LLMs, such as BERT, are powerful models trained on vast amounts of text data, enabling them to understand and generate human-like text.
- Pretraining involves training LLMs on unlabeled text using self-supervised learning techniques like masked language modeling (MLM).
- Finetuning is adapting a pre-trained LLM for specific tasks by extracting features, training the entire model, or using adapter-based techniques, depending on the requirements.
Frequently Asked Questions
A. LLMs employ self-supervised learning techniques such as next-word prediction and masked language modeling, where they predict a hidden word based on the context of the surrounding words, effectively creating labeled data from unlabeled text.
A. Finetuning allows LLMs to adapt to specific tasks by adjusting their parameters, making them suitable for sentiment analysis, text generation, or document similarity tasks. It builds upon the pre-trained knowledge of the model.
A. Prompting involves providing context or instructions to LLMs to generate relevant outputs. Users can guide the model to answer questions, generate text, or perform specific tasks based on the given context by setting a specific prompt.