As I’m writing this, the model library on Huggingface consists of 11,256 models, and by the time you’re reading this, that number will only have increased. With so many models to choose from, it is no wonder that many practitioners get overwhelmed and no longer know which model to pick for their NLP task.
It’d be great if there were a convenient way to try out different models for the same task and compare them against each other on a variety of metrics. Sagemaker Experiments does exactly that: it lets you organize, track, compare, and evaluate NLP models very easily. In this article we will pit two NLP models against each other and compare their performance.
All the code is available in this Github repository.
Data Preparation
The data preparation for this article can be found in this Python script. We will use the IMDB dataset from Huggingface, which is a dataset for binary sentiment classification. The data preparation is pretty standard; the only thing to note is that we need to tokenize the data separately for each model. We then store the data in S3, one folder per model.
The models we are comparing in this article are distilbert-base-uncased and distilroberta-base. Of course, Sagemaker Experiments is not limited to two models and allows you to track and compare many NLP models at once.
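As a rough illustration, the sketch below shows what the per-model tokenization and S3 upload could look like. The variable names (model_list, s3_prefix_orig, sess) mirror the training code further below, but the S3 prefix and the preprocessing details are assumptions; the actual script in the repository may differ.

# Sketch of the per-model tokenization and S3 upload. Variable names mirror the
# training code below; the S3 prefix and preprocessing details are assumptions.
import sagemaker
from datasets import load_dataset
from transformers import AutoTokenizer

sess = sagemaker.Session()
s3_prefix_orig = "nlp-experiments/imdb/"  # hypothetical prefix

model_list = ["distilbert-base-uncased", "distilroberta-base"]

train_dataset = load_dataset("imdb", split="train")
test_dataset = load_dataset("imdb", split="test")

for model_name in model_list:
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    def tokenize(batch):
        return tokenizer(batch["text"], padding="max_length", truncation=True)

    # each model gets its own tokenized copy of the data
    train_tokenized = train_dataset.map(tokenize, batched=True)
    test_tokenized = test_dataset.map(tokenize, batched=True)

    # save locally, then upload to one S3 folder per model
    local_dir = f"./data/{model_name}"
    train_tokenized.save_to_disk(f"{local_dir}/train")
    test_tokenized.save_to_disk(f"{local_dir}/test")

    s3_prefix = s3_prefix_orig + model_name
    sess.upload_data(path=f"{local_dir}/train", key_prefix=f"{s3_prefix}/train")
    sess.upload_data(path=f"{local_dir}/test", key_prefix=f"{s3_prefix}/test")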
Metric definitions
First, it is important to understand how Sagemaker Experiments collects the metrics that we will then use to compare the models. The values for these metrics are gathered from the logs produced during model training, which usually means that the training script has to write them out explicitly.
In our example, we will use Huggingface’s Trainer object, which takes care of writing the metrics into the log for us. All we have to do is define the metrics in the training script. The Trainer object will then automatically write them out into the training log (note that the loss metric is written out by default and that all metrics carry the prefix “eval_”):
# Evaluation metrics computed by the Trainer after each evaluation run
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average="binary")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}
Defining the evaluation metrics
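Inside the training script, this compute_metrics function is handed to the Trainer. The following is only a minimal sketch of that wiring; it assumes the script has already parsed its hyperparameters into args and loaded the tokenized train and test datasets from the input channels, and the exact arguments in the repository’s train.py may differ:

# Minimal sketch of the Trainer setup inside the training script. The variables
# args, train_dataset, and test_dataset are assumed to come from the script's
# argument parsing and data loading; the repository's train.py may differ.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSequenceClassification.from_pretrained(args.model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(args.model_name)

training_args = TrainingArguments(
    output_dir=args.output_dir,           # e.g. /opt/ml/model inside the training container
    num_train_epochs=args.epochs,
    per_device_train_batch_size=args.train_batch_size,
    evaluation_strategy="epoch",          # evaluate (and log the eval_* metrics) each epoch
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,      # the function defined above
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=tokenizer,
)

trainer.train()
trainer.evaluate()  # writes a dict like {'eval_loss': ..., 'eval_accuracy': ...} to the log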
Evaluation metrics in the training logs
That means we can capture these metrics during the training job via regular expressions, which we can define as follows:
metric_definitions = [
    {"Name": "test:loss", "Regex": "\'eval_loss\': (.*?),"},
    {"Name": "test:accuracy", "Regex": "\'eval_accuracy\': (.*?),"},
    {"Name": "test:f1", "Regex": "\'eval_f1\': (.*?),"},
    {"Name": "test:precision", "Regex": "\'eval_precision\': (.*?),"},
    {"Name": "test:recall", "Regex": "\'eval_recall\': (.*?),"},
]
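As a quick sanity check, we can run these regular expressions against a log line of the form shown above. The sample line and its values below are made up for illustration:

import re

# Illustrative log line as the Trainer prints it at evaluation time (values are made up)
sample_log_line = (
    "{'eval_loss': 0.21, 'eval_accuracy': 0.93, 'eval_f1': 0.92, "
    "'eval_precision': 0.94, 'eval_recall': 0.91, 'epoch': 2.0}"
)

for definition in metric_definitions:
    match = re.search(definition["Regex"], sample_log_line)
    if match:
        print(definition["Name"], "->", match.group(1))
# test:loss -> 0.21
# test:accuracy -> 0.93
# ...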
We will pass these metric definitions to the estimator we create further below. Sagemaker will then capture the metrics for each training job, which is what allows us to compare the different NLP models.
Running a Sagemaker Experiment
To organize and track the models we need to create a Sagemaker Experiment object:
import boto3
from smexperiments.experiment import Experiment

sm = boto3.client('sagemaker')

nlp_experiment = Experiment.create(
    experiment_name="nlp-classification",
    description="NLP Classification",
    sagemaker_boto_client=sm,
)
Once that is done, we can kick off the training. We use ml.p3.2xlarge instances for the Sagemaker training jobs, which complete the fine-tuning in about 30 minutes. Note that we create a Trial object for each training job. These trials get associated with the experiment we created above, which will allow us to track and compare the models:
import time

from sagemaker.huggingface import HuggingFace
from smexperiments.trial import Trial

# loop over models
for model_name in model_list:
    trial_name = f"nlp-trial-{model_name}-{int(time.time())}"

    # create a trial that will be attached to the experiment
    nlp_trial = Trial.create(
        trial_name=trial_name,
        experiment_name=nlp_experiment.experiment_name,
        sagemaker_boto_client=sm,
    )

    hyperparameters = {'epochs': 2,
                       'train_batch_size': 32,
                       'model_name': model_name
                       }

    huggingface_estimator = HuggingFace(entry_point='train.py',
                                        source_dir='./scripts',
                                        instance_type='ml.p3.2xlarge',
                                        instance_count=1,
                                        role=role,
                                        transformers_version='4.6',
                                        pytorch_version='1.7',
                                        py_version='py36',
                                        hyperparameters=hyperparameters,
                                        metric_definitions=metric_definitions,
                                        enable_sagemaker_metrics=True,)

    nlp_training_job_name = f"nlp-training-job-{model_name}-{int(time.time())}"

    s3_prefix = s3_prefix_orig + model_name
    training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
    test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'

    huggingface_estimator.fit(
        inputs={'train': training_input_path, 'test': test_input_path},
        job_name=nlp_training_job_name,
        experiment_config={
            "TrialName": nlp_trial.trial_name,
            "TrialComponentDisplayName": "Training",
        },
        wait=False,
    )
The code above kicks off two training jobs (one for each model) in parallel. However, if that is not possible (for example, because the number of training instances is limited in your AWS account), you can also run the training jobs sequentially. As long as they are associated with the same experiment via their Trial objects, you will be able to evaluate and compare the models.
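Because we launch the jobs with wait=False, the notebook gets control back immediately. If you want to poll the job status from the notebook, a small helper like the one below can do it; training_job_names is a hypothetical list into which the job names from the loop above have been collected, and the status values come from the DescribeTrainingJob API:

import time

# Hypothetical helper: poll the launched training jobs until they all finish.
# Assumes the job names created in the loop above were collected into a list.
def wait_for_jobs(training_job_names, poll_seconds=60):
    while True:
        statuses = {
            name: sm.describe_training_job(TrainingJobName=name)["TrainingJobStatus"]
            for name in training_job_names
        }
        print(statuses)
        if all(status in ("Completed", "Failed", "Stopped") for status in statuses.values()):
            break
        time.sleep(poll_seconds)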
Comparing the models
After around 30 minutes both models have been trained, and it is time to retrieve the results:
from sagemaker.analytics import ExperimentAnalytics
from sagemaker.session import Session

trial_component_analytics = ExperimentAnalytics(
    sagemaker_session=Session(sess, sm),
    experiment_name="nlp-classification",
)

df_results = trial_component_analytics.dataframe()
The resulting dataframe holds all the information required to compare the two models. For example, we can retrieve the average values for all the metrics we defined.
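The snippet below is a sketch of that lookup. It assumes the metric columns follow the ExperimentAnalytics naming convention of suffixing each metric with its aggregation (e.g. “test:f1 - Avg”), so the exact column names in your dataframe may differ:

# Column names assumed to follow the "<metric> - Avg" convention of ExperimentAnalytics
avg_columns = [col for col in df_results.columns if col.endswith("- Avg")]
print(df_results[["TrialComponentName"] + avg_columns])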
We can see that distilroberta-base performed slightly better with respect to recall and distilbert-base-uncased performed better with respect to F1 score, precision, and accuracy. There are many more columns in the dataframe which I will leave to the reader to explore further.
Conclusion
In this article we created a Sagemaker Experiment to track and compare NLP models. We created a Trial for each model and collected various evaluation metrics. After the models had been fine-tuned, we were able to access these metrics via a Pandas dataframe and compare the models in a convenient way.
Article originally posted here by Heiko Hotz. Reposted with permission.
About the author: Heiko Hotz is a Senior Solutions Architect for AI & Machine Learning at AWS with over 20 years of experience in the technology sector. He focuses on Natural Language Processing (NLP) and helps AWS customers to be successful on their NLP journey.