Wednesday, January 29, 2025

Many Models Training with Hyperparameter Optimization

This article presents an approach for training multiple machine learning models while automatically optimizing the hyperparameters of each model with Azure Machine Learning. Before explaining how to do this, let’s first look at the motivation. I will use demand forecasting as the example, a scenario in which this type of approach is common.

Demand forecasting techniques are widely used by companies that need to predict supply and demand, for example, forecasting sales of products in stores or predicting home energy consumption. Accurate consumption forecasts are essential for sound decisions about replenishing products in stock, negotiating contracts with suppliers, investing in new factories, and planning delivery logistics. Demand forecasting practitioners can apply machine learning algorithms to learn patterns from historical data and produce a model capable of predicting the demand in a given period from the information of previous periods.

In most business scenarios, the demand forecast you want is the projected demand for a resource in a particular location. For example, in a supply chain, you may want to forecast the consumption of a specific product in a specific store. In these cases, a good practice for improving forecast performance is to train a model for each store instead of a single general model, which means we will have many models to train.
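The idea of one model per store and product pair can be sketched in a few lines of plain Python. This is only an illustration: the sales records, the `train_many_models` helper, and the use of a historical mean as the "model" are all assumptions made for this example, not part of the solution discussed in this article.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records: (store, product, units_sold).
sales = [
    ("store_1", "apples", 10), ("store_1", "apples", 14),
    ("store_1", "bread", 30),
    ("store_2", "apples", 3), ("store_2", "apples", 5),
    ("store_2", "bread", 22),
]


def train_many_models(records):
    """Fit one trivial 'model' (the historical mean) per (store, product) pair."""
    history = defaultdict(list)
    for store, product, units in records:
        history[(store, product)].append(units)
    # A real project would fit a forecasting algorithm here instead of a mean.
    return {key: mean(values) for key, values in history.items()}


models = train_many_models(sales)
print(len(models))                    # number of independent models
print(models[("store_1", "apples")])  # forecast learned from that pair only
```

Each model sees only the history of its own pair, which is what makes the training tasks independent and easy to run in parallel later on.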

The performance of a model, among other things, is directly related to its training, the stage of machine learning in which one seeks to learn the best parameters of the model. However, before the training process, some parameters must be established. These parameters are known as hyperparameters and are often defined by data scientists based on their practical experience. In the case of the well-known learning algorithm Gradient Boosting, for example, some of the hyperparameters that must be defined before running it are the learning rate and the number of estimators.

A typical problem is choosing the best hyperparameters to train your model. Several hyperparameter optimization (HPO) techniques have been developed over time, such as random search, grid search, and Bayesian optimization. This article is not intended to describe how these methods work but how to use them to train multiple models.
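Although the inner workings of these methods are out of scope, a minimal sketch may help fix the vocabulary. The example below compares grid search and random search over a toy objective. The `validation_error` function is a made-up stand-in for "train a model and measure its error", and the search space values are arbitrary assumptions.

```python
import itertools
import random


def validation_error(learning_rate, n_estimators):
    """Toy stand-in for training a model and returning its validation error."""
    return (learning_rate - 0.1) ** 2 + ((n_estimators - 200) / 1000) ** 2


# Hypothetical search space for two Gradient Boosting hyperparameters.
search_space = {
    "learning_rate": [0.01, 0.05, 0.1, 0.3],
    "n_estimators": [50, 100, 200, 400],
}


def grid_search(space):
    """Evaluate every combination in the search space."""
    names = list(space)
    candidates = (
        dict(zip(names, values))
        for values in itertools.product(*(space[name] for name in names))
    )
    return min(candidates, key=lambda params: validation_error(**params))


def random_search(space, n_trials, seed=0):
    """Evaluate only n_trials randomly sampled combinations."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_trials)
    ]
    return min(candidates, key=lambda params: validation_error(**params))


best_grid = grid_search(search_space)
best_random = random_search(search_space, n_trials=6)
print(best_grid, best_random)
```

Grid search evaluates all 16 combinations, while random search spends a fixed budget of trials and may or may not land on the best combination.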

Putting the Idea into Practice

To illustrate how to optimize hyperparameters in training multiple models, I will use Azure Machine Learning, also known as Azure ML, a complete service for you to run and manage your machine learning project. Look at the service’s website to get a comprehensive view and learn about Azure ML. This section will discuss the main elements needed to train multiple models with hyperparameter optimization.

In Azure Machine Learning, you use a workspace, a centralized place to work on your code, data, and computing resources for training and inference. It’s also in the workspace where you keep the history of training runs and record the best performing models to make inferences.

Typically, the data you use in your Jobs for training and evaluating models is stored in some cloud storage service, for example, Azure Blob Storage. The datastores in an Azure Machine Learning workspace represent the links to these services. The data stored is accessed in your workspaces through data assets, elements that reference the data stored in the datastore. Data assets make accessing, reusing, and versioning your data more manageable.

Another essential element in an Azure ML workspace is the pipeline. It represents a sequence of actions you perform in your data science workflow. A training pipeline, for example, typically contains steps for data processing, model training, and model evaluation. Pipelines help you standardize and automate your tasks, and a Job represents one run of a pipeline in Azure ML. Figure 1 gives an overview of a workspace and the elements used to train many models in Azure Machine Learning. Notice that each store and product pair has its own training dataset, job, and model. In this case, the training is performed in parallel by child jobs, each called a Many models train job, as explained later.

Figure 1: Elements for training many models in Azure ML.

In the workspace, you can create a compute cluster to run your training process in parallel on the different nodes and cores available in the cluster. The cluster is an essential element in our case: since we are talking about many models, possibly thousands, it is crucial to be able to train them simultaneously.

To run the training in parallel, we will use a component available in Azure ML, the ParallelRunStep. It executes a task in parallel; in our case, each parallel run trains one of the many models and is a child job of the ParallelRunStep. Figure 2 shows, in general terms, how a pipeline with a ParallelRunStep executes. Azure ML distributes the child jobs (Many models train job) across the cores of the compute cluster. In this example, the cluster has two nodes with four cores each.
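As a local analogy only, the fan-out performed by ParallelRunStep resembles mapping a training function over independent groups with a worker pool. The sketch below uses Python's standard `ThreadPoolExecutor`, not the Azure ML API. The datasets, the `train_one` function, and the mean-based "model" are hypothetical, and the worker count of 8 simply mirrors the two nodes with four cores each in the example.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical training history per (store, product) pair.
datasets = {
    ("store_1", "apples"): [10, 14, 12],
    ("store_1", "bread"): [30, 28],
    ("store_2", "apples"): [3, 5, 4],
}


def train_one(item):
    """Train a single model; a historical mean stands in for real training."""
    key, history = item
    return key, mean(history)


def train_all(groups, max_workers):
    """Run one independent training task per group on a pool of workers."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(train_one, groups.items()))


trained = train_all(datasets, max_workers=8)  # 2 nodes x 4 cores in the example
print(trained)
```

In Azure ML the workers are processes spread over cluster nodes rather than local threads, but the shape of the computation is the same: independent tasks, one per model.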

Figure 2: Many Models Parallel Training.

Now that I’ve covered the components needed to train multiple models in parallel, an additional element to include in the architecture is HyperDrive, a hyperparameter optimization framework that helps produce better models while reducing training time and resources. HyperDrive performs multiple trials, and in each one it trains the model with a specific combination of hyperparameter values and measures its performance against a particular metric, for example, accuracy. In the end, it tells you the best hyperparameter values to use based on the metric you want to optimize. In our example project, we will use HyperDrive to optimize the hyperparameters of each of the many models based on the MAE (Mean Absolute Error) metric.
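To make the metric concrete, here is a small sketch of how MAE is computed and how the best trial would be selected when the goal is to minimize it. The actual values, the trial names, and the predictions are invented for illustration; HyperDrive does this bookkeeping for you.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical demand values and the predictions produced by three trials.
actual = [100, 120, 90, 110]
trial_predictions = {
    "learning_rate=0.01": [90, 100, 95, 100],
    "learning_rate=0.1": [98, 118, 93, 112],
    "learning_rate=0.3": [110, 130, 80, 120],
}

scores = {
    name: mean_absolute_error(actual, predicted)
    for name, predicted in trial_predictions.items()
}
best_trial = min(scores, key=scores.get)  # lower MAE is better
print(scores, best_trial)
```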

HyperDrive also does parallel processing, so the trials are distributed across the different cores of the compute cluster. Figure 3 illustrates the same view presented earlier, now including HyperDrive. Each ParallelRunStep task is responsible for training one of the many models, and it runs HyperDrive to optimize the hyperparameters of the corresponding model. Each trial run by HyperDrive is a job in Azure ML.

Figure 3: Parallel Training of Many Models with HPO.

The distribution of jobs across the nodes shown in Figure 3 is merely illustrative. In practice, the Azure ML scheduler decides this job allocation.

Boosting the Many Models Solution Accelerator with HyperDrive

The good news is that Microsoft’s Early Access Engineering team has already developed a publicly available asset called the Many Models Solution Accelerator. This asset does most of the work needed to train many models. Its example is based on predicting the number of products sold in grocery stores, and each model in the example corresponds to a store and product pair.

If you only want to train the many models, follow the getting started guide in the official solution accelerator repository. However, to train the many models with hyperparameter optimization, which is the focus of this article, you will need some additional steps. 

I forked the solution accelerator repository and made some additions to support many models training with HPO. I will explain how to do this from the beginning. You need access to an Azure subscription and should be familiar with running Jupyter notebooks in Python.

Initially, it is necessary to deploy the resources to Azure. Just click this button to run an ARM template to create resources in your Azure subscription.

Figure 4 shows an example of how to fill in the template form to create resources in Azure. Fill in the values that fit your case, such as the Azure region in which to run your resources.

Figure 4: Filling in the template form to create the resources.

After the resource creation, there are four essential steps to train the many models with HPO. The first is creating the development environment through the instructions in EnvironmentSetup.md. After finishing this step, you will have created a compute instance to run the notebooks in your Azure ML workspace and cloned the solution accelerator code into the instance. The Notebooks section of your workspace should look like the example in Figure 5.

Figure 5: Azure ML Workspace.

The second step is configuring the Azure ML workspace, during which you deploy a compute cluster for training. You do this by executing the steps in the notebook 00_Setup_AML_Workspace.ipynb. When creating the cluster, you can choose the most convenient VM size for your project.

In the third step, you prepare the data for training and validating the models. You do this by performing the steps in the notebook 01_Data_Preparation.ipynb, where you download the data and then register it as a data asset in the workspace. 

After completing the environment setup and the data preparation steps, you are ready to train the models. At this point, you can use an AutoML approach or a custom training script. We will choose the latter so that we can pick our own training algorithm and use HyperDrive. To do that, we will use the notebook 02_CustomScript_HPO_Training_Pipeline.ipynb, an adapted version of the notebook 02_CustomScript_Training_Pipeline.ipynb from the original GitHub repo. The new one comes with all the instructions and comments necessary for training with HyperDrive.

Once the pipeline is running, you can monitor the execution of your Jobs by clicking the Azure ML Jobs menu option. You will see a screen like the one presented in Figure 6.

Figure 6: Monitoring jobs execution.

To view all Jobs, remember to select the include child Jobs option when accessing the Azure ML Jobs menu, as shown in Figure 7.

Figure 7: Include child jobs option.

Figure 6 shows several job types. The Pipeline type is the parent job that starts the whole training process. The Pipeline step is the ParallelRunStep stage in the pipeline. This step initiates several training tasks in parallel, each corresponding to a job of the azureml.ManyModelsCustomTrain type. Each of these jobs trains one of the many models and starts a HyperDrive HPO process, a Command-type job. Finally, each trial run by the HPO process corresponds to a sweep job.

Congratulations! After completing this step, you will have finished training the many models with HPO.

Discussion and Next Steps

The notebook where you start the training process is well documented and explains all the training steps. The Python scripts are also self-explanatory. So, to avoid being repetitive, I will comment on just a few code fragments to help you better understand how the parallelism configuration works. The following fragment, part of the 02_CustomScript_HPO_Training_Pipeline.ipynb notebook, is where the parallel run configuration is defined:

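# Import needed when running this fragment outside the notebook.
from azureml.pipeline.steps import ParallelRunConfig

# timeout, train_env, and compute are expected to be defined beforehand.
parallel_run_config = ParallelRunConfig(source_directory='./scripts',
                                        entry_script='train.py',
                                        mini_batch_size="1",
                                        run_invocation_timeout=timeout,
                                        error_threshold=-1,  # -1: ignore failures
                                        output_action="append_row",
                                        environment=train_env,
                                        process_count_per_node=2,  # processes per node
                                        compute_target=compute,
                                        node_count=2,  # cluster nodes used
                                        run_max_try=3)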

The notebook describes this configuration, but two essential elements are worth commenting on: node_count and process_count_per_node. The first indicates the number of compute cluster nodes that ParallelRunStep will use to run jobs in parallel.

The second element, process_count_per_node, corresponds to the number of processes executed simultaneously on each node. We usually set it to the number of cores of the node. In our case, since each ParallelRunStep job starts its own HPO process in which more than one trial runs in parallel, we left some cores available for the HyperDrive trials. Because each cluster node has four cores, we set a maximum of two processes per node.
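The arithmetic behind this choice can be written down as a small helper. This is a back-of-the-envelope sketch, not an Azure ML API: the `parallelism_budget` function and its field names are hypothetical, and `max_concurrent_runs` refers to the HyperDrive setting that caps how many trials each HPO process runs at once. The numbers are upper bounds implied by the configuration; the actual placement of jobs is decided by the Azure ML scheduler.

```python
def parallelism_budget(node_count, cores_per_node,
                       process_count_per_node, max_concurrent_runs):
    """Summarize the upper bounds implied by the parallelism settings."""
    training_processes = node_count * process_count_per_node
    return {
        "total_cores": node_count * cores_per_node,
        # Models being trained at the same time across the cluster.
        "training_processes": training_processes,
        # Cores per node not occupied by a training process.
        "cores_left_per_node": cores_per_node - process_count_per_node,
        # Trials that all HPO processes together may request at once.
        "max_concurrent_trials": training_processes * max_concurrent_runs,
    }


# Values used in this article: 2 nodes, 4 cores each, 2 processes per node,
# and at most 2 concurrent trials per HPO process.
budget = parallelism_budget(node_count=2, cores_per_node=4,
                            process_count_per_node=2, max_concurrent_runs=2)
print(budget)
```

Running the helper with other values is a quick way to check whether a configuration leaves any cores free for the trials before you submit the pipeline.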

The train.py program is responsible for training each model. It contains the code fragment below, which performs the HyperDrive configuration. An essential element is max_concurrent_runs, which defines the maximum number of trials running in parallel in each HPO process.

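from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

# script_config and params are expected to be defined earlier in train.py.
hyperdrive = HyperDriveConfig(run_config=script_config,
                              hyperparameter_sampling=params,
                              policy=None,  # no early termination policy
                              primary_metric_name='mae',
                              primary_metric_goal=PrimaryMetricGoal.MINIMIZE,
                              max_total_runs=6,  # at most six trials
                              max_duration_minutes=10,  # time limit for the HPO process
                              max_concurrent_runs=2)  # trials running in parallel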

The max_total_runs and max_duration_minutes elements limit how long HyperDrive keeps running trials. The first sets the maximum number of trials, while the second sets a time limit in minutes; whichever limit is reached first ends the optimization process.
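The "whichever comes first" rule can be illustrated with a simplified, sequential loop. This is not how HyperDrive is implemented (it runs trials concurrently and manages them as jobs); the `run_trials` function, its parameters, and the injectable `clock` are assumptions made for this sketch.

```python
import time


def run_trials(evaluate, candidates, max_total_runs, max_duration_seconds,
               clock=time.monotonic):
    """Run trials until the trial budget or the time budget is exhausted."""
    start = clock()
    results = []
    for params in candidates:
        if len(results) >= max_total_runs:
            break  # trial budget reached first
        if clock() - start >= max_duration_seconds:
            break  # time budget reached first
        results.append((params, evaluate(params)))
    return results


# Ten hypothetical candidates, but at most six trials within ten minutes.
results = run_trials(
    evaluate=lambda learning_rate: abs(learning_rate - 0.1),
    candidates=[0.01 * i for i in range(1, 11)],
    max_total_runs=6,
    max_duration_seconds=10 * 60,
)
print(len(results))
```

Because the toy evaluation is instantaneous, the trial budget is what stops this run; with slow trials, the time limit would end the loop earlier.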

Several trials are performed during the HPO process to find the best hyperparameter values for each of the many models. You can look at the HPO process results as shown in Figure 8. It shows the hyperparameter optimization trials for one of the models. In the left chart, you have the MAE for each trial, and on the right, there is the hyperparameter combination used on each trial and the respective MAE value. You can see a significant difference in MAE values depending on the combination of hyperparameters. Here is where HyperDrive shows its value to your training process.

Figure 8: Trials Analysis.

We applied the approach in a basic example to show how to do Many Models Training with HPO on Azure ML. This technique can help you achieve better forecasting performance than manually defining each model’s hyperparameters. It comes with the additional cost of running the trials, so use it wisely to balance cost against performance when applying the method to your project. If you want to take it further, I suggest you consider some improvements to the method shown here. One is testing other learning algorithms, such as Extreme Gradient Boosting (XGBoost), to achieve better forecasting performance. Another is using a model validation approach based on k-fold cross-validation to get a more reliable estimate of model performance and reduce the risk of overfitting.

Also, consider using more than one cluster, for example, running the HyperDrive trials on a cluster with higher processing capacity if the HPO process takes too long.
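Returning to the k-fold cross-validation suggestion, the following sketch shows the basic mechanics in plain Python. The `k_fold_indices` and `cross_validated_mae` helpers, the demand values, and the mean-based baseline are assumptions for illustration only. For time-ordered data such as demand history, you may prefer a split that respects chronological order; this example uses plain contiguous folds.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous (training, validation) folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Distribute the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        training = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((training, validation))
        start = stop
    return folds


def cross_validated_mae(values, k):
    """Average validation MAE of a mean-based baseline across k folds."""
    fold_errors = []
    for training, validation in k_fold_indices(len(values), k):
        forecast = sum(values[i] for i in training) / len(training)
        errors = [abs(values[i] - forecast) for i in validation]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)


demand = [100, 120, 90, 110, 105, 95, 115, 130, 85, 100]  # hypothetical history
print(cross_validated_mae(demand, k=5))
```

In an HPO setting, a score like this would replace the single validation MAE reported by each trial, at the cost of training each candidate k times.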

Another important point is that the steps taken in this article focus on training only. For inferencing, look at the notebook 03_CustomScript_Forecasting_Pipeline.ipynb, also available in the Solution Accelerator repo.

Thanks for reading, and good luck with your machine learning projects.
