
MLOps: Monitoring and Managing Drift

Editor’s note: Oliver Zeigermann is a speaker for ODSC West 2023 this Fall. Be sure to check out his talk, “MLOps: Monitoring and Managing Drift,” there!

The trouble with machine learning starts after you put your model into production. 

Typically, you want to bring something into production that gives you a clear benefit on the business side. Because, if it doesn’t, why bother? So, we need to monitor the performance of our model in production whenever there is even the slightest chance that the world our model describes might change. There may be exceptions, but I will assume that this is almost always the case.

How to monitor?

In figure 1, you can see a pretty typical setup for monitoring relevant data about your model. The obvious part on the left side is the prediction service using the trained model. This service forwards input and output data to a monitoring service, which likely runs as a separate process. The statistical magic happens inside this monitoring service, which exposes its results as a metrics endpoint. Evidently has proven to do a decent job of calculating such metrics. A database polls that endpoint and records the metrics over time. Eventually, that information is displayed on some sort of dashboard, and an alerting system is put into action on either the database or the dashboard service. Prometheus and Grafana are the default solutions for the database and the dashboard, respectively.

Figure 1: High-Level Architecture for ML Monitoring
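To make this concrete, here is a minimal sketch of what the metrics endpoint on the monitoring side could look like, using the Python client library prometheus_client. For brevity, it compresses the monitoring service and the metrics endpoint into one script; the metric names, the port, and the dummy values are illustrative assumptions, not part of any specific setup.

    # Minimal sketch of the metrics endpoint from figure 1, using prometheus_client.
    # Metric names, port, and the dummy values are illustrative.
    import time
    from prometheus_client import Gauge, start_http_server

    # One gauge per monitored quantity; Prometheus polls /metrics and stores the history,
    # Grafana plots it and can trigger alerts.
    age_drift = Gauge("input_age_drift_score",
                      "Drift score for the input feature 'age' vs. the training reference")
    positive_rate = Gauge("prediction_positive_rate",
                          "Share of positive predictions in the current window")

    def publish(drift_score: float, share_positive: float) -> None:
        # Called by the monitoring logic (e.g. an Evidently-based job) after each window.
        age_drift.set(drift_score)
        positive_rate.set(share_positive)

    if __name__ == "__main__":
        start_http_server(8000)   # exposes http://localhost:8000/metrics for Prometheus to poll
        publish(0.02, 0.31)       # dummy values just to keep the sketch runnable
        time.sleep(3600)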

To monitor the performance, it would be most convenient to simply track the metric you used for training the model. This comes with a catch, though: to do so, you need to know the ground truth. There are cases where you immediately get some kind of feedback on whether your prediction was a good one, but often enough you never get it at all, or only in small quantities and with a significant delay.

Since we assume that your machine learning service has an impact on your business, you want to know as early as possible if its performance is degrading. A surrogate metric can help here: one example is the similarity between the distribution of the data used during training and the distribution of the data you see during prediction. This can apply to input data, output data, or both.

Figure 2: Distributions of reference and production data are similar

Figure 2 shows such a comparison where the two distributions are reasonably similar, while figure 3 shows a deviation. You can compute a numerical representation of the similarity using a statistical test, such as those provided by Evidently and other monitoring libraries.

Such surrogate metrics will never be as accurate as comparing your predictions to the ground truth, but might be the best you can do.

Figure 3: Distributions of reference and production data diverge
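As a minimal illustration of such a test, the sketch below compares a reference sample and a production sample of a single feature with a two-sample Kolmogorov-Smirnov test from SciPy. The 'age' feature and both samples are synthetic and made up for this example; libraries like Evidently wrap tests of this kind for all features at once.

    # A two-sample Kolmogorov-Smirnov test as one possible similarity measure between
    # the reference (training-time) and production distributions of a single feature.
    # The 'age' feature and both samples are synthetic, purely for illustration.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(42)
    reference_age = rng.normal(loc=40, scale=10, size=5_000)    # what the model was trained on
    production_age = rng.normal(loc=44, scale=10, size=1_000)   # what we currently see in production

    result = ks_2samp(reference_age, production_age)
    print(f"KS statistic: {result.statistic:.3f}, p-value: {result.pvalue:.4f}")

    # A small p-value suggests the two samples are unlikely to come from the same
    # distribution -- a hint (not proof) that the inputs have drifted.
    if result.pvalue < 0.05:
        print("Possible drift in 'age' -- plot both distributions before acting on it.")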

Things look off, what to do now?

Even though this is more of an art than a craft, in any case you should look at the distribution of the production data by plotting it against the reference distribution. The score coming out of your statistical test is just a single value, but there is so much more to be seen in the shape of a distribution. What do you see? Here are a few rules of thumb:

  1. Is it basically the same shape, just drifting to one side? In the age example, this would mean your audience is getting older or younger in general. How big is the effect? How important is the age feature for the prediction? Would you even be able to train a new model without the feature that drifted?
  2. Do you see a complete subgroup forming beyond the original shape? To see this, you might have to look at more than one dimension at the same time. In this case, you can be pretty sure that your model does not perform well on that data, as models rarely extrapolate well to data outside the range of the training data. To prevent damage, a quick way of handling this is to exclude that subgroup from the model’s predictions and fall back to a different system, like a simple rule-based approach (a minimal sketch of such a guard follows this list).
  3. You might see a change in the output, i.e. in the predicted values. Suppose you have a binary model and the percentage of good cases goes up and up. This can be a real danger for your business, and your model might have been targeted by an adversarial attack. In such a case, you must replace that model quickly. If you assume an adversarial attack, simply retraining your model – even with the old data – might fix the issue, as attackers might no longer be able to exploit the random properties of your model. You might try this first, or replace the model completely with a fallback system that might be as radical as going back to manual classification.
  4. Sometimes a statistical test reports a big difference, but the plot shows it is only a shift within the original range of your distribution. This may mean that your model is general enough to handle the change in the input distribution. If you have at least a little bit of newly collected ground truth data, a validation run with that data might confirm or push back on that assumption (see the validation sketch after this list).
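As a rough illustration of the fallback idea from point 2, here is a minimal sketch of a guard that routes out-of-range inputs to a simple rule instead of the model. The feature bounds, the fallback rule, and the scikit-learn-style model interface are illustrative assumptions, not something the article prescribes.

    # Minimal guard: use the ML model only for inputs inside the range seen during
    # training, and fall back to a hand-written rule otherwise. Feature bounds,
    # the fallback rule, and the model interface are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class FeatureBounds:
        lower: float
        upper: float

    # Ranges observed in the training data, e.g. taken from its min/max or quantiles.
    TRAINING_BOUNDS = {
        "age": FeatureBounds(18.0, 75.0),
        "income": FeatureBounds(0.0, 250_000.0),
    }

    def rule_based_fallback(row: dict) -> int:
        # Stand-in for a simple, non-ML business rule.
        return 1 if row.get("income", 0.0) > 100_000 else 0

    def predict_with_guard(model, row: dict) -> int:
        # Route the request to the fallback if any feature lies outside the training range.
        for name, bounds in TRAINING_BOUNDS.items():
            value = row.get(name)
            if value is None or not (bounds.lower <= value <= bounds.upper):
                return rule_based_fallback(row)
        # Assumes a scikit-learn-style model; features are passed in the bounds' order.
        return int(model.predict([[row[name] for name in TRAINING_BOUNDS]])[0])

Whether such a guard is acceptable depends on how often it fires; if a large share of traffic ends up on the fallback, retraining is usually the better answer.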
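For point 4, a quick sanity check could look like the following: score the existing model on whatever freshly labeled data is available and compare the result with the validation score from training time. The file names, the metric, and the saved-model format are made up for this sketch.

    # Quick check: evaluate the current model on a small batch of newly labeled data
    # to see whether a shift flagged by the drift test actually hurts performance.
    # File names, metric, and model format are illustrative.
    import joblib
    import pandas as pd
    from sklearn.metrics import roc_auc_score

    model = joblib.load("model.joblib")                  # the model currently in production
    fresh = pd.read_csv("freshly_labeled.csv")           # small, recently collected ground truth
    X, y = fresh.drop(columns=["label"]), fresh["label"]

    score = roc_auc_score(y, model.predict_proba(X)[:, 1])
    print(f"ROC AUC on fresh data: {score:.3f}")         # compare with the score at training time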

Once you suspect that your model no longer performs as it did, you basically need to repeat the steps you took when you trained your model initially. However, the situation is most likely more challenging, as you might not have all the up-to-date data you would like.

My Workshop at ODSC West 2023 in San Francisco

My technical, hands-on workshop “MLOps: Monitoring and Managing Drift” covers these topics in more depth. It will be held in person at ODSC West 2023 in San Francisco. The workshop’s objective is to ensure that you are equipped with the essential knowledge and practical tools to proficiently detect and handle drift.

Link to the workshop and more details: https://odsc.com/speakers/mlops-monitoring-and-managing-drift/

About the author/ODSC West 2023 speaker:

Oliver Zeigermann works as an AI engineer from Hamburg, Germany. He has been developing software with different approaches and programming languages for more than three decades. In the past decade, he has been focusing on machine learning and its interactions with humans.
