Optimized Retraining Guide for MLOps

In general, it is important to clearly understand your business requirements and the problem you are trying to solve when determining the best approach to automate the retraining of an active machine learning model. It is also important to continuously monitor the performance of the model and make adjustments to the retraining cadence and metrics as needed.


Approaches to automatic model retraining

  • Fixed: Retraining at a set cadence (e.g., daily, weekly, monthly)
  • Dynamic: Retraining triggered ad hoc when model performance metrics cross a threshold
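
As a rough illustration of how the two approaches differ, here is a minimal sketch in Python; the interval, accuracy floor, and function name are assumptions, not part of any particular platform:

```python
from datetime import datetime, timedelta

# Assumed thresholds -- tune these to your business requirements.
FIXED_INTERVAL = timedelta(days=7)   # fixed: retrain weekly
ACCURACY_FLOOR = 0.90                # dynamic: retrain if accuracy drops below this

def should_retrain(last_trained: datetime, current_accuracy: float) -> bool:
    """Return True when either retraining policy fires."""
    fixed_trigger = datetime.utcnow() - last_trained >= FIXED_INTERVAL
    dynamic_trigger = current_accuracy < ACCURACY_FLOOR
    return fixed_trigger or dynamic_trigger
```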

This whole process can be deployed in two environments:

  • Cloud: The most common option; it offers great flexibility
  • Edge: Ideal for use cases requiring privacy, security, or low latency

Retraining strategy:

Automating the retraining of a machine learning model can be a complex task, but there are some best practices that can help guide the design.

1. Metrics to trigger retraining

The metrics used to trigger retraining will depend on the model and its usage. Each metric needs a threshold; when model performance falls below it, retraining is triggered.

Some ideal metrics to trigger model retraining are:

  • Prediction (score or label) drift
  • Performance metric degradation
  • Performance metric degradation for specific segments/cohorts
  • Feature drift
  • Embeddings drift
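
To make the first of these concrete, one common way to quantify prediction drift is the Population Stability Index (PSI). The sketch below, using only NumPy, compares a reference score distribution against live scores; the beta-distributed example data and the 0.2 alert threshold are illustrative assumptions (0.2 is a widely used rule of thumb, not a requirement):

```python
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two score distributions."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    live_pct = np.histogram(live, bins=edges)[0] / len(live)
    # Clip to avoid log(0) when a bin is empty.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    live_pct = np.clip(live_pct, 1e-6, None)
    return float(np.sum((live_pct - ref_pct) * np.log(live_pct / ref_pct)))

# Illustrative data: live scores have drifted away from the training scores.
training_scores = np.random.beta(2, 5, 10_000)
production_scores = np.random.beta(3, 4, 10_000)
if psi(training_scores, production_scores) > 0.2:
    print("Significant prediction drift detected -- trigger retraining.")
```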

2. Ensuring the new model works

The new model will have to be tested or validated before being put into production to replace the old one. Several approaches are recommended for this purpose:

  • Human review
  • Automated metric checks in the CI/CD process
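
As a minimal sketch of what an automated metric check in a CI/CD pipeline could look like (the metric names, JSON file layout, and tolerance are assumptions, not any specific tool's format):

```python
import json
import sys

TOLERANCE = 0.01  # allow the challenger to be at most 1 point worse (assumption)

def validate(champion_path: str, challenger_path: str) -> None:
    """Fail the pipeline (non-zero exit) if the new model regresses."""
    champion = json.load(open(champion_path))
    challenger = json.load(open(challenger_path))
    for metric in ("accuracy", "f1"):
        if challenger[metric] < champion[metric] - TOLERANCE:
            sys.exit(f"{metric} regressed: {challenger[metric]:.3f} "
                     f"< {champion[metric]:.3f}")
    print("All metric checks passed; model can be promoted.")

if __name__ == "__main__":
    validate("champion_metrics.json", "challenger_metrics.json")
```

A non-zero exit code fails the pipeline stage, blocking promotion of the regressed model.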

3. Promotion strategy for the new model

The promotion strategy for the new model will depend on its impact on the company. In some cases, it may be appropriate to automatically replace the old model with the new one. But in other cases, the new model may require A/B testing before replacing the old model.

Some strategies to consider for live model testing are:

  • Champion vs. Challenger: Serve production traffic to both models, but only use the prediction/response of the existing model (champion) in the application. Data from the challenger model is stored for analysis but not used.
  • A/B testing: Split the production traffic between the two models during a given experimentation period and compare key metrics at the end of the experiment to decide which model to promote.
  • Canary deployment: Start by redirecting a small percentage of production traffic to the new model. Since you are on a production path, this helps detect real problems with the new model, but limits the impact to a small percentage of users. Increase traffic to the new model until it receives 100% of the traffic.
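
A canary rollout needs a way to split traffic. One common approach, sketched below with illustrative names, is deterministic hash-based bucketing, which keeps each user on the same model across requests:

```python
import hashlib

def route_to_canary(user_id: str, canary_fraction: float) -> bool:
    """Deterministically send a stable fraction of users to the new model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < canary_fraction * 10_000

# Ramp-up plan: 1% -> 10% -> 50% -> 100% as confidence in the new model grows.
for uid in ("user-17", "user-42", "user-99"):
    target = "canary" if route_to_canary(uid, 0.01) else "champion"
    print(uid, "->", target)
```

Hashing on a stable identifier (rather than random sampling per request) avoids a user seeing responses from both models within one session.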

4. Retraining feedback loop data

Once we identify that the model needs to be retrained, the next step is to choose the right data set to retrain with. Here are some recommendations to ensure that the new training data will improve model performance.

  • If the model performs well overall but does not meet the optimal performance criteria in some segments, the new training data set should contain additional data points for these lower-performing segments. A simple up-sampling (oversampling) strategy can be used to create a new training dataset targeting these underperforming segments (see the sketch after this list).
  • If the model is trained on data from a short time interval, the training data set may not capture and represent all the patterns that will appear in live production data. To avoid this, do not train the model only on recent data; combine it with a representative sample of historical data.
  • If your model architecture follows a transfer learning design, it is sufficient to fine-tune the model on the new data during retraining, without losing the patterns it has already learned from previous training data.
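
As one way to implement the segment-focused sampling from the first bullet, here is a sketch using pandas; the column names and target fraction are assumptions:

```python
import pandas as pd

def oversample_segment(df: pd.DataFrame, segment_mask: pd.Series,
                       target_fraction: float, seed: int = 42) -> pd.DataFrame:
    """Up-sample rows of an underperforming segment until they make up
    roughly `target_fraction` of the retraining dataset."""
    segment = df[segment_mask]
    rest = df[~segment_mask]
    # Solve n / (n + len(rest)) = target_fraction for n.
    n_needed = int(target_fraction * len(rest) / (1 - target_fraction))
    boosted = segment.sample(n=n_needed, replace=True, random_state=seed)
    return pd.concat([rest, boosted]).sample(frac=1, random_state=seed)  # shuffle

# Example (illustrative): boost the 'rural' cohort to ~30% of the training set.
# train_df = oversample_segment(df, df["region"] == "rural", target_fraction=0.30)
```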

5. Measurable ROI

Measuring cost impact varies by deployment environment (cloud vs. edge).

Cloud:

While it is difficult to calculate the direct ROI of some AI tasks, the value of optimized model retraining is simple, tangible, and directly calculable. The compute and storage costs of model training jobs are often already recorded as part of cloud computing costs, and the business impact of a model can often be calculated as well.

When optimizing retraining, we consider both the retraining cost and the impact of model performance on the business ("AI ROI"). Weighing these against each other justifies, or adjusts, how often models are retrained.

Retraining cost = (compute cost per retraining job + storage cost for the new model) × retraining frequency
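
Plugging illustrative numbers into this formula shows how a drift-triggered cadence can undercut a fixed weekly one (all figures below are assumptions, not benchmarks):

```python
# All figures are illustrative assumptions, not benchmarks.
compute_cost_per_job = 40.0    # USD per retraining job
storage_cost_per_model = 2.0   # USD per stored model version

def monthly_retraining_cost(jobs_per_month: int) -> float:
    return (compute_cost_per_job + storage_cost_per_model) * jobs_per_month

fixed_weekly = monthly_retraining_cost(4)   # fixed cadence: ~4 jobs/month
dynamic = monthly_retraining_cost(1)        # drift-triggered: e.g., 1 job/month
print(f"Fixed: ${fixed_weekly:.2f}, dynamic: ${dynamic:.2f}, "
      f"savings: ${fixed_weekly - dynamic:.2f}/month")
```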

Edge:

Edge retraining can have advantages, such as data privacy and reduced latency, as data does not have to be transmitted over a network and can remain on the device. In addition, Edge retraining may be necessary to adapt the model to changes in the environment.

The cost of retraining machine learning models on the Edge depends on several factors, such as the size and complexity of the model, the quantity and quality of the available data, the processing capacity of the Edge Processing Unit (EPU), and the cost of power.

In general, retraining machine learning models on the Edge can be more expensive than doing so in the cloud, due to the resource limitations of the EPU and, in distributed setups, the need to move data between nodes, which can be slow and costly. In addition, machine learning models often require large amounts of training data, which can demand significant storage on the Edge device.

However, there are also techniques and tools to reduce the cost of retraining on the Edge: federated learning, so that raw data stays on the device and only model updates are shared; transfer learning, to take advantage of pre-trained models; optimizing models for low-power devices; and carefully selecting training data to reduce the size of the required data set.
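
As one example of optimizing a model for a low-power device, the sketch below applies post-training quantization with TensorFlow Lite; it assumes TensorFlow is installed and uses a stand-in Keras model rather than anything Barbara-specific:

```python
import tensorflow as tf

# Stand-in for your trained model; replace with the real one.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)),
                             tf.keras.layers.Dense(8, activation="relu"),
                             tf.keras.layers.Dense(1)])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()

# A quantized model is typically ~4x smaller and cheaper to run on-device.
with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```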

Transitioning from fixed-interval model retraining to automated retraining triggered by model performance offers numerous benefits to organizations, from lower IT costs at a time when cloud costs are rising, to better AI ROI through improved model performance.

Barbara, The Cybersecure Edge Platform for MLOps

Barbara Industrial Edge Platform is a powerful tool that helps organizations simplify and accelerate their Edge ML deployments, making it easy to build, orchestrate, and maintain container-based or native applications across thousands of distributed edge nodes.

The industry's most important data starts ‘at the edge’, across thousands of IoT devices, industrial plants, and machines. Discover how to turn that data into real-time insights and actions with the most efficient, zero-touch, and economical platform.

Request a demonstration.