Machine Learning System Design

Here we talk about ways to test your systems with live data in production to ensure that your updated model indeed works without catastrophic consequences.

Test in production is complementary to monitoring. The goal of both monitoring and test in production is to understand a model’s performance and figure out when to update it.

Continual Learning

The goal of continual learning is to safely and efficiently automate the update. Very few companies use a model that updates itself with every incoming sample in production. If your model is a neural network, learning with every incoming sample makes it susceptible to catastrophic forgetting.

Catastrophic forgetting refers to the tendency of a neural network to completely and abruptly forget previously learnt information upon learning new information.

Learning with every incoming sample can also make training more expensive—most hardware backends today were designed for batch processing, so processing only one sample at a time wastes compute power and cannot exploit data parallelism.

Companies that employ continual learning in production update their models in micro-batches. For example, they might update the existing model after every 512 or 1,024 examples—the optimal number of examples in each micro-batch is task dependent. The updated model shouldn’t be deployed until it’s been evaluated.
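A rough sketch of such a micro-batch update loop in PyTorch (illustrative only; the stand-in model, the synthetic stream and the choice of 512 are assumptions, not the lesson’s example):

```python
# Sketch: update an existing model one micro-batch at a time instead of per sample.
import torch
import torch.nn as nn
import torch.nn.functional as F

MICRO_BATCH = 512                       # task-dependent; 512 or 1,024 are common choices
model = nn.Linear(20, 2)                # stand-in for the model already in production
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def sample_stream(n=5_000):
    """Stand-in for the stream of incoming production samples."""
    for _ in range(n):
        yield torch.randn(20), torch.randint(0, 2, ()).item()

buffer_x, buffer_y = [], []
for x, y in sample_stream():
    buffer_x.append(x)
    buffer_y.append(y)
    if len(buffer_x) < MICRO_BATCH:
        continue

    xb, yb = torch.stack(buffer_x), torch.tensor(buffer_y)
    buffer_x, buffer_y = [], []

    optimizer.zero_grad()
    loss = F.cross_entropy(model(xb), yb)   # one gradient step per micro-batch
    loss.backward()
    optimizer.step()
    # In practice the updated model would be evaluated before it serves any traffic.
```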

Stateless retraining vs stateful training

Continual learning isn’t about the retraining frequency, but the manner in which the model is retrained.

Stateless retraining means the model is trained from scratch each time. Continual learning also allows stateful training—the model continues training on new data. Stateful training is also known as fine-tuning or incremental learning.

Stateful training allows you to update your model with less data. Training a model from scratch tends to require a lot more data than fine-tuning the same model.

With stateful training, it might be possible to avoid storing data altogether, which would be useful for data with strict privacy requirements.

Stateful training doesn’t mean never training from scratch. The companies that have most successfully used stateful training also occasionally train their models from scratch on a large amount of data to recalibrate them. Alternatively, they might train a model from scratch in parallel with stateful training and then combine both updated models using techniques such as a parameter server.
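A minimal sketch of the difference using scikit-learn’s SGDClassifier on synthetic data (both are stand-ins, not the lesson’s example): stateless retraining refits from scratch on all the data, while stateful training keeps the fitted model and continues with partial_fit on the fresh data only.

```python
# Sketch: stateless retraining vs stateful training on synthetic data.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(10_000, 20)), rng.integers(0, 2, 10_000)  # historical data
X_new, y_new = rng.normal(size=(1_000, 20)), rng.integers(0, 2, 1_000)    # fresh data

# Stateless retraining: start from scratch on the full dataset every time.
stateless_model = SGDClassifier(random_state=0)
stateless_model.fit(np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))

# Stateful training: keep the existing model and continue training on new data only.
stateful_model = SGDClassifier(random_state=0)
stateful_model.fit(X_old, y_old)           # the model already in production
stateful_model.partial_fit(X_new, y_new)   # fine-tune on the fresh data only
```

Stateful training touches only the 1,000 new examples, which is why it needs far less data (and compute) per update than refitting on everything.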

Once your infrastructure is set up to allow both stateless retraining and stateful training, the training frequency is just a knob to twist. You can update your models once an hour, once a day or whenever a distribution shift is detected.

Continual learning is about setting up infrastructure in a way that allows you, a data scientist or ML engineer, to update your models whenever needed, whether from scratch or by fine-tuning, and to deploy the update quickly.

Two types of model updates:

  • Model iteration
    • A new feature is added to an existing model architecture or the model architecture is changed.
  • Data iteration
    • The model architecture and features remain the same, but you refresh this model with new data.

Stateful training is mostly applied for data iteration, as changing your model architecture or adding a new feature requires training the resulting model from scratch.

There has been research showing that it might be possible to bypass training from scratch for model iteration by using techniques such as knowledge transfer and model surgery.

Why continual learning?

  • The first use case of continual learning is to combat data distribution shifts, especially when shifts happen suddenly. 
  • Another use case of continual learning is to adapt to rare events.
  • A huge challenge for ML production today that continual learning can help overcome is the continuous cold start problem. The cold start problem arises when your model has to make predictions for a new user without any historical data. Continuous cold start is a generalisation of the cold start problem, as it can happen not just with new users but also with existing users.

Continual Learning challenges

  • Fresh data access challenge 
    • The first challenge is getting fresh data. If you want to update your model every hour, you need new data every hour.
    • The best candidates for continual learning are tasks where you can get natural labels with short feedback loops.
    • Examples of these tasks are dynamic pricing (based on estimated demand and availability), estimating time of arrival, stock price prediction, ads click-through prediction and recommender systems for online content like tweets, songs, short videos, articles, etc.
    • The process of looking back into the logs to extract labels is called label computation. It can be quite costly if the number of logs is large. Label computation can be done with batch processing: e.g., waiting for logs to be deposited into data warehouses first before running a batch job to extract all labels from logs at once. However, this means that we’d need to wait for data to be deposited first, then wait for the next batch job to run.
    • A much faster approach would be to leverage stream processing to extract labels from the real-time transports directly.
    • If your model’s iteration speed is bottlenecked by labelling speed, it’s also possible to speed up the labelling process by leveraging programmatic labelling tools like Snorkel to generate fast labels with minimal human intervention. It might also be possible to leverage crowdsourced labels to quickly annotate fresh data.
  • Evaluation challenge
    • The biggest challenge is making sure that each update is good enough to be deployed.
    • The risks for catastrophic failures amplify with continual learning. The more frequently you update your models, the more opportunities there are for updates to fail.
    • Continual learning makes your models more susceptible to coordinated manipulation and adversarial attacks. Because your models learn online from real-world data, it’s easier for users to feed them malicious data to trick them into learning the wrong things.
    • To avoid bad incidents, it’s crucial to thoroughly test each of your model updates to ensure its performance and safety before deploying the updates to a wider audience.
  • Algorithm challenge
    • To be precise, this challenge only affects matrix-based and tree-based models that need to be updated very fast (e.g., hourly).
    • It’s much easier to adapt models like neural networks than matrix-based and tree-based models to the continual learning paradigm.
    • However, there have been algorithms to create tree-based models that can learn from incremental amounts of data, most notably Hoeffding Tree and its variants Hoeffding Window Tree and Hoeffding Adaptive Tree, but their uses aren’t yet widespread.
    • Feature scaling: instead of using the mean or variance computed over all your data at once, you compute or approximate these statistics incrementally as new data arrives, e.g., with the algorithms outlined in Optimal Quantile Approximation in Streams.
    • Example: sklearn’s StandardScaler has a partial_fit method that allows a feature scaler to be used with running statistics—but the built-in methods are slow and don’t support a wide range of running statistics (a sketch follows this list).
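A small sketch of incremental feature scaling with scikit-learn’s StandardScaler.partial_fit on synthetic micro-batches; the scaler keeps running mean and variance, so no pass over the full dataset is needed.

```python
# Sketch: fit a feature scaler incrementally, one micro-batch at a time.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
scaler = StandardScaler()

for _ in range(10):                                   # micro-batches arriving over time
    batch = rng.normal(loc=3.0, scale=2.0, size=(512, 5))
    scaler.partial_fit(batch)                         # updates running mean and variance

print(scaler.mean_, scaler.var_)                      # statistics seen so far
scaled = scaler.transform(rng.normal(loc=3.0, scale=2.0, size=(4, 5)))
```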

Four stages for continual learning:

  • Manual, stateless retraining
    • In the beginning, the ML team often focuses on developing ML models to solve as many business problems as possible.
    • Because your team is focussing on developing new models, updating existing models takes a backseat. You update an existing model only when the following two conditions are met: the model’s performance has degraded to the point that it’s doing more harm than good and your team has time to update it.
    • The process of updating a model is manual and ad hoc.
  • Automated retraining
    • After a few years, your team has managed to deploy models to solve most of the obvious problems.
    • Your priority is no longer to develop new models, but to maintain and improve existing ones. The ad hoc, manual process of updating models described in the previous stage has grown into a pain point too big to ignore.
    • Your team decides to write a script to automatically execute all the retraining steps. This script is then run periodically using a batch processing framework such as Spark.
    • When creating scripts to automate the retraining process for your system, you need to take into account that different models in your system might require different retraining schedules.
  • Automated stateful training
    • In this stage, you reconfigure your automatic updating script so that, when the model update is kicked off, it first locates the previous checkpoint and loads it into memory before continuing training on this checkpoint.
  • Continual learning
    • Finding the optimal schedule isn’t straightforward and can be situation-dependent.
    • Instead of relying on a fixed schedule, you might want your models to be automatically updated whenever data distributions shift and the model’s performance plummets.
    • The holy grail is when you combine continual learning with edge deployment.
    • You’ll first need a mechanism to trigger model updates (see the sketch after this list). This trigger can be:
      • Time-based
        • For example, every five minutes.
      • Performance-based
        • For example, whenever model performance plummets.
      • Volume-based
        • For example, whenever the total amount of labelled data increases by 5%.
      • Drift-based
        • For example, whenever a major data distribution shift is detected.
    • For this trigger mechanism to work, you’ll need a solid monitoring solution.
    • You’ll also need a solid pipeline to continually evaluate your model updates.
    • The hard part is to ensure that the updated model is working properly.
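A toy sketch of the trigger logic in the final stage; the thresholds and the MonitorSnapshot fields are hypothetical stand-ins for what a real monitoring solution would expose.

```python
# Sketch: combine time-, performance-, volume- and drift-based triggers.
from dataclasses import dataclass

@dataclass
class MonitorSnapshot:                  # hypothetical output of a monitoring solution
    hours_since_last_update: float
    current_accuracy: float
    new_labelled_fraction: float
    drift_detected: bool

def should_update_model(m: MonitorSnapshot) -> bool:
    time_based = m.hours_since_last_update >= 24      # e.g., at least once a day
    performance_based = m.current_accuracy < 0.80     # model performance plummets
    volume_based = m.new_labelled_fraction >= 0.05    # labelled data grew by 5%
    drift_based = m.drift_detected                    # major distribution shift detected
    return time_based or performance_based or volume_based or drift_based

snapshot = MonitorSnapshot(3.0, 0.78, 0.02, False)
print(should_update_model(snapshot))                  # True: the performance trigger fires
```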

How often to update your models?

  • Value of data freshness
    • The question of how often to update a model becomes a lot easier if we know how much the model performance will improve as a result of the update.
    • In practice, you might want your experiments to be much more fine-grained, operating not in months but in weeks, days, even hours or minutes.
  • Model iteration vs data iteration
    • In theory, you can do both types of updates and in practice, you should do both from time to time. However, the more resources you spend in one approach, the fewer resources you can spend in the other.
    • On one hand, if you find that iterating on your data doesn’t give you much performance gain, you should spend your resources on finding a better model. On the other hand, if finding a better model architecture requires 100X the compute for training and gives you a 1% performance gain, whereas updating the same model on data from the last three hours requires only 1X the compute and also gives a 1% performance gain, you’ll be better off iterating on data.
    • It’s important to run experiments to quantify the value of data freshness to your models (one possible setup is sketched after this list).
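One possible setup for such an experiment, sketched here with synthetic, deliberately drifting data (the drift model and the window ages are assumptions): train the same model on data of different ages and compare its performance on today’s data.

```python
# Sketch: measure how much performance decays as the training data gets staler.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_month(t, n=2_000):
    """Synthetic data whose decision boundary drifts a little each month t."""
    X = rng.normal(size=(n, 5))
    w = np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + 0.3 * t    # drifting weights
    y = (X @ w + 0.1 * rng.normal(size=n) > 0).astype(int)
    return X, y

X_now, y_now = make_month(t=6)                 # "today's" evaluation data
for age in [1, 3, 6]:                          # train on data from `age` months ago
    X_old, y_old = make_month(t=6 - age)
    model = LogisticRegression().fit(X_old, y_old)
    print(f"trained on {age}-month-old data -> accuracy now: "
          f"{model.score(X_now, y_now):.3f}")
```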

Test in Production

To sufficiently evaluate your models, you first need a mixture of offline evaluation and online evaluation. To understand why offline evaluation isn’t enough, let’s go over two major test types for offline evaluation: test splits and backtests.

If you update the model to adapt to a new data distribution, it’s not sufficient to evaluate this new model on test splits from the old distribution.

The method of testing a predictive model on data from a specific period of time in the past is known as a backtest.

The question is whether backtests are sufficient to replace static test splits. Not quite. If something went wrong with your data pipeline and some data from the last hour is corrupted, evaluating your model solely on this recent data isn’t sufficient.

Because data distributions shift, the fact that a model does well on the data from the last hour doesn’t mean that it will continue doing well on the data in the future. The only way to know whether a model will do well in production is to deploy it.
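As a minimal sketch of a backtest (the DataFrame, features, and cutoff are made up): train on data up to a cutoff timestamp and evaluate on the window immediately after it.

```python
# Sketch: backtest = evaluate on a specific window of past data, split by time.
import pandas as pd
from sklearn.linear_model import LogisticRegression

n = 10_000
df = pd.DataFrame({
    "ts": pd.date_range("2024-01-01", periods=n, freq="min"),
    "f1": pd.Series(range(n)) % 7,
    "f2": pd.Series(range(n)) % 3,
    "label": (pd.Series(range(n)) % 5 == 0).astype(int),
})

cutoff = pd.Timestamp("2024-01-07")
train = df[df.ts < cutoff]
test = df[(df.ts >= cutoff) & (df.ts < cutoff + pd.Timedelta("1D"))]   # backtest window

model = LogisticRegression().fit(train[["f1", "f2"]], train["label"])
print("backtest accuracy:", model.score(test[["f1", "f2"]], test["label"]))
```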

Techniques:

  • Shadow deployment
    • Shadow deployment might be the safest way to deploy your model or any software update. Shadow deployment works as follows:
      • Deploy the candidate model in parallel with the existing model.
      • For each incoming request, route it to both models to make predictions, but only serve the existing model’s prediction to the user.
      • Log the predictions from the new model for analysis purposes.
    • Only when you’ve found that the new model’s predictions are satisfactory do you replace the existing model with the new one (a serving-time sketch appears after this list).
    • Because you don’t serve the new model’s predictions to users until you’ve made sure they’re satisfactory, the risk of the new model doing something funky is low, at least no higher than with the existing model.
    • However, this technique isn’t always favourable because it’s expensive. It doubles the number of predictions your system has to generate, which generally means doubling your inference compute cost.
  • A/B testing
    • A/B testing is a way to compare two variants of an object, typically by testing responses to these two variants and determining which of the two variants is more effective.
    • A/B testing works as follows:
      • Deploy the candidate model alongside the existing model.
      • A percentage of traffic is routed to the new model for predictions; the rest is routed to the existing model for predictions. It’s common for both variants to serve prediction traffic at the same time. However, there are cases where one model’s predictions might affect another model’s predictions—e.g., in ride-sharing’s dynamic pricing, a model’s predicted prices might influence the number of available drivers and riders, which, in turn, influences the other model’s predictions. In those cases, you might have to run your variants alternately, e.g., serve model A one day and model B the next.
      • Monitor and analyse the predictions and user feedback, if any, from both models to determine whether the difference in the two models’ performance is statistically significant.
    • To do A/B testing the right way requires doing many things right.
      • First, A/B testing consists of a randomised experiment: the traffic routed to each model has to be truly random. If not, the test result will be invalid.
      • Second, your A/B test should be run on a sufficient number of samples to gain enough confidence about the outcome.
    • The gist is that only a statistically significant A/B result lets you conclude that one model is indeed better than the other. To measure statistical significance, A/B testing uses statistical hypothesis testing such as two-sample tests (a sketch follows this list).
    • Statistical significance, while useful, isn’t foolproof.
    • Even if your A/B test result isn’t statistically significant, it doesn’t mean that this A/B test fails.
  • Canary release
    • Canary release is a technique to reduce the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users before rolling it out to the entire infrastructure and making it available to everybody.
    • Steps:
      • Deploy the candidate model alongside the existing model. The candidate model is called the canary.
      • A portion of the traffic is routed to the candidate model.
      • If its performance is satisfactory, increase the traffic to the candidate model. If not, abort the canary and route all the traffic back to the existing model.
      • Stop when either the canary serves all the traffic (the candidate model has replaced the existing model) or when the canary is aborted.
    • The candidate model’s performance is measured against the existing model’s performance according to the metrics you care about. If the candidate model’s key metrics degrade significantly, the canary is aborted and all the traffic will be routed to the existing model.
  • Interleaving experiments
    • In experiments, Netflix found that interleaving reliably identifies the best algorithms with considerably smaller sample size compared to traditional A/B testing.
  • Bandits
    • Multi-armed bandits are algorithms that allow you to balance between exploitation (choosing the slot machine that has paid the most in the past) and exploration (choosing other slot machines that may pay off even more).
    • A/B testing is stateless: you can route traffic to each model without having to know about their current performance.
    • When you have multiple models to evaluate, each model can be considered a slot machine whose payout (i.e., prediction accuracy) you don’t know. Bandits allow you to route traffic to each model in a way that identifies the best model while maximising prediction accuracy for your users. Bandits are stateful: before routing a request to a model, you need to calculate all models’ current performance.
    • Bandits require less data to determine which model is the best and, at the same time, reduce opportunity cost as they route traffic to the better model more quickly.
    • However, bandits are a lot more difficult to implement than A/B testing because they require computing and keeping track of each model’s payoff. As a result, bandit algorithms are not widely used in industry outside a few big tech companies (a minimal epsilon-greedy sketch follows this list).
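The sketches below are illustrative only: every model, dataset, and threshold in them is a made-up stand-in rather than anything prescribed by the lesson.

Shadow deployment at serving time boils down to scoring each request with both models, returning only the existing model’s answer, and logging the candidate’s answer for later analysis:

```python
# Sketch: shadow deployment at serving time with two toy stand-in models.
shadow_log = []

def existing_model(request):          # model currently serving production traffic
    return request["amount"] > 100

def candidate_model(request):         # new model being shadow-tested
    return request["amount"] > 80

def handle_request(request):
    prediction = existing_model(request)            # this is what the user sees
    shadow_prediction = candidate_model(request)    # logged only, never served
    shadow_log.append((request, prediction, shadow_prediction))
    return prediction

print(handle_request({"amount": 90}))               # False: the existing model's answer
```

For A/B testing, the two-sample test mentioned above could be, for instance, a two-sample proportion test on click-through counts, here with statsmodels’ proportions_ztest and made-up numbers:

```python
# Sketch: is the difference in click-through rate between models A and B significant?
from statsmodels.stats.proportion import proportions_ztest

clicks = [530, 571]                  # "successes" for model A and model B (made up)
impressions = [10_000, 10_000]       # traffic served by each variant

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("The difference is statistically significant at the 5% level.")
else:
    print("Not enough evidence that the models differ; keep collecting data.")
```

A bandit router keeps a running payoff estimate per model and routes most traffic to the current best one while still exploring the others. A minimal epsilon-greedy version, assuming a scalar reward per request (e.g., 1 if the prediction later turns out to be correct), might look like this:

```python
# Sketch: epsilon-greedy routing between candidate models.
import random

class EpsilonGreedyRouter:
    def __init__(self, n_models: int, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_models        # requests served by each model
        self.values = [0.0] * n_models      # running payoff estimate per model

    def choose(self) -> int:
        if random.random() < self.epsilon:  # explore: pick a random model
            return random.randrange(len(self.values))
        return max(range(len(self.values)), key=self.values.__getitem__)  # exploit

    def update(self, model_idx: int, reward: float) -> None:
        self.counts[model_idx] += 1
        # incremental mean: new_estimate = old + (reward - old) / n
        self.values[model_idx] += (reward - self.values[model_idx]) / self.counts[model_idx]

router = EpsilonGreedyRouter(n_models=2)
true_accuracy = [0.80, 0.85]            # unknown in practice; used here only to simulate feedback
for _ in range(5_000):
    m = router.choose()
    router.update(m, reward=float(random.random() < true_accuracy[m]))
print(router.counts, [round(v, 3) for v in router.values])
```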