Machine Learning System Design

Deploying a model isn’t the end of the process. A model’s performance degrades over time in production. Once a model has been deployed, we still have to continually monitor its performance to detect issues as well as deploy updates to fix these issues.

Data distribution shifts occur when the data distribution in production diverges from the data distribution the model was exposed to during training.

Causes of ML System Failures

  • A failure happens when one or more expectations of the system are violated.
  • In traditional software, we mostly care about a system’s operational expectations: whether the system executes its logic within the expected operational metrics, e.g., latency and throughput.
  • For an ML system, we care about both its operational metrics and its ML performance metrics.

Software System Failures

Failures that are not specific to ML and could happen to any software system.

  • Dependency failure
    • A software package or a codebase that your system depends on breaks, which leads your system to break.
  • Deployment failure
    • Failures caused by deployment errors, such as when you accidentally deploy the binaries of an older version of your model instead of the current version, or when your systems don’t have the right permissions to read or write certain files.
  • Hardware failures
    • When the hardware that you use to deploy your model, such as CPUs or GPUs, doesn’t behave the way it should.
  • Downtime or crash
    • If a component of your system runs on a server somewhere, such as AWS or another hosted service, and that server goes down, your system will also be down.

ML-specific failures

  • Failures specific to ML systems.
  • Examples: data collection and processing problems, poor hyper-parameters, changes in the training pipeline not correctly replicated in the inference pipeline and vice versa, data distribution shifts that cause a model’s performance to deteriorate over time, edge cases and degenerate feedback loops.
  • Production data differing from training data
    • The assumption is that the unseen data comes from a stationary distribution that is the same as the training data distribution.
    • If the unseen data comes from a different distribution, the model might not generalise well.
    • This assumption is incorrect in most cases for two reasons. 
      • The underlying distribution of the real-world data is unlikely to be the same as the underlying distribution of the training data. Curating a training dataset that can accurately represent the data that a model will encounter in production turns out to be very difficult.
      • The real world isn’t stationary. Things change. Data distributions shift.
  • Edge cases
    • Edge cases are the data samples so extreme that they cause the model to make catastrophic mistakes.
    • Outliers refer to data: an example that differs significantly from other examples. Edge cases refer to performance: an example where a model performs significantly worse than other examples. An outlier can cause a model to perform unusually poorly, which makes it an edge case. However, not all outliers are edge cases.
  • Degenerate feedback loops
    • A degenerate feedback loop is created when a system’s outputs are used to generate the system’s future inputs, which, in turn, influence the system’s future outputs: the predictions themselves influence the feedback, which then influences the next iteration of the model.
    • Degenerate feedback loops are especially common in tasks with natural labels from users, such as recommender systems and ads click-through-rate prediction.
    • Degenerate feedback loops are one reason why popular movies, books or songs keep getting more popular, which makes it hard for new items to break into popular lists.
    • Also called exposure bias, popularity bias, filter bubbles and sometimes echo chambers.
    • Left unattended, degenerate feedback loops can cause your model to perform sub-optimally at best. At worst, they can perpetuate and magnify biases embedded in data.
  • Detecting degenerate feedback loops
    • When a system is offline, degenerate feedback loops are difficult to detect. Degenerate loops result from user feedback and a system won’t have users until it’s online (i.e., deployed to users).
    • For the task of recommender systems, it’s possible to detect degenerate feedback loops by measuring the popularity diversity of a system’s outputs even when the system is offline.
    • If a recommender system is much better at recommending popular items than at recommending less popular items, it likely suffers from popularity bias (a minimal sketch of this check appears after this list).
  • Correcting degenerate feedback loops
    • Randomisation
      • Introducing randomisation in the predictions can reduce their homogeneity.
      • In the case of recommender systems, instead of showing the users only the items that the system ranks highly for them, we show users random items and use their feedback to determine the true quality of these items. This is the approach that TikTok follows.
      • Each new video is randomly assigned an initial pool of traffic (which can be up to hundreds of impressions). This pool of traffic is used to evaluate each video’s unbiased quality to determine whether it should be moved to a bigger pool of traffic or be marked as irrelevant.
      • Randomisation has been shown to improve diversity, but at the cost of user experience.
    • Use positional features
      • If the position in which a prediction is shown affects its feedback in any way, you might want to encode the position information using positional features. Positional features can be numerical (e.g., positions are 1, 2, 3,…) or Boolean (e.g., whether a prediction is shown in the first position or not); a small sketch of the Boolean encoding appears after this list.
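
One way to make the offline popularity-diversity check concrete is the small pandas sketch below. It is an illustration rather than a standard metric: the column names (item_id, hit), the use of historical interaction counts as a popularity proxy and the quartile buckets are all assumptions.

```python
# Minimal sketch of an offline popularity-bias check for a recommender system.
# `eval_df` is assumed to hold one row per recommended item with a binary `hit`
# column (did the user interact with it?); `item_counts` maps item_id to a
# historical interaction count, used here as a rough proxy for popularity.
import pandas as pd

def hit_rate_by_popularity(eval_df: pd.DataFrame, item_counts: pd.Series) -> pd.Series:
    df = eval_df.copy()
    df["popularity"] = df["item_id"].map(item_counts).fillna(0)
    # Rank before bucketing so qcut does not fail on ties (e.g., many zero-count items).
    df["pop_bucket"] = pd.qcut(
        df["popularity"].rank(method="first"),
        q=4,
        labels=["tail", "low", "mid", "head"],
    )
    # A hit rate for "head" items far above the hit rate for "tail" items suggests
    # the system is much better at recommending popular items, i.e., popularity bias.
    return df.groupby("pop_bucket", observed=True)["hit"].mean()
```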
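
The Boolean positional feature mentioned above can be sketched as follows. The column names and the commented-out model call are hypothetical; the key idea is that the feature is filled from impression logs at training time and pinned to a constant at inference time so that position does not leak into the ranking itself.

```python
# Minimal sketch of a Boolean positional feature (assumed column names throughout).
import pandas as pd

# Training time: impression logs record the position at which each item was shown.
impressions = pd.DataFrame({
    "item_id":  [1, 2, 3, 4],
    "position": [1, 2, 1, 3],   # rank shown to the user
    "clicked":  [1, 0, 0, 0],   # label
})
# Encode whether the item was shown in the first position, so the model can
# separate "clicked because relevant" from "clicked because it was on top".
impressions["is_first_position"] = (impressions["position"] == 1).astype(int)

# Inference time: score every candidate as if it were NOT shown in the first
# position, so the learned positional effect does not influence the ranking.
candidates = pd.DataFrame({"item_id": [5, 6, 7]})
candidates["is_first_position"] = 0
# scores = model.predict_proba(build_features(candidates))  # hypothetical model call
```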

Data distribution shifts

  • Data distribution shift refers to the phenomenon in supervised learning when the data a model works with changes over time, which causes this model’s predictions to become less accurate as time passes.
  • The distribution of the data the model is trained on is called the source distribution. The distribution of the data the model runs inference on is called the target distribution.
  • General data distribution shifts
    • Feature change: such as when new features are added, older features are removed, or the set of all possible values of a feature changes.
    • Label schema change is when the set of possible values for the label changes.
    • With regression tasks, label schema change could happen because of changes in the possible range of label values.
    • With classification tasks, label schema change could happen because you have new classes.
    • Label schema change is especially common with high-cardinality tasks.
    • A model might suffer from multiple types of drift, which makes handling them a lot more difficult.
  • Detecting data distribution shifts
    • Data distribution shifts are only a problem if they cause your model’s performance to degrade. So the first idea might be to monitor your model’s accuracy-related metrics—accuracy, F1 score, recall, AUC-ROC, etc.—in production to see whether they have changed.
    • When ground truth labels are unavailable or too delayed to be useful, we can monitor other distributions of interest instead. The distributions of interest are the input distribution P(X), the label distribution P(Y) and the conditional distributions P(X|Y) and P(Y|X).
    • There have been efforts to understand and detect label shifts without labels from the target distribution: Black Box Shift Estimation.
    • In the industry, most drift detection methods focus on detecting changes in the input distribution, especially the distributions of features.
    • Statistical methods 
      • A simple method to detect whether the two distributions are the same is to compare their statistics like min, max, mean, median, variance, various quantiles (such as 5th, 25th, 75th, or 95th quantile), skewness, kurtosis, etc.
      • If those metrics differ significantly, the inference distribution might have shifted from the training distribution. However, if those metrics are similar, there’s no guarantee that there’s no shift.
      • A more sophisticated solution is to use a two-sample hypothesis test, shortened as two-sample test. It’s a test to determine whether the difference between two populations (two sets of data) is statistically significant.
      • A caveat is that just because the difference is statistically significant doesn’t mean that it is practically important.
      • A basic two-sample test is the Kolmogorov–Smirnov test (KS test). It’s a non-parametric statistical test, which means it doesn’t require any parameters of the underlying distribution to work. However, one major drawback of the KS test is that it can only be used for one-dimensional data.
      • Another test is Least-Squares Density Difference (LSDD), an algorithm that is based on the least squares density-difference estimation method.
      • There is also MMD, Maximum Mean Discrepancy, a kernel-based technique for multivariate two-sample testing and its variant Learned Kernel MMD.
      • See Alibi Detect (an open-source package).
      • Because two-sample tests often work better on low-dimensional data than on high-dimensional data, it’s highly recommended that you reduce the dimensionality of your data before performing a two-sample test on it (a minimal sketch of these statistical checks appears after this list).
  • Addressing data distribution shifts
    • How companies address data shifts depends on how sophisticated their ML infrastructure setups are.
    • At some point in the future—say three or six months down the line—they might realise that their initial deployed models have degraded to the point that they do more harm than good. They will then need to adapt their models to the shifted distributions or to replace them with other solutions.
    • Many companies assume that data shifts are inevitable, so they periodically retrain their models—once a month, once a week, or once a day—regardless of the extent of the shift. How to determine the optimal frequency to retrain your models is an important decision that many companies still determine based on gut feelings instead of experimental data.
    • To make a model work with a new distribution in production, there are three main approaches.
      • The first is the approach that currently dominates research: train models using massive datasets. The hope here is that if the training dataset is large enough, the model will be able to learn such a comprehensive distribution that whatever data points the model will encounter in production will likely come from this distribution.
      • The second approach, less popular in research, is to adapt a trained model to a target distribution without requiring new labels.
      • The third approach is what is usually done in the industry today: retrain your model using the labelled data from the target distribution. However, retraining your model is not so straightforward. Retraining can mean retraining your model from scratch on both the old and new data, or continuing to train the existing model on new data. The latter approach is also called fine-tuning (a small sketch of both options appears after this list).
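
To make the statistical checks described above concrete, here is a minimal sketch using NumPy and SciPy. The feature samples, the chosen statistics and the 0.05 significance level are assumptions; packages such as Alibi Detect provide production-ready implementations of these checks and of the multivariate tests (MMD, LSDD).

```python
# Minimal drift-detection sketch for a single one-dimensional feature.
# `train_col` and `prod_col` stand for the same feature sampled from the training
# data and from recent production traffic.
import numpy as np
from scipy import stats

def summary_stats(x: np.ndarray) -> dict:
    """Cheap first pass: basic statistics to compare between the two samples."""
    return {
        "mean": float(np.mean(x)),
        "median": float(np.median(x)),
        "p05": float(np.percentile(x, 5)),
        "p95": float(np.percentile(x, 95)),
        "var": float(np.var(x)),
    }

def ks_drift(train_col: np.ndarray, prod_col: np.ndarray, alpha: float = 0.05) -> bool:
    """Two-sample Kolmogorov-Smirnov test. Returns True if the difference between
    the samples is statistically significant at level alpha, which, as noted above,
    is not the same thing as being practically important."""
    _, p_value = stats.ks_2samp(train_col, prod_col)
    return p_value < alpha

# Example with synthetic data standing in for training vs. production samples.
rng = np.random.default_rng(0)
train_col = rng.normal(loc=0.0, size=5_000)
prod_col = rng.normal(loc=0.3, size=5_000)   # deliberately shifted
print(summary_stats(train_col), summary_stats(prod_col))
print("drift detected:", ks_drift(train_col, prod_col))
```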
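
The two retraining options (from scratch on old plus new data versus fine-tuning on new data only) can be sketched as below, with scikit-learn’s SGDClassifier as a stand-in model and synthetic arrays in place of real data; both choices are purely illustrative.

```python
# Minimal sketch of the two retraining options on synthetic (hypothetical) data.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(500, 4)), rng.integers(0, 2, size=500)
X_new, y_new = rng.normal(loc=0.5, size=(200, 4)), rng.integers(0, 2, size=200)  # shifted data

# Option A: retrain from scratch on both the old and the new data.
model_scratch = SGDClassifier(random_state=0)
model_scratch.fit(np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))

# Option B: continue training the existing model on the new data only (fine-tuning).
model_finetuned = SGDClassifier(random_state=0)
model_finetuned.partial_fit(X_old, y_old, classes=np.unique(y_old))  # original training run
model_finetuned.partial_fit(X_new, y_new)                            # incremental update
```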

Monitoring and observability

  • Monitoring refers to the act of tracking, measuring and logging different metrics that can help us determine when something goes wrong.
  • Observability means setting up our system in a way that gives us visibility into it, so that we can investigate what went wrong.
  • The process of setting up our system in this way is also called instrumentation.
  • Operational metrics: Designed to convey the health of your systems. They are generally divided into three levels: the network the system is run on, the machine the system is run on and the application that the system runs.
  • ML-specific metrics
    • If your system receives any type of user feedback for the predictions it makes—click, hide, purchase, upvote, downvote, favourite, bookmark, share, etc.—you should definitely log and track it.
    • Prediction is the most common artefact to monitor.
    • ML monitoring solutions in the industry focus on tracking changes in features, both the features that a model uses as inputs and the intermediate transformations from raw inputs into final features.
    • The first step of feature monitoring is feature validation: ensuring that your features follow an expected schema.
    • Open-source packages for feature validation include Great Expectations and Deequ (a hand-rolled sketch of the same idea appears after this list).
    • Beyond basic feature validation, you can also use two-sample tests to detect whether the underlying distribution of a feature or a set of features has shifted.
  • Monitoring toolbox
    • Measuring, tracking and interpreting metrics for complex systems is a non-trivial task and engineers rely on a set of tools to help them do so.
    • Logs
      • Traditional software systems rely on logs to record events produced at runtime.
      • The number of logs can grow very large very quickly.
      • When something goes wrong, you’ll need to query your logs for the sequence of events that caused it, a process that can feel like searching for a needle in a haystack.
      • When we log an event, we want to make it as easy as possible to find it later. In a microservice architecture, where a single request passes through many services, this is done by tagging each request with an ID so that its events can be traced across services; this practice is called distributed tracing.
    • Dashboards 
      • A series of numbers might mean nothing to you, but visualising them on a graph might reveal the relationships among these numbers. Dashboards to visualise metrics are critical for monitoring.
      • Another use of dashboards is to make monitoring accessible to non-engineers.
      • Excessive metrics on a dashboard can also be counter-productive, a phenomenon known as dashboard rot.
    • Alerts
      • When our monitoring system detects something suspicious, it’s necessary to alert the right people about it. An alert consists of the following three components:
        • Alert policy
          • This describes the condition for an alert. You might want to create an alert when a metric breaches a threshold, optionally over a certain duration (a small sketch of such a policy appears after this list).
        • Notification channels
          • These describe who is to be notified when the condition is met.
        • Description of the alert
          • This helps the alerted person understand what’s going on. The description should be as detailed as possible.
      • It’s important to set meaningful conditions so that only critical alerts are sent out.
  • Observability
    • Observability refers to bringing better visibility into understanding the complex behaviour of software using outputs collected from the system at run time.
    • A system’s outputs collected at runtime are also called telemetry.
    • Observability allows more fine-grained metrics.
    • In ML, observability encompasses interpretability.
    • Interpretability helps us understand how an ML model works and observability helps us understand how the entire ML system, which includes the ML model, works.
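
To illustrate the feature-validation step mentioned above, the sketch below hand-rolls a schema check with plain pandas. The schema itself (column names, dtypes, ranges) is hypothetical; in practice the open-source packages named earlier (Great Expectations, Deequ) provide this functionality far more robustly.

```python
# Hand-rolled sketch of feature validation: check that a batch of incoming features
# follows an expected schema before it reaches the model. The schema is hypothetical.
import pandas as pd

EXPECTED_SCHEMA = {
    "age":       {"dtype": "int64",   "min": 0,   "max": 120},
    "price_usd": {"dtype": "float64", "min": 0.0, "max": None},
}

def validate_features(batch: pd.DataFrame) -> list[str]:
    """Return a list of human-readable schema violations for this batch."""
    violations = []
    for col, rules in EXPECTED_SCHEMA.items():
        if col not in batch.columns:
            violations.append(f"missing column: {col}")
            continue
        if str(batch[col].dtype) != rules["dtype"]:
            violations.append(f"{col}: expected dtype {rules['dtype']}, got {batch[col].dtype}")
        if batch[col].isna().any():
            violations.append(f"{col}: contains null values")
        if rules["min"] is not None and (batch[col] < rules["min"]).any():
            violations.append(f"{col}: values below {rules['min']}")
        if rules["max"] is not None and (batch[col] > rules["max"]).any():
            violations.append(f"{col}: values above {rules['max']}")
    return violations

# Example: a batch with an out-of-range age and a missing column.
bad_batch = pd.DataFrame({"age": [25, 130, 40]})
print(validate_features(bad_batch))
```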
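
Finally, the "threshold over a duration" alert policy described earlier can be sketched as a small class. The metric name, threshold and window size are arbitrary assumptions, and in a real system this logic would normally live in a monitoring tool and send to a notification channel rather than print.

```python
# Minimal sketch of an alert policy: fire only when a metric stays above a threshold
# for several consecutive checks, to avoid noisy one-off alerts.
from collections import deque

class ThresholdAlertPolicy:
    """Alert when `metric` stays above `threshold` for `window` consecutive checks."""

    def __init__(self, metric: str, threshold: float, window: int = 3):
        self.metric = metric
        self.threshold = threshold
        self.recent = deque(maxlen=window)

    def observe(self, value: float) -> bool:
        self.recent.append(value)
        breached = (
            len(self.recent) == self.recent.maxlen
            and all(v > self.threshold for v in self.recent)
        )
        if breached:
            # In a real system this would go to a notification channel
            # (email, Slack, PagerDuty) with a detailed description of the alert.
            print(f"ALERT: {self.metric} above {self.threshold} "
                  f"for {self.recent.maxlen} consecutive checks")
        return breached

# Example: alert if p99 latency stays above 500 ms for three checks in a row.
policy = ThresholdAlertPolicy(metric="p99_latency_ms", threshold=500.0)
for latency in [420, 510, 530, 560]:
    policy.observe(latency)
```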