Machine Learning System Design

Model development is an iterative process. After each iteration, you’ll want to compare your model’s performance against its performance in previous iterations and evaluate how suitable this iteration is for production.

When selecting a model for your problem, you don’t choose from every possible model out there, but usually focus on a set of models suitable for your problem. Knowledge of common ML tasks and the typical approaches to solve them is essential in this process. Time and compute power are limited resources and you have to be strategic about what models you select.

When considering what model to use, it’s important to consider not only the model’s performance, measured by metrics such as accuracy, F1 score and log loss, but also its other properties, such as how much data, compute and time it needs to train, its inference latency and how interpretable it is. For example, a simple logistic regression model might have lower accuracy than a complex neural network, but it requires less labelled data to start, it’s much faster to train, it’s much easier to deploy and it’s much easier to explain why it’s making certain predictions.

Non-neural network algorithms tend to be more explainable than neural networks.

Tips for model selection:

  • Avoid the state-of-the-art trap
    • Researchers often only evaluate models in academic settings, which means that a model being state of the art often means that it performs better than existing models on some static datasets.
    • It doesn’t mean that the state-of-the-art model will be fast enough or cheap enough for you to implement. It doesn’t even mean that this model will perform better than other models on your data.
    • While it’s essential to stay up to date with new technologies and beneficial to evaluate them for your business, the most important thing to do when solving a problem is finding solutions that can solve that problem.
    • If there’s a solution that can solve your problem that is much cheaper and simpler than state-of-the-art models, use the simpler solution.
  • Start with simpler models
    • Simpler models are easier to deploy and deploying your model early allows you to validate that your prediction pipeline is consistent with your training pipeline.
    • Starting with something simple and adding more complex components step-by-step makes it easier to understand your model and debug it.
    • A simple model may serve as a baseline to which you can compare your more complex models.
  • Avoid human biases in selecting models
    • There are a lot of human biases in evaluating models. Part of the process of evaluating an ML architecture is to experiment with different features and different sets of hyper-parameters to find the best model of that architecture. If an engineer is more excited about an architecture, they will likely spend a lot more time experimenting with it, which might result in better-performing models for that architecture.
    • When comparing different architectures, it’s important to compare them under comparable setups. If you run 100 experiments for an architecture, it’s not fair to only run a couple of experiments for the architecture you’re evaluating it against. You might need to run 100 experiments for the other architecture too.
    • Because the performance of a model architecture depends heavily on the context it’s evaluated in—e.g., task, training data, test data, hyper-parameters, etc.—it’s extremely difficult to make claims that a model architecture is better than another architecture. The claim might be true in a context, but unlikely true for all possible contexts.
  • Evaluate good performance now versus good performance later
    • The best model now does not always mean the best model two months from now.
    • A tree-based model might work better now because you don’t have a ton of data yet, but two months from now, you might be able to double your amount of training data and your neural network might perform much better.
    • If a learning algorithm suffers from high bias, getting more training data by itself won’t help much, whereas if it suffers from high variance, getting more training data is likely to help.
    • A simple way to estimate how your model’s performance might change with more data is to use learning curves.
    • A learning curve of a model is a plot of its performance—e.g., training loss, training accuracy, validation accuracy—against the number of training samples it uses.
    • The learning curve won’t help you estimate exactly how much performance gain you can get from having more training data, but it can give you a sense of whether you can expect any performance gain at all from more training data (see the sketch after this list).
    • While evaluating models, you might want to take into account their potential for improvements in the near future and how easy/difficult it is to achieve those improvements.
  • Evaluate trade-offs
    • Understanding what’s more important in the performance of your ML system will help you choose the most suitable model.
    • Trade-offs
      • false positives vs false negatives
      • compute requirement vs accuracy
      • interpretability vs performance
  • Understand your model’s assumptions
    • Understanding what assumptions a model makes and whether your data satisfies those assumptions can help you evaluate which model works best for your use case.
    • Common assumptions
      • Prediction assumption 
        • Every model that aims to predict an output Y from an input X makes the assumption that it’s possible to predict Y based on X.
      • IID (independent and identically distributed)
        • Many models, neural networks included, assume that all data samples are independently drawn from the same distribution.
      • Smoothness
        • Every supervised machine learning method assumes that there’s a set of functions that can transform inputs into outputs such that similar inputs are transformed into similar outputs.
        • If an input X produces an output Y, then an input close to X would produce an output proportionally close to Y.
      • Tractability
        • Let X be the input and Z be the latent representation of X. Every generative model makes the assumption that it’s tractable to compute the probability P(Z|X).
      • Boundaries
        • A linear classifier assumes that decision boundaries are linear.
      • Conditional independence
        • A Naive Bayes classifier assumes that the attribute values are independent of each other given the class.
      • Normally distributed
        • Many statistical methods assume that data is normally distributed.
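
To make the learning-curve idea from the model-selection tips above concrete, here is a minimal sketch using scikit-learn’s learning_curve; the synthetic dataset, the logistic regression model and the split sizes are placeholders, not recommendations.

```python
# Minimal learning-curve sketch (assumes scikit-learn is installed).
# The dataset and model below are placeholders for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10% ... 100% of the training split
    cv=5,
    scoring="accuracy",
)

# If validation accuracy is still climbing at the largest training size,
# more data is likely to help; if it has plateaued, probably not.
for n, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{n:>5} samples -> validation accuracy {score:.3f}")
```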

Ensembles

  • After developing one single model, you might think about how to continue improving its performance. One method that has consistently given a performance boost is to use an ensemble of multiple models instead of just an individual model to make predictions.
  • Each model in the ensemble is called a base learner.
  • For example, for the task of predicting whether an email is spam, you might have three different models. The final prediction for each email is the majority vote of all three models, so if at least two base learners output spam, the email will be classified as spam (see the sketch after this list).
  • Ensembling methods are less favoured in production because ensembles are more complex to deploy and harder to maintain. 
  • When creating an ensemble, the less correlation there is among base learners, the better the ensemble will be.
  • There are three ways to create an ensemble: bagging, boosting and stacking.
  • In addition to helping boost performance, ensemble methods such as boosting and bagging, together with resampling, have been shown in several survey papers to help with imbalanced datasets.
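
As a concrete illustration of the majority-vote ensemble described above, here is a minimal sketch using scikit-learn’s VotingClassifier; the synthetic dataset stands in for a real spam-classification dataset and the three base learners are placeholders.

```python
# Minimal majority-vote ensemble sketch (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB()),
        ("tree", DecisionTreeClassifier(max_depth=5)),
    ],
    voting="hard",  # hard voting = majority vote of the three base learners
)
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))
```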

Bagging (bootstrap aggregating)

  • Designed to improve both the training stability and accuracy of ML algorithms.
  • It reduces variance and helps to avoid overfitting.
  • Given a dataset, instead of training one classifier on the entire dataset, you sample with replacement to create different datasets, called bootstraps, and train a classification or regression model on each of these bootstraps. Sampling with replacement ensures that each bootstrap is created independently from its peers.
  • In the case of a classification problem, the final prediction is decided by the majority vote of all models.
  • If it is a regression task, the final prediction is the average of all models’ predictions.
  • Bagging generally improves unstable methods, such as neural networks, classification and regression trees and subset selection in linear regression. However, it can mildly degrade the performance of stable methods such as k-nearest neighbours.
  • A random forest is an example of bagging. A random forest is a collection of decision trees constructed by both bagging and feature randomness, where each tree can pick only from a random subset of features to use.
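
A minimal bagging sketch, assuming scikit-learn 1.2 or later; it contrasts plain bagging of decision trees with a random forest, which adds feature randomness at each split. The dataset and hyper-parameters are placeholders.

```python
# Minimal bagging sketch: each base tree is trained on a bootstrap sample of the
# training data and predictions are aggregated by majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain bagging of decision trees ("estimator" is "base_estimator" in scikit-learn < 1.2).
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,  # sample with replacement to create the bootstraps
    random_state=0,
).fit(X_train, y_train)

# Random forest = bagging + feature randomness at each split.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

print("bagged trees:", bagging.score(X_test, y_test))
print("random forest:", forest.score(X_test, y_test))
```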

Boosting

  • Boosting is a family of iterative ensemble algorithms that convert weak learners to strong ones.
  • Each learner in this ensemble is trained on the same set of samples, but the samples are weighted differently among iterations. As a result, future weak learners focus more on the examples that previous weak learners misclassified.
  • Steps:
    • You start by training the first weak classifier on the original dataset.
    • Samples are re-weighted based on how well the first classifier classifies them, e.g., misclassified samples are given higher weight.
    • Train the second classifier on this re-weighted dataset. Your ensemble now consists of the first and the second classifiers. 
    • Samples are weighted based on how well the ensemble classifies them.
    • Train the third classifier on this re-weighted dataset. Add the third classifier to the ensemble.
    • Repeat for as many iterations as needed.
    • Form the final strong classifier as a weighted combination of the existing classifiers—classifiers with smaller training errors have higher weights.
  • Examples:
    • Gradient boosting machine (GBM): produces a prediction model typically from weak decision trees. It builds the model in a stage-wise fashion like other boosting methods do and it generalises them by allowing optimisation of an arbitrary differentiable loss function.
    • XGBoost, a variant of GBM, is used in a wide range of tasks, from classification and ranking to the discovery of the Higgs boson, and is a staple of ML competitions.
    • LightGBM is a distributed gradient boosting framework that supports parallel learning, which generally allows faster training on large datasets.
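
A minimal boosting sketch using scikit-learn’s GradientBoostingClassifier as a stand-in; XGBoost and LightGBM expose similar fit/predict interfaces. The dataset and hyper-parameters are placeholders.

```python
# Minimal gradient boosting sketch (assumes scikit-learn is installed).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=200,   # number of boosting iterations (weak decision trees)
    learning_rate=0.1,  # how much each new tree contributes to the ensemble
    max_depth=3,        # keep individual trees weak
    random_state=0,
).fit(X_train, y_train)

print("GBM accuracy:", gbm.score(X_test, y_test))
```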

Stacking

  • Stacking means that you train base learners on the training data, then create a meta-learner that combines the outputs of the base learners to produce the final predictions.
  • The meta-learner can be as simple as a heuristic: you take the majority vote (for classification tasks) or the average (for regression tasks) of all base learners. It can also be another model, such as a logistic regression model or a linear regression model.
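
A minimal stacking sketch, assuming scikit-learn: three base learners whose outputs are combined by a logistic regression meta-learner. The dataset and choice of learners are placeholders.

```python
# Minimal stacking sketch: base learners + a logistic regression meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
        ("nb", GaussianNB()),
    ],
    final_estimator=LogisticRegression(),  # the meta-learner
).fit(X_train, y_train)

print("stacked ensemble accuracy:", stack.score(X_test, y_test))
```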

Experiment tracking, versioning and debugging

  • It’s important to keep track of all the definitions needed to re-create an experiment and its relevant artefacts.
  • An artefact is a file generated during an experiment—examples of artefacts can be files that show the loss curve, evaluation loss graph, logs or intermediate results of a model throughout a training process. This enables you to compare different experiments and choose the one best suited for your needs.
  • The process of tracking the progress and results of an experiment is called experiment tracking.
  • The process of logging all the details of an experiment for the purpose of possibly recreating it later or comparing it with other experiments is called versioning.
  • Experiment tracking
    • Many problems can arise during the training process, including loss not decreasing, overfitting, underfitting, fluctuating weight values, dead neurons and running out of memory.
    • It’s important to track what’s going on during training not only to detect and address these issues but also to evaluate whether your model is learning anything useful.
    • List of things to consider tracking for each experiment during its training process:
      • The loss curve corresponding to the train split and each of the eval splits.
      • The model performance metrics that you care about on all non-test splits, such as accuracy, F1 score, perplexity, etc.
      • The log of corresponding samples, predictions and ground truth labels. This comes in handy for ad hoc analytics and sanity checks.
      • The speed of your model, evaluated by the number of steps per second or, if your data is text, the number of tokens processed per second.
      • System performance metrics such as memory usage and CPU/GPU utilisation. They’re important to identify bottlenecks and avoid wasting system resources.
      • The values over time of any parameter and hyper-parameter whose changes can affect your model’s performance, such as the learning rate if you use a learning rate schedule; gradient norms (both globally and per layer), especially if you’re clipping your gradient norms; and weight norm, especially if you’re doing weight decay.
    • In theory, it’s not a bad idea to track everything you can.
    • Tracking gives you observability into the state of your model.
    • However, in practice, due to the limitations of tooling today, it can be overwhelming to track too many things and tracking less important things can distract you from tracking what is really important.
    • Experiment tracking enables comparison across experiments. By observing how a certain change in a component affects the model’s performance, you gain some understanding into what that component does.
    • A simple way to track your experiments is to automatically make copies of all the code files needed for an experiment and log all outputs with their timestamps (see the sketch after this list).
  • Versioning
    • ML systems are part code, part data, so you need to not only version your code but your data as well.
    • Data versioning is challenging:
      • Data is often much larger than code.
      • There is still confusion about what exactly constitutes a diff when we version data.
      • It’s unclear how to resolve merge conflicts.
      • Data regulations like GDPR make versioning data complicated.
  • Debugging ML models
    • Debugging ML models can be especially frustrating.
      • ML models fail silently. The code compiles. The loss decreases as it should. The correct functions are called. The predictions are made, but the predictions are wrong. The developers don’t notice the errors. And worse, users don’t either and use the predictions as if the application was functioning as it should.
      • Even when you think you’ve found the bug, it can be frustratingly slow to validate whether the bug has been fixed. In some cases, you can’t even be sure whether the bugs are fixed until the model is deployed to the users.
      • Debugging ML models is hard because of their cross-functional complexity. There are many components in an ML system: data, labels, features, ML algorithms, code, infrastructure, etc. These different components might be owned by different teams. For example, data is managed by data engineers, labels by subject matter experts, ML algorithms by data scientists and infrastructure by ML engineers or the ML platform team. When an error occurs, it could be because of any of these components or a combination of them, making it hard to know where to look or who should be looking into it. 
    • Things that might cause an ML model to fail:
      • Theoretical constraints
        • Each model comes with its own assumptions about the data and the features it uses. A model might fail because the data it learns from doesn’t conform to its assumptions. For example, you might use a linear model for data whose decision boundary isn’t linear.
      • Poor implementation of model
        • The model might be a good fit for the data, but there could be bugs in the model implementation.
      • Poor choice of hyper-parameters
        • With the same model, one set of hyper-parameters can give you a state-of-the-art result while another set might cause the model to never converge. The model could be a great fit for your data and its implementation may be correct, but a poor set of hyper-parameters might render it useless.
      • Data problems
        • There are many things that could go wrong in data collection and preprocessing that might cause your models to perform poorly, such as data samples and labels being incorrectly paired, noisy labels, features normalised using outdated statistics and more.
      • Poor choice of features
        • There might be many possible features for your models to learn from. Too many features might cause your models to overfit to the training data or cause data leakage. Too few features might lack predictive power to allow your models to make good predictions.
    • Debugging should be both preventive and curative. You should have healthy practices to minimise the opportunities for bugs to proliferate as well as a procedure for detecting, locating and fixing bugs. Having the discipline to follow both the best practices and the debugging procedure is crucial in developing, implementing and deploying ML models.
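
A minimal sketch of the simple copy-the-code-and-log-outputs approach to experiment tracking mentioned above; the directory layout, function names and metric names are illustrative assumptions, not a standard tool, and dedicated trackers would normally replace this.

```python
# Minimal experiment-tracking sketch: snapshot the code used for a run and
# append timestamped metrics as JSON lines.
import json
import shutil
import time
from pathlib import Path


def start_experiment(name: str, code_files: list[str]) -> Path:
    """Create a run directory and snapshot the code files used for this run."""
    run_dir = Path("experiments") / f"{name}-{time.strftime('%Y%m%d-%H%M%S')}"
    (run_dir / "code").mkdir(parents=True, exist_ok=True)
    for f in code_files:
        shutil.copy(f, run_dir / "code")
    return run_dir


def log_metrics(run_dir: Path, step: int, **metrics) -> None:
    """Append timestamped metrics (loss, accuracy, etc.) for later comparison."""
    record = {"timestamp": time.time(), "step": step, **metrics}
    with open(run_dir / "metrics.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")


# Example usage inside a training loop (file names are hypothetical):
# run_dir = start_experiment("baseline-logreg", ["train.py", "features.py"])
# log_metrics(run_dir, step=100, train_loss=0.42, val_accuracy=0.81)
```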

Distributed training

  • It’s common to train a model using data that doesn’t fit into memory.
  • When your data doesn’t fit into memory, your algorithms for pre-processing (e.g., zero-centring, normalising, whitening), shuffling and batching data will need to run out of core and in parallel.
  • Out-of-core algorithms are algorithms that are designed to process data that are too large to fit into a computer’s main memory at once.
  • When a sample of your data is large (e.g., when one machine can only handle a few samples at a time), you might only be able to work with a small batch size, which leads to instability for gradient descent-based optimisation.
  • In some cases, a data sample is so large it can’t even fit into memory and you will have to use something like gradient checkpointing, a technique that leverages the memory footprint and compute trade-off to make your system do more computation with less memory (see the sketch after this list).
  • Even when a sample fits into memory, using checkpointing can allow you to fit more samples into a batch, which might allow you to train your model faster.
  • Data parallelism
    • You split your data across multiple machines, train your model on all of them and accumulate gradients.
    • A challenging problem is how to accurately and effectively accumulate gradients from different machines. As each machine produces its own gradient, if your model waits for all of them to finish a run—synchronous stochastic gradient descent (SGD)—stragglers will cause the entire system to slow down, wasting time and resources. The straggler problem grows with the number of machines, as the more workers there are, the more likely it is that at least one worker will run unusually slowly in a given iteration.
    • If your model updates the weight using the gradient from each machine separately—asynchronous SGD—gradient staleness might become a problem because the gradients from one machine have caused the weights to change before the gradients from another machine have come in.
    • In theory, asynchronous SGD converges but requires more steps than synchronous SGD. However, in practice, when the number of weights is large, gradient updates tend to be sparse, meaning most gradient updates only modify small fractions of the parameters and it’s less likely that two gradient updates from different machines will modify the same weights. When gradient updates are sparse, gradient staleness becomes less of a problem and the model converges similarly for both synchronous and asynchronous SGD.
    • To oversimplify the calculation, if training an epoch on a machine takes 1M steps, training on 1,000 machines might take only 1,000 steps. An intuitive approach is to scale up the learning rate to account for more learning at each step, but we also can’t make the learning rate too big as it will lead to unstable convergence. In practice, increasing the batch size past a certain point yields diminishing returns.
    • With the same model setup, the main worker sometimes uses a lot more resources than the other workers. The easiest way to address this, though not the most effective, is to use a smaller batch size on the main worker and a larger batch size on the other workers.
  • Model parallelism
    • Model parallelism is when different components of your model are trained on different machines.
    • Pipeline parallelism is a clever technique to make different components of a model on different machines run more in parallel. The key idea is to break the computation of each machine into multiple parts.
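
A minimal sketch of the gradient checkpointing idea mentioned above, assuming a recent version of PyTorch; the model blocks and tensor sizes are placeholders.

```python
# Gradient checkpointing sketch: activations inside the wrapped blocks are not
# stored during the forward pass; they are recomputed during backward, trading
# extra compute for a smaller memory footprint.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 4096))
block2 = nn.Sequential(nn.ReLU(), nn.Linear(4096, 1024))

x = torch.randn(32, 1024, requires_grad=True)

# Forward pass through checkpointed blocks (use_reentrant=False needs a recent PyTorch).
h = checkpoint(block1, x, use_reentrant=False)
out = checkpoint(block2, h, use_reentrant=False)

out.sum().backward()  # backward pass recomputes the checkpointed activations
print(x.grad.shape)
```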

AutoML

AutoML refers to automating the process of finding ML algorithms to solve real-world problems. A popular form of AutoML in production is hyper-parameter tuning.

  • Soft AutoML: hyper-parameter tuning
    • The goal of hyper-parameter tuning is to find the optimal set of hyper-parameters for a given model within a search space.
    • Examples of tuning tools built around popular frameworks: scikit-learn with auto-sklearn, TensorFlow with Keras Tuner and Ray with Tune.
    • Popular methods for hyper-parameter tuning include random search, grid search and Bayesian optimisation (see the random-search sketch after this list).
    • When tuning hyper-parameters, keep in mind that a model’s performance might be more sensitive to the change in one hyper-parameter than another and therefore sensitive hyper-parameters should be more carefully tuned.
    • Graduate student descent (GSD) is a technique in which a graduate student fiddles around with the hyper-parameters until the model works.
  • Hard AutoML: Architecture search and learnt optimiser
    • The goal here is to give your algorithm some building blocks and let it figure out how to combine them. This area of research is called architecture search, or neural architecture search (NAS).
    • A NAS setup consists of three components:
      • Search space
        • Defines possible model architectures—i.e., building blocks to choose from and constraints on how they can be combined. 
      • Performance estimation strategy
        • To evaluate the performance of a candidate architecture without having to train each candidate architecture from scratch until convergence. When we have a large number of candidate architectures, say 1,000, training all of them until convergence can be costly.
      • Search strategy
        • To explore the search space. A simple approach is random search—randomly choosing from all possible configurations—which is unpopular because it’s prohibitively expensive even for NAS. Common approaches include reinforcement learning (rewarding the choices that improve the performance estimation) and evolution (adding mutations to an architecture, choosing the best-performing ones, adding mutations to them and so on).
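
A minimal random-search sketch for the hyper-parameter tuning described above, assuming scikit-learn and SciPy; the model and search space are placeholders.

```python
# Random search over a hyper-parameter space with cross-validation.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 500),
        "max_depth": randint(2, 20),
        "min_samples_leaf": randint(1, 10),
    },
    n_iter=25,      # number of random configurations to try
    cv=5,
    scoring="f1",
    random_state=0,
)
search.fit(X, y)
print("best hyper-parameters:", search.best_params_)
print("best cross-validated F1:", search.best_score_)
```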

Model offline evaluation

  • Partner with the business team to develop metrics for model evaluation that are more relevant to your company’s business.
  • Ideally, the evaluation methods should be the same during both development and production. But in many cases, the ideal is impossible because during development, you have ground truth labels, but in production, you don’t.
  • For certain tasks, it’s possible to infer or approximate labels in production based on users’ feedback.
  • For other tasks, you might not be able to evaluate your model’s performance in production directly and might have to rely on extensive monitoring to detect changes and failures in your ML system’s performance.
  • Once your model is deployed, you’ll need to continue monitoring and testing your model in production.

Baselines

  • Evaluation metrics, by themselves, mean little. When evaluating your model, it’s essential to know the baseline you’re evaluating against.
  • Five baselines that might be useful across use cases:
    • Random baseline
      • The predictions are generated at random following a specific distribution, which can be the uniform distribution or the task’s label distribution.
    • Simple heuristic
      • If you just make predictions based on simple heuristics, what performance would you expect?
    • Zero rule baseline
      • The zero rule baseline is a special case of the simple heuristic baseline where your baseline model always predicts the most common class (see the sketch after this list).
    • Human baseline
      • In many cases, the goal of ML is to automate what would have been otherwise done by humans, so it’s useful to know how your model performs compared to human experts.
    • Existing solutions
      • Compare your model against whatever currently solves the problem, whether that’s a set of simple heuristics, a third-party product or an older model. A new model doesn’t always have to beat the existing solution to be worth using, for example if it’s much cheaper or simpler to operate, but you need the comparison to make that judgement.
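
A minimal sketch of the random and zero rule baselines, using scikit-learn’s DummyClassifier; the imbalanced synthetic dataset is a placeholder.

```python
# Random and zero rule baselines on an imbalanced binary classification task.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)  # ~90/10 label imbalance
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

baselines = {
    "random (uniform)": DummyClassifier(strategy="uniform", random_state=0),
    "random (label distribution)": DummyClassifier(strategy="stratified", random_state=0),
    "zero rule (most frequent class)": DummyClassifier(strategy="most_frequent"),
}
for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    print(f"{name}: accuracy {clf.score(X_test, y_test):.3f}")
```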

Evaluation methods

  • In academic settings, when evaluating ML models, people tend to fixate on their performance metrics. However, in production, we also want our models to be robust, fair, calibrated and overall make sense.
  • Perturbation tests
    • Ideally, the inputs used to develop your model should be similar to the inputs your model will have to work with in production, but it’s not possible in many cases. This is especially true when data collection is expensive or difficult and the best available data you have access to for training is still very different from your real-world data.
    • The idea is to make small changes to your test data (e.g., adding noise) and see how those changes affect your model’s performance; you might then want to choose the model that works best on the perturbed data instead of the one that works best on the clean data.
    • The more sensitive your model is to noise, the harder it will be to maintain it. It also makes your model susceptible to adversarial attack.
  • Invariance tests
    • Certain changes to the inputs shouldn’t lead to changes in the output. If these happen, there are biases in your model, which might render it unusable no matter how good its performance is.
    • To avoid these biases, one solution is to keep the inputs the same but change the sensitive information to see if the outputs change.
    • Better still, you should exclude the sensitive information from the features used to train the model in the first place.
  • Directional expectation tests
    • Certain changes to the inputs should, however, cause predictable changes in outputs.
    • If the outputs change opposite to the expected direction, your model might not be learning the right thing and you need to investigate it further before deploying it.
  • Model calibration
    • If a model predicts that team A will beat team B with a 70% probability, and out of the 1,000 times these two teams play together, team A only wins 60% of the time, then we say that this model isn’t calibrated. A calibrated model should predict that team A wins with a 60% probability (see the calibration-curve sketch after this list).
  • Confidence measurement
    • Confidence measurement can be considered a way to think about the usefulness threshold for each individual prediction.
    • Unlike most metrics, which measure a model’s performance on average over a dataset, confidence measurement is a metric for each individual sample.
  • Slice-based evaluation
    • Slicing means to separate your data into subsets and look at your model’s performance on each subset separately.
    • Focusing only on overall performance is harmful not only because of the potential public backlash when a model underperforms on certain groups of users, but also because it blinds the company to huge potential model improvements.
    • A fascinating and seemingly counterintuitive reason why slice-based evaluation is crucial is Simpson’s paradox, a phenomenon in which a trend appears in several groups of data but disappears or reverses when the groups are combined.
    • To track your model’s performance on critical slices, you first need to know what your critical slices are. There are three main approaches to discovering critical slices in your data:
      • Heuristic-based
        • Slice your data using domain knowledge you have of the data and the task at hand.
      • Error analysis
        • Manually go through misclassified examples and find patterns among them.
      • Slice finder
        • The process generally starts with generating slice candidates with algorithms such as beam search, clustering, or decision tree, then pruning out clearly bad candidates for slices and then ranking the candidates that are left.
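
A minimal sketch of checking model calibration with scikit-learn’s calibration_curve, as mentioned under model calibration above; the dataset and model are placeholders.

```python
# Compare predicted probabilities with observed frequencies, bin by bin.
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# For a well-calibrated model, the observed fraction of positives in each bin
# should be close to the mean predicted probability in that bin.
frac_positives, mean_predicted = calibration_curve(y_test, probs, n_bins=10)
for pred, obs in zip(mean_predicted, frac_positives):
    print(f"predicted ~{pred:.2f} -> observed {obs:.2f}")
```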