“Deploy” is a loose term that generally means running your model and making it accessible.
Software doesn’t age like fine wine. It ages poorly. The phenomenon in which a software program degrades over time even if nothing seems to have changed is known as “software rot” or “bit rot”.
ML systems suffer from what are known as data distribution shifts: the data distribution your model encounters in production differs from the data distribution it was trained on.
Since a model’s performance decays over time, we want to update it as fast as possible.
The two main ways a model generates and serves its predictions to users are online prediction and batch prediction. The process of generating predictions is called inference.
Batch vs online predictions
- Online prediction is when predictions are generated and returned as soon as requests for these predictions arrive.
- Online prediction is also known as on-demand prediction. Example: Google Translate.
- Batch prediction is when predictions are generated periodically or whenever triggered. The predictions are stored somewhere, such as in SQL tables or an in-memory database, and retrieved as needed. Example: Netflix movie recommendation.
- Batch prediction is also known as asynchronous prediction: predictions are generated asynchronously with requests.
- Features computed from historical data, such as data in databases and data warehouses, are batch features. Features computed from streaming data—data in real-time transports—are streaming features. In batch prediction, only batch features are used. In online prediction, however, it’s possible to use both batch features and streaming features.
- Online prediction and batch prediction don’t have to be mutually exclusive. One hybrid solution is to precompute predictions for popular queries, then generate predictions online for less popular queries (sketched below).
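A minimal sketch of this hybrid setup; the `precomputed_predictions` table and `predict_online` function are placeholders for your own storage and serving components:

```python
# Hypothetical hybrid serving: popular queries are answered from precomputed
# (batch) predictions; everything else falls back to online prediction.

# Stand-in for predictions precomputed by a batch job and loaded from storage
# (e.g. a SQL table or an in-memory key-value store).
precomputed_predictions = {
    "stranger things": ["dark", "the oa", "black mirror"],
}

def predict_online(query: str) -> list[str]:
    # Stand-in for running the model on demand.
    return [f"recommendation generated online for {query!r}"]

def serve(query: str) -> list[str]:
    # Batch prediction if available, online prediction otherwise.
    if query in precomputed_predictions:
        return precomputed_predictions[query]
    return predict_online(query)

print(serve("stranger things"))  # served from the precomputed table
print(serve("obscure title"))    # generated online on demand
```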
From batch prediction to online prediction
- A downside to online prediction is that your model might take too long to generate predictions.
- Batch prediction works well in situations when you want to generate a lot of predictions and don’t need the results immediately.
- The problem with batch prediction is that it makes your model less responsive to changes in users’ preferences.
- Another challenge associated with batch prediction is that you need to know the requests in advance to generate predictions.
- Examples where online prediction is crucial include high-frequency trading, autonomous vehicles, voice assistants, unlocking your phone using face or fingerprints, fall detection for elderly care and fraud detection.
- Batch prediction can be a workaround when online prediction isn’t cheap or fast enough.
- As hardware becomes more customised and powerful, and as better techniques are developed to allow faster and cheaper online prediction, online prediction might become the default.
Having two different pipelines (one for training and another for inference) to process your data is a common cause of bugs in ML production. Problems occur when changes in one pipeline aren’t correctly replicated in the other, leading to the two pipelines extracting two different sets of features. This is especially common when the two pipelines are maintained by different teams, for example when the ML team maintains the batch pipeline for training while the deployment team maintains the stream pipeline for inference.
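One common mitigation is to keep a single source of truth for feature logic and import it from both pipelines. A minimal sketch, with a made-up `extract_features` function and toy fraud-detection features:

```python
# Hypothetical example: one feature definition shared by the training (batch)
# pipeline and the inference (streaming) pipeline, so the two cannot drift apart.

def extract_features(transaction: dict) -> list[float]:
    # Single source of truth for feature logic.
    return [
        float(transaction["amount"]),
        float(transaction["is_international"]),
        float(len(transaction.get("merchant", ""))),
    ]

def build_training_examples(historical_transactions: list[dict]) -> list[list[float]]:
    # Batch pipeline: features computed from historical data.
    return [extract_features(t) for t in historical_transactions]

def handle_stream_event(event: dict, model) -> float:
    # Streaming pipeline: the same function applied to each incoming event.
    return model.predict([extract_features(event)])[0]
```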
Model export
- Exporting a model means converting the model into a format that can be used by another application (serialisation).
- There are two parts of a model that you can export: model definition and model’s parameter values.
- In TensorFlow, you might use tf.keras.Model.save() to export your model into TensorFlow’s SavedModel format. In PyTorch, you might use torch.onnx.export() to export your model into ONNX format.
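For example, a minimal PyTorch-to-ONNX export; the toy model and file name here are placeholders:

```python
import torch
import torch.nn as nn

# A toy model standing in for the model you want to deploy.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
model.eval()

# ONNX export works by tracing the model with a dummy input of the expected shape.
dummy_input = torch.randn(1, 10)
torch.onnx.export(model, dummy_input, "model.onnx")
# The resulting model.onnx contains both the model definition (the graph) and the
# parameter values, and can be loaded by any ONNX-compatible runtime.
```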
Model compression
- If the model you want to deploy takes too long to generate predictions, there are three main approaches to reduce its inference latency:
- Make it do inference faster (inference optimisation).
- Make the model smaller (model compression).
- Make the hardware it’s deployed on run faster.
- Model compression techniques (minimal code sketches of each follow this list):
- Low-rank factorisation
- The key idea behind low-rank factorisation is to replace high-dimensional tensors with lower-dimensional tensors.
- This method has been used to develop smaller models with significant acceleration compared to standard models. However, it tends to be specific to certain types of models (e.g., compact convolutional filters are specific to convolutional neural networks) and requires a lot of architectural knowledge to design, so it’s not widely applicable to many use cases yet.
- Knowledge Distillation
- It is a method in which a small model (student) is trained to mimic a larger model or ensemble of models (teacher).
- Even though the student is often trained after a pre-trained teacher, both may also be trained at the same time.
- Example: DistilBERT, which reduces the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster.
- The advantage of this approach is that it can work regardless of the architectural differences between the teacher and the student networks.
- The disadvantage of this approach is that it’s highly dependent on the availability of a teacher network.
- This method is also sensitive to applications and model architectures and therefore hasn’t found wide usage in production.
- Pruning
- Pruning was a method originally used for decision trees, where you remove the sections of a tree that are noncritical and redundant for classification.
- Pruning, in the context of neural networks, has two meanings.
- One is to remove entire nodes of a neural network, which means changing its architecture and reducing its number of parameters.
- The more common meaning is to find the parameters least useful to predictions and set them to 0. In this case, pruning doesn’t reduce the total number of parameters, only the number of non-zero parameters; the architecture of the neural network remains the same. This helps reduce a model’s size because pruning makes a neural network sparser, and a sparse architecture tends to require less storage space than a dense one.
- Experiments show that pruning techniques can reduce the non-zero parameter counts of trained networks by over 90%, decreasing storage requirements and improving computational performance of inference without compromising overall accuracy. However, pruning can introduce biases into your model.
- Quantisation
- Quantisation is the most general and commonly used model compression method. It’s straightforward to do and generalises over tasks and architectures.
- Quantisation reduces a model’s size by using fewer bits to represent its parameters.
- Quantisation not only reduces memory footprint but also improves the computation speed.
- Reducing the number of bits to represent your numbers means that you can represent a smaller range of values. For values outside that range, you’ll have to round them and/or scale them to be in range. Rounding numbers leads to rounding errors, and small rounding errors can lead to big performance changes. You also run the risk of rounding/scaling your numbers such that they underflow or overflow, rendering them 0.
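A minimal sketch of low-rank factorisation, assuming PyTorch and a plain linear layer (the rank and layer sizes are arbitrary): one weight matrix is replaced by the product of two smaller ones via truncated SVD.

```python
import torch
import torch.nn as nn

def low_rank_factorise(layer: nn.Linear, rank: int) -> nn.Sequential:
    # Truncated SVD: W (out x in) is approximated by U_k @ diag(S_k) @ Vh_k,
    # so one big layer becomes two smaller ones with rank*(in+out) weights.
    U, S, Vh = torch.linalg.svd(layer.weight.data, full_matrices=False)
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = (torch.diag(S[:rank]) @ Vh[:rank]).contiguous()
    second.weight.data = U[:, :rank].contiguous()
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)

original = nn.Linear(512, 512)                        # 512*512 = 262,144 weights
compressed = low_rank_factorise(original, rank=32)    # 2 * 512*32 = 32,768 weights
```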
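A minimal sketch of a knowledge distillation loss in PyTorch (the temperature `T` and weight `alpha` are illustrative hyperparameters): the student is pushed towards the teacher’s softened output distribution as well as the ground-truth labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between the softened teacher and student
    # distributions (scaled by T^2 to keep gradient magnitudes comparable).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```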
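A minimal sketch of magnitude-based pruning, using PyTorch’s torch.nn.utils.prune utilities on a toy linear layer:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)

# Set the 90% of weights with the smallest magnitude to 0 (unstructured pruning).
prune.l1_unstructured(layer, name="weight", amount=0.9)

# Fold the pruning mask back into the weight tensor.
prune.remove(layer, "weight")

# The architecture is unchanged; the weight tensor is simply mostly zeros now.
sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.0%}")
```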
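A minimal sketch of post-training quantisation, using PyTorch’s dynamic quantisation on a toy model (the layer sizes are arbitrary): the weights of the Linear layers are stored as 8-bit integers instead of 32-bit floats.

```python
import os
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))

# Dynamic quantisation: Linear weights are converted to int8; activations are
# quantised on the fly at inference time. Rounding to 8 bits introduces the
# rounding errors discussed above, traded off against a roughly 4x smaller model.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

torch.save(model.state_dict(), "fp32.pt")
torch.save(quantized.state_dict(), "int8.pt")
print(os.path.getsize("fp32.pt"), "vs", os.path.getsize("int8.pt"), "bytes")
```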
ML on the cloud or on the edge
ML on the cloud means a large chunk of computation is done on the cloud, either public clouds or private clouds. Cost is a big downside to cloud deployment.
ML on the edge means a large chunk of computation is done on consumer devices—such as browsers, phones, laptops, smartwatches, cars, security cameras, robots, embedded devices, FPGAs (field programmable gate arrays) and ASICs (application-specific integrated circuits)—which are also known as edge devices.
The more computation is done on the edge, the less is required on the cloud, and the less companies have to pay for cloud servers.
There are many properties that make edge computing appealing.
- It allows your applications to run where cloud computing cannot. Edge computing allows your models to work in situations where there are no internet connections or where the connections are unreliable, such as in rural areas.
- When your models are already on consumers’ devices, you can worry less about network latency.
- Putting your models on the edge is also appealing when handling sensitive user data.
- Edge computing makes it easier to comply with regulations, like GDPR, about how user data can be transferred or stored. While edge computing might reduce privacy concerns, it doesn’t eliminate them altogether.
To move computation to the edge, the edge devices have to be powerful enough to handle the computation, have enough storage and memory to store ML models and load them into memory, as well as have enough battery or be connected to an energy source to power the application for a reasonable amount of time.