In the ML world, infrastructure is the set of fundamental facilities that support the development and maintenance of ML systems.
Fundamental facilities
- Storage and compute
- The storage layer is where data is collected and stored. The compute layer provides the compute needed to run your ML workloads, such as training a model and computing features.
- Resource management
- Resource management consists of tools to schedule and orchestrate your workloads so that you make the most of your available compute resources. Examples of tools in this category include Airflow, Kubeflow and Metaflow (see the orchestration sketch after this list).
- ML platform
- This provides tools to aid the development of ML applications, such as model stores, feature stores and monitoring tools. Examples of tools in this category include SageMaker and MLflow.
- Development environment
- It is where code is written and experiments are run. Code needs to be versioned and tested. Experiments need to be tracked.
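To make the resource-management layer concrete, below is a minimal sketch of an Airflow DAG that schedules a daily feature-computation step followed by a training step. The DAG name, task names and function bodies are illustrative placeholders rather than anything prescribed in this lesson, and exact parameter names vary between Airflow versions.

```python
# Sketch: orchestrating a daily ML workload with Airflow.
# compute_features and train_model are hypothetical placeholders for your own
# workload code; the DAG id and schedule are illustrative only.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def compute_features():
    # Hypothetical: read raw data from the storage layer, write features back.
    pass


def train_model():
    # Hypothetical: train a model on the computed features.
    pass


with DAG(
    dag_id="daily_training_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # run once per day
    catchup=False,
) as dag:
    features = PythonOperator(task_id="compute_features", python_callable=compute_features)
    training = PythonOperator(task_id="train_model", python_callable=train_model)

    features >> training  # only train after features have been computed
```

Kubeflow and Metaflow expose similar DAG-style abstractions; the common idea is declaring the dependency between steps and letting the orchestrator schedule them on available compute.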
Model deployment
- A deployment service can help with both pushing your models and their dependencies to production and exposing your models as endpoints.
- All major cloud providers offer tools for deployment: AWS with SageMaker, GCP with Vertex AI, Azure with Azure ML.
- Outside the major clouds, there are also open source and startup tools for model deployment, such as MLflow Models and Seldon.
- When looking into a deployment tool, it’s important to consider how easy it is to do both online prediction and batch prediction with the tool (a sketch contrasting the two follows this list).
- Many companies have separate deployment pipelines for online prediction and batch prediction.
- An open problem with model deployment is how to ensure the quality of a model before it’s deployed.
- When choosing a deployment service, you might want to check whether this service makes it easy for you to perform the tests that you want.
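To illustrate the online-versus-batch distinction, here is a minimal sketch, assuming FastAPI and two hypothetical helpers (`load_model`, `featurize`), of how the same predict logic can back both an online endpoint and a batch job. It is an illustration, not a tool recommendation from the lesson.

```python
# Sketch: one predict function serving both online and batch prediction.
# FastAPI, load_model and featurize are illustrative assumptions.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel


def load_model():
    # Hypothetical: in practice, load the trained model from blob storage.
    # A trivial stand-in object keeps the sketch runnable.
    class StubModel:
        def predict(self, features):
            return [0.0 for _ in features]
    return StubModel()


def featurize(record: dict) -> list:
    # Hypothetical: turn a raw request payload into the model's feature vector.
    return [float(record.get("amount", 0.0))]


model = load_model()
app = FastAPI()


class PredictRequest(BaseModel):
    records: List[dict]


@app.post("/predict")
def predict_online(request: PredictRequest) -> dict:
    """Online prediction: score a small request with low latency."""
    features = [featurize(r) for r in request.records]
    return {"predictions": model.predict(features)}


def predict_batch(records: List[dict]) -> list:
    """Batch prediction: score a large collection offline, e.g. on a schedule."""
    features = [featurize(r) for r in records]
    return model.predict(features)
```

Keeping the featurise and predict logic shared between the two paths is one way to avoid maintaining separate deployment pipelines for online and batch prediction.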
Model store
- Storing the model alone in blob storage isn’t enough. To help with debugging and maintenance, it’s important to track as much information associated with a model as possible.
- Eight types of artefacts that you might want to store (a logging sketch follows this list):
- Model definition
- This is the information needed to create the shape of the model, e.g., what loss function it uses. If it’s a neural network, this includes how many hidden layers it has and how many parameters are in each layer.
- Model parameters
- These are the actual values of the parameters of your model. These values are then combined with the model definition to re-create a model that can be used to make predictions. Some frameworks allow you to export both the parameters and the model definition together.
- Featurise and predict functions
- Given a prediction request, how do you extract features and input these features into the model to get back a prediction? The featurise and predict functions provide the instructions to do so. These functions are usually wrapped in endpoints.
- Dependencies
- The dependencies—e.g., Python version, Python packages—needed to run your model are usually packaged together into a container.
- Data
- The data used to train this model might be pointers to the location where the data is stored or the name/version of your data. If you use tools like DVC to version your data, this can be the DVC commit that generated the data.
- Model generation code
- This is the code that specifies how your model was created, such as:
- The frameworks it used.
- How it was trained.
- Details on how the train/validate/test splits were created.
- The number of experiments run.
- The range of hyper-parameters considered.
- The actual set of hyper-parameters the final model used.
- Very often, data scientists generate models by writing code in notebooks. Companies with more mature pipelines make their data scientists commit the model generation code into their Git repos on GitHub or GitLab. However, in many companies, this process is ad hoc and data scientists don’t even check in their notebooks. If the data scientist responsible for the model loses the notebook, quits or goes on vacation, there’s no way to map a model in production to the code that generated it for debugging or maintenance.
- Experiment artefacts
- These are the artefacts generated during the model development process. These artefacts can be graphs like the loss curve or raw numbers like the model’s performance on the test set.
- Tags
- This includes tags to help with model discovery and filtering, such as owner (the person or the team who is the owner of this model) or task (the business problem this model solves, like fraud detection).
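As a rough sketch of how several of these artefact types could be recorded in practice, the snippet below uses MLflow's tracking API (parameters, metrics, tags and a logged model). The specific parameter names, tag values, metric value and the DVC commit string are illustrative assumptions, and the exact MLflow calls may differ slightly between versions.

```python
# Sketch: recording model artefacts with MLflow. Parameter names, tags and
# values below are illustrative assumptions, not from the lesson.
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression

X_train, y_train = [[0.0], [1.0]], [0, 1]          # stand-in training data
model = LogisticRegression(C=1.0)                   # model definition + hyper-parameters
model.fit(X_train, y_train)

with mlflow.start_run():
    # Model generation code / hyper-parameters actually used
    mlflow.log_params({"algorithm": "logistic_regression", "C": 1.0})
    # Tags for discovery and filtering, plus a pointer to the data version
    mlflow.set_tags({
        "owner": "fraud-team",
        "task": "fraud_detection",
        "data_version": "dvc:abc1234",               # hypothetical DVC commit
    })
    # Experiment artefacts, e.g. performance on the test set
    mlflow.log_metric("test_accuracy", 0.92)
    # Model parameters and definition, serialised together with dependencies
    mlflow.sklearn.log_model(model, "model")
```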
Feature Store
- At its core, there are three main problems that a feature store can help address:
- Feature management
- Feature transformation/computation
- Feature engineering logic, after being defined, needs to be computed.
- If the computation of a feature isn’t too expensive, it might be acceptable to compute it each time it is required by a model. However, if the computation is expensive, you might want to execute it only once (the first time it is required) and then store the result for future use.
- A feature store can help with both performing feature computation and storing the results of this computation. In this capacity, a feature store acts like a data warehouse.
- Feature consistency
- A key selling point of modern feature stores is that they unify the logic for both batch features and streaming features, ensuring consistency between features during training and inference (see the sketch after this list).
- Feast is an open source feature store that works well with batch features but not streaming features.
- Tecton is a fully managed feature store that promises to handle both batch features and online features, but adoption has been slow because it requires deep integration.
- Platforms like SageMaker and Databricks also offer their own interpretations of feature stores.
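To illustrate the training/serving consistency problem that feature stores aim to solve, here is a plain-Python sketch, with hypothetical transaction fields, in which a single featurisation function is shared by the batch (training) path and the online (inference) path. A feature store generalises this idea of defining feature logic once and reusing it everywhere.

```python
# Sketch: keeping features consistent by defining the transformation once and
# reusing it for both batch (training) and online (inference) paths.
# The field names and transformations are illustrative assumptions.
from datetime import datetime, timezone
from typing import List


def transaction_features(txn: dict) -> dict:
    """Single source of truth for the feature logic."""
    return {
        "amount_bucket": min(int(txn["amount"]) // 100, 10),
        "hour_of_day": txn["timestamp"].hour,
        "is_foreign": int(txn["country"] != "US"),
    }


def build_training_set(transactions: List[dict]) -> List[dict]:
    """Batch path: compute features over historical data for training."""
    return [transaction_features(t) for t in transactions]


def serve_features(txn: dict) -> dict:
    """Online path: compute the same features for a live request."""
    return transaction_features(txn)


if __name__ == "__main__":
    txn = {
        "amount": 250.0,
        "timestamp": datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc),
        "country": "DE",
    }
    print(build_training_set([txn]))  # batch
    print(serve_features(txn))        # online: identical feature values
```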
Build vs buy
- Your build versus buy decisions depend on many factors such as:
- Stage of your company.
- In the beginning, you might want to leverage vendor solutions to get started as quickly as possible so that you can focus your limited resources on the core offerings of your product. As your use cases grow, however, vendor costs might become exorbitant and it might be cheaper for you to invest in your own solution.
- Focus or competitive advantages of your company
- If it’s something you want to be really good at, you should manage that in-house. If not, you could use a vendor.
- Many tech companies where technology is the competitive advantage, and whose strong engineering teams prefer to have control over their stacks, tend to bias towards building. If they use a managed service, they might prefer it to be modular and customisable so that they can plug and play with any component.
- Maturity of the available tools
- For instance, your team might decide that you need a model store and would have preferred to use a vendor, but there’s no vendor mature enough for your needs, so you have to build your own model store, perhaps on top of an open source solution.
- In-house, custom infrastructure also makes it hard to adopt new technologies because of integration issues.