Machine Learning System Design

Data is messy, complex, unpredictable and potentially treacherous. If not handled properly, it can easily sink your entire ML operation. 

Data is full of potential biases. Biases can be introduced during collecting, sampling, or labelling. Historical data might be embedded with human biases, and ML models trained on this data can perpetuate them.

Sampling

Understanding different sampling methods and how they are used in our workflow can, first, help us avoid potential sampling biases and, second, help us choose methods that improve the efficiency of the data we sample.

There are two families of sampling: non-probability sampling and random sampling.

Non-probability Sampling

Non-probability sampling is when the selection of data isn’t based on any probability criteria. Here are some common non-probability sampling methods:

  • Convenience sampling 
    • Samples of data are selected based on their availability. This sampling method is popular because, well, it’s convenient. 
  • Snowball sampling 
    • Future samples are selected based on existing samples. For example, to scrape legitimate Twitter accounts without having access to Twitter databases, you start with a small number of accounts, then you scrape all the accounts they follow and so on. 
  • Judgment sampling 
    • Experts decide what samples to include. 
  • Quota sampling  
    • You select samples based on quotas for certain slices of data without any randomisation. 

The samples selected by non-probability criteria are not representative of the real-world data and therefore are riddled with selection biases. 

Language models are often trained not with data that is representative of all possible texts but with data that can be easily collected—Wikipedia, Common Crawl, Reddit.

Other examples are data for sentiment analysis (often collected from reviews, which over-represent users motivated enough to leave one) and for self-driving cars (collected mostly in the areas and weather conditions where the cars happen to operate).

Non-probability sampling can be a quick and easy way to gather your initial data to get your project off the ground. However, for reliable models, you might want to use probability-based sampling (discussed below).

Random Sampling

Simple random sampling

  • You give all samples in the population equal probabilities of being selected.
  • The advantage of this method is that it’s easy to implement. The drawback is that rare categories of data might not appear in your selection.
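In Python, this is one call to the standard library (a minimal sketch with a toy population):

```python
import random

population = list(range(10_000))  # toy population of sample IDs

random.seed(0)  # for reproducibility
sample = random.sample(population, k=100)  # every item equally likely to be picked

# Each item had the same 100/10,000 chance of selection, but a category
# with only a handful of members may not appear in the sample at all.
```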

Stratified sampling

  • You can first divide your population into the groups that you care about and sample from each group separately.
  • One drawback of this sampling method is that it isn’t always possible to divide all samples into distinct groups, for example in multilabel tasks where a sample can belong to multiple groups.
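A minimal sketch of stratified sampling with the standard library, assuming a toy population with one rare and one common class:

```python
import random
from collections import defaultdict

random.seed(0)
# Toy labelled population: (sample_id, class_label); 1% of items are "rare".
population = [(i, "rare" if i % 100 == 0 else "common") for i in range(10_000)]

# Group by the stratum we care about, then sample from each group separately.
groups = defaultdict(list)
for sample_id, label in population:
    groups[label].append(sample_id)

# Draw 1% from every stratum so rare classes are guaranteed representation.
stratified = {
    label: random.sample(members, k=max(1, len(members) // 100))
    for label, members in groups.items()
}
```

With simple random sampling at the same rate, the rare class could easily be missed entirely; here it is guaranteed at least one sample.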

Weighted sampling

  • Each sample is given a weight, which determines the probability of it being selected.
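In Python, `random.choices` implements exactly this (a sketch with made-up weights):

```python
import random

random.seed(0)
items = ["a", "b", "c"]
weights = [0.1, 0.3, 0.6]  # per-item selection probabilities

# random.choices draws with replacement according to the given weights.
draws = random.choices(items, weights=weights, k=10_000)

# Empirically, "c" should appear roughly 60% of the time.
freq_c = draws.count("c") / len(draws)
```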

Reservoir sampling (streaming data; useful in production)

  • Put the first k elements into the reservoir.
  • For each incoming nth element, generate a random number i such that 1 ≤ i ≤ n.
  • If 1 ≤ i ≤ k: replace the ith element in the reservoir with the nth element. Else, do nothing.
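The three steps above translate directly into a short function (a sketch; the name `reservoir_sample` is chosen here for illustration):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for n, element in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(element)      # step 1: fill with the first k elements
        else:
            i = random.randint(1, n)       # step 2: random i with 1 <= i <= n
            if i <= k:
                reservoir[i - 1] = element  # step 3: replace; otherwise discard
    return reservoir

random.seed(0)
sample = reservoir_sample(range(1_000_000), k=10)
```

At any point, every element seen so far has the same k/n probability of being in the reservoir, which is why this works for streams whose length isn’t known in advance.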

Importance sampling

  • It allows us to sample from a distribution when we only have access to another distribution.
  • One example where importance sampling is used in ML is policy-based reinforcement learning.
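The core idea, reweighting draws from a proposal distribution q by the ratio p(x)/q(x), can be sketched as follows (toy densities chosen for illustration; the true answer, E_p[x²] for a standard normal, is 1):

```python
import math
import random

def p(x):
    """Target density we cannot sample from directly: standard normal."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def q(x):
    """Proposal density we can sample from: uniform on [-5, 5]."""
    return 0.1 if -5 <= x <= 5 else 0.0

def f(x):
    return x * x  # quantity of interest: E_p[x^2] = 1 for a standard normal

random.seed(0)
draws = [random.uniform(-5, 5) for _ in range(100_000)]

# Each draw from q is reweighted by the likelihood ratio p(x) / q(x).
estimate = sum(f(x) * p(x) / q(x) for x in draws) / len(draws)
```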
 

Labelling 

Most ML models in production today are supervised, which means that they need labelled data to learn from. The performance of an ML model still depends heavily on the quality and quantity of the labelled data it’s trained on. 

  • Hand labels 
    • Hand-labelling data can be expensive, especially if subject matter expertise is required. 
    • Hand labelling poses a threat to data privacy.  
    • Hand labelling is slow. Slow labelling leads to slow iteration speed and makes your model less adaptive to changing environments and requirements. 
    • Label ambiguity or label multiplicity: multiple conflicting labels for a data instance. 
      • Disagreements among annotators are extremely common. 
      • To minimise the disagreement among annotators, it’s important to first have a clear problem definition. 
      • You need to incorporate that definition into the annotators’ training to make sure that all annotators understand the rules. 
    • Data lineage 
      • Indiscriminately using data from multiple sources, generated with different annotators, without examining their quality can cause your model to fail mysteriously. 
      • It’s good practice to keep track of the origin of each of your data samples as well as its labels, a technique known as data lineage.
      • Data lineage helps you both flag potential biases in your data and debug your models.
  • Natural labels
    • Tasks with natural labels are tasks where the model’s predictions can be automatically evaluated, or partially evaluated, by the system. Examples: estimating time of arrival for a certain route on Google Maps, stock price prediction, recommendation systems.
    • When building a product recommender system, many companies focus on optimising for clicks, which give them a higher volume of feedback to evaluate their models. However, some companies focus on purchases, which gives them a stronger signal that is also more correlated to their business metrics (e.g., revenue from product sales). Both approaches are valid. There’s no definite answer to what type of feedback you should optimise for your use case and it merits serious discussions between all stakeholders involved.
    • Fraud detection is an example of a task with long feedback loops.

Handling lack of labels

A number of techniques have been developed to address the challenges in acquiring sufficient high-quality labels.

  • Weak supervision
    • Heuristics, developed with subject matter expertise, are used to label data programmatically.
    • In theory, you don’t need any hand labels for weak supervision. However, to get a sense of how accurate your labelling functions are, a small number of hand labels is recommended.
    • Weak supervision can be especially useful when your data has strict privacy requirements.
    • Weak supervision is a simple but powerful paradigm. However, it’s not perfect. In some cases, the labels obtained by weak supervision might be too noisy to be useful. But even in these cases, weak supervision can be a good way to get you started when you want to explore the effectiveness of ML without wanting to invest too much in hand labelling up front.
    • See Snorkel.
  • Semi-supervision 
    • Semi-supervision requires an initial set of labels.
    • A classic semi-supervision method is self-training. You start by training a model on your existing set of labelled data and use this model to make predictions for unlabelled samples. Assuming predictions with high raw probability scores are correct, you add them to your training set and train a new model on the expanded set, repeating until you’re satisfied with the results.
    • Another semi-supervision method assumes that data samples that share similar characteristics share the same labels. 
    • In most cases, the similarity can only be discovered by more complex methods, such as clustering or k-nearest neighbours.
    • Another popular semi-supervision method is called perturbation-based method. It’s based on the assumption that small perturbations to a sample shouldn’t change its label. 
  • Transfer learning 
    • Transfer learning refers to the family of methods where a model developed for a task is reused as the starting point for a model on a second task. 
  • Active learning 
    • Active learning is a method for improving the efficiency of data labelling.
    • Instead of randomly labelling data samples, you label the samples that are most helpful to your models according to some metrics or heuristics. 
    • Metrics or heuristics used for this include 
      • Uncertainty measurement – label the examples that your model is least certain about.
      • Disagreement among multiple candidate models (query-by-committee).
      • Choosing samples that, if trained on them, will give the highest gradient updates or will reduce the loss the most.
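The first heuristic, uncertainty measurement, can be sketched using entropy over a model’s predicted class probabilities (the sample pool and probabilities below are made up for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution: higher = less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical model outputs over a pool of unlabelled samples.
pool_predictions = {
    "sample_1": [0.98, 0.01, 0.01],  # model is confident
    "sample_2": [0.40, 0.35, 0.25],  # model is uncertain
    "sample_3": [0.70, 0.20, 0.10],
}

# Send the most uncertain samples to annotators first.
to_label = sorted(pool_predictions,
                  key=lambda s: entropy(pool_predictions[s]),
                  reverse=True)
```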

Class Imbalance

Class imbalance typically refers to a problem in classification tasks where there is a substantial difference in the number of samples in each class of the training data. For example, in a training dataset for the task of detecting lung cancer from X-ray images, 99.99% of the X-rays might be of normal lungs and only 0.01% might contain cancerous cells.

Class imbalance can make learning difficult for the following three reasons.

  1. Class imbalance often means there’s insufficient signal for your model to learn to detect the minority class.
  2. Class imbalance makes it easier for your model to get stuck in a non-optimal solution by exploiting a simple heuristic instead of learning anything useful about the underlying pattern of the data.
  3. Class imbalance leads to asymmetric costs of error—the cost of a wrong prediction on a sample of the rare class might be much higher than a wrong prediction on a sample of the majority class.

Outside the cases where class imbalance is inherent in the problem, class imbalance can also be caused by biases during the sampling process. Consider the case when you want to create training data to detect whether an email is spam. You decide to use all the anonymised emails from your company’s email database. According to Talos Intelligence, as of May 2021, nearly 85% of all emails are spam. But most spam emails were filtered out before they reached your company’s database, so in your dataset, only a small percentage is spam.

Another cause for class imbalance, though less common, is due to labelling errors. Annotators might have read the instructions wrong or followed the wrong instructions (thinking there are only two classes, POSITIVE and NEGATIVE, while there are actually three), or simply made errors. Whenever faced with the problem of class imbalance, it’s important to examine your data to understand the causes of it.

Handling class imbalance

Sensitivity to imbalance increases with the complexity of the problem; non-complex, linearly separable problems are unaffected by all levels of class imbalance.

Three approaches to handling class imbalance:

  • Choosing the right metrics for your problem.
  • Data-level methods: Changing the data distribution to make it less imbalanced.
  • Algorithm-level methods: Changing your learning method to make it more robust to class imbalance.

The precision-recall curve gives a more informative picture of an algorithm’s performance on tasks with heavy class imbalance.
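A small worked example (made-up counts) shows why accuracy misleads on imbalanced data while precision and recall do not:

```python
# Imbalanced test set: 990 negatives, 10 positives.
# The model predicts 8 positives: 6 true positives and 2 false positives.
tp, fp, fn, tn = 6, 2, 4, 988

accuracy = (tp + tn) / (tp + fp + fn + tn)  # dominated by the majority class
precision = tp / (tp + fp)                  # of predicted positives, how many are right
recall = tp / (tp + fn)                     # of actual positives, how many are found
```

Accuracy is 99.4% even though the model misses 40% of the positives; precision (0.75) and recall (0.6) expose that gap directly.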

  • Data-level methods: Resampling 
    • Data-level methods modify the distribution of the training data to reduce the level of imbalance to make it easier for the model to learn. 
    • Oversampling: adding more instances from the minority class. 
    • Undersampling: removing instances of the majority class.
    • A method of undersampling low-dimensional data (Tomek links): you find pairs of samples from opposite classes that are close in proximity and remove the sample of the majority class in each pair. While this makes the decision boundary clearer and arguably helps models learn the boundary better, it may make the model less robust because the model doesn’t get to learn from the subtleties of the true decision boundary.
    • A method of oversampling low-dimensional data is SMOTE (synthetic minority oversampling technique). It synthesises novel samples of the minority class through sampling convex combinations of existing data points within the minority class. 
    • When you resample your training data, never evaluate your model on resampled data, since it will cause your model to overfit to that resampled distribution. 
    • Undersampling runs the risk of losing important data from removing data. Oversampling runs the risk of overfitting on training data, especially if the added copies of the minority class are replicas of existing data. 
    • Sophisticated sampling techniques have been developed to mitigate these risks. 
      • Two-phase learning: You first train your model on the resampled data. This resampled data can be obtained by randomly undersampling large classes until each class has only N instances. You then fine-tune your model on the original data. 
      • Dynamic sampling: Oversample the low-performing classes and undersample the high-performing classes during the training process. The method aims to show the model less of what it has already learnt and more of what it has not. 
  • Algorithm-level methods
    • Algorithm-level methods keep the training data distribution intact but alter the algorithm to make it more robust to class imbalance.
    • Because the loss function (or the cost function) guides the learning process, many algorithm-level methods involve adjustment to the loss function.
    • By giving the training instances we care about higher weight, we can make the model focus more on learning these instances.
    • Cost-sensitive learning 
      • Misclassification of different classes incurs different costs; the individual loss function is modified to take this varying cost into account.
      • The problem with this loss function is that you have to manually define the cost matrix, which is different for different tasks at different scales.
    • Class-balanced loss 
      • What might happen with a model trained on an imbalanced data set is that it’ll bias towards majority classes and make wrong predictions on minority classes. What if we punish the model for making wrong predictions on minority classes to correct this bias?
      • In its vanilla form, we can make the weight of each class inversely proportional to the number of samples in that class, so that the rarer classes have higher weights.
      • A more sophisticated version of this loss can take into account the overlap among existing samples, such as class-balanced loss based on effective number of samples.
    • Focal loss 
      • In our data, some examples are easier to classify than others and our model might learn to classify them quickly. We want to incentivise our model to focus on learning the samples it still has difficulty classifying. What if we adjust the loss so that if a sample has a lower probability of being right, it’ll have a higher weight? This is exactly what focal loss does.
  • In practice, ensembles have shown to help with the class imbalance problem. However, we don’t include ensembling in this section because class imbalance isn’t usually why ensembles are used.
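The vanilla inverse-frequency weighting mentioned under class-balanced loss can be computed in a few lines (this matches the `class_weight="balanced"` convention in scikit-learn):

```python
from collections import Counter

labels = ["normal"] * 990 + ["cancer"] * 10  # toy imbalanced label set
counts = Counter(labels)
n_total, n_classes = len(labels), len(counts)

# Each class weight is inversely proportional to its sample count,
# normalised so that a perfectly balanced dataset gives every class weight 1.
class_weights = {c: n_total / (n_classes * n) for c, n in counts.items()}
```

Here the rare "cancer" class gets weight 50.0 versus roughly 0.505 for "normal", so each rare-class mistake contributes about 99x more to the loss.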

Data Augmentation

  • Data augmentation is a family of techniques that are used to increase the amount of training data.
  • Augmented data can make our models more robust to noise and even adversarial attacks.
  • The techniques depend heavily on the data format, as image manipulation is different from text manipulation.
  • Simple label-preserving transformations 
    • In computer vision, you may modify the image by cropping, flipping, rotating, inverting (horizontally or vertically), erasing part of the image and more.
    • In NLP, you can randomly replace a word with a similar word, assuming that this replacement wouldn’t change the meaning or the sentiment of the sentence.
    • Similar words can be found either with a dictionary of synonymous words or by finding words whose embeddings are close to each other in a word embedding space.
  • Perturbation (adding noise)
    • Neural networks, in general, are sensitive to noise.
    • Using deceptive data to trick a neural network into making wrong predictions is called an adversarial attack.
    • Adding noisy samples to training data can help models recognise the weak spots in their learned decision boundary and improve their performance.
    • Noisy samples can be created by either adding random noise or by a search strategy.
    • DeepFool finds the minimum possible noise injection needed to cause a misclassification with high confidence. This type of augmentation is called adversarial augmentation.
  • Data synthesis 
    • It’s possible to synthesise some training data to boost a model’s performance.
    • In NLP, templates can be a cheap way to bootstrap your model.
    • In computer vision, a straightforward way to synthesise new data is to combine existing examples with discrete labels to generate continuous labels (the idea behind mixup).
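The last bullet describes mixup-style interpolation: blend two examples and their one-hot labels so the synthesised label becomes continuous. A minimal sketch on plain vectors (names and values are illustrative; in practice this is applied per training batch to image tensors):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two examples; the label becomes a soft mixture in proportion lam."""
    lam = random.betavariate(alpha, alpha)  # mixing ratio in (0, 1)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

random.seed(0)
x_mixed, y_mixed = mixup([1.0, 0.0, 2.0], [1, 0],   # example of class 0 (one-hot)
                         [0.0, 4.0, 2.0], [0, 1])   # example of class 1 (one-hot)
```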