Knowing how to collect, store, retrieve and process an ever-growing amount of data is essential to people who want to build ML systems in production.
An ML system can work with data from many different sources. These sources have different characteristics, can be used for different purposes and require different processing methods. Understanding the sources your data comes from can help you use your data more efficiently.
Once you have data, you might want to store it. Since your data comes from multiple sources with different access patterns, storing your data isn’t always straightforward and, in some cases, can be costly. It’s important to think about how the data will be used in the future so that the format you use will make sense. Here are some of the questions you might want to consider:
- How do I store multimodal data, e.g., a sample that might contain both images and texts?
- Where do I store my data so that it’s cheap and still fast to access?
- How do I store complex models so that they can be loaded and run correctly on different hardware?
The process of converting a data structure or object state into a format that can be stored or transmitted and reconstructed later is called data serialisation. There are many data serialisation formats. When considering a format to work with, you might want to consider different characteristics such as human readability, access patterns and whether it’s based on text or binary format, which influences the size of its files.
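As a minimal sketch of the text versus binary distinction, the snippet below serialises the same Python dictionary with `json` (text, human-readable) and `pickle` (binary, Python-specific) and then reconstructs it. The `record` contents are hypothetical values chosen only for illustration.

```python
import json
import pickle

# Hypothetical record used only for illustration.
record = {"make": "Toyota", "model": "Corolla", "year": 2020, "price": 18500.0}

# Text-based serialisation: the result is a str you can read and edit by hand.
as_json = json.dumps(record)

# Binary serialisation: the result is bytes, not meant for human reading.
as_pickle = pickle.dumps(record)

# Both formats can be turned back into an equal Python object.
from_json = json.loads(as_json)
from_pickle = pickle.loads(as_pickle)

print(as_json)
print(type(as_pickle).__name__, len(as_pickle), "bytes")
```

JSON can be read by almost any language, whereas a pickle is tied to Python and should only be loaded from sources you trust, since unpickling can execute arbitrary code.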
Common data formats:

| Format | Binary/Text | Human-readable | Example use cases |
|---|---|---|---|
| Avro | Binary | No | Hadoop |
| CSV | Text | Yes | Everywhere |
| JSON | Text | Yes | Everywhere |
| Parquet | Binary | No | Hadoop, Amazon Redshift |
| Pickle | Binary | No | Python, PyTorch serialisation |
| Protobuf | Binary | No | Google, ONNX, TensorFlow (TFRecord) |
Data models describe how data is represented. Consider cars in the real world. In a database, a car can be described using its make, model, year, colour and price. These attributes make up a data model for cars. Alternatively, you can also describe a car using its owner, license plate and history of registered addresses. This is another data model for cars.
How you choose to represent data not only affects the way your systems are built, but also the problems your systems can solve. For example, the way you represent cars in the first data model makes it easier for people looking to buy cars, whereas the second data model makes it easier for police officers to track down criminals.
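The two car data models can be sketched as two separate record types. The class names, field names and sample values below are assumptions made for illustration; only the attributes themselves come from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CarListing:
    """First data model: what a buyer cares about."""
    make: str
    model: str
    year: int
    colour: str
    price: float


@dataclass
class CarRegistration:
    """Second data model: what someone tracing a vehicle cares about."""
    owner: str
    license_plate: str
    registered_addresses: List[str] = field(default_factory=list)


# Hypothetical sample data.
listing = CarListing("Toyota", "Corolla", 2020, "blue", 18500.0)
registration = CarRegistration(
    "A. Driver", "ABC-123", ["1 Example St", "2 Sample Ave"]
)

# A buyer's question is easy to answer with the first model...
affordable = listing.price <= 20000

# ...while a tracing question is easy to answer with the second.
last_known_address = registration.registered_addresses[-1]
```

Neither model can answer the other's question: the listing has no owner, and the registration has no price.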
Relational models
In this model, data is organised into relations (tables); each relation is a set of tuples (rows).
It’s often desirable for relations to be normalised, that is, split up so that each fact is stored in only one place, which reduces redundancy and makes updates less error-prone. One major downside of normalisation is that your data is now spread across multiple relations. You can join the data from different relations back together, but joining can be expensive for large tables.
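A small sketch of normalisation and joining, using Python's built-in `sqlite3` module with an in-memory database. The table names, columns and rows are hypothetical; the point is that details about a make are stored once in `makes` and recovered with a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalised layout: make details live in one table and are
# referenced by id from the cars table instead of being repeated.
conn.executescript("""
    CREATE TABLE makes (
        make_id INTEGER PRIMARY KEY,
        name    TEXT,
        country TEXT
    );
    CREATE TABLE cars (
        car_id  INTEGER PRIMARY KEY,
        make_id INTEGER REFERENCES makes(make_id),
        model   TEXT,
        year    INTEGER
    );
    INSERT INTO makes VALUES (1, 'Toyota', 'Japan'), (2, 'Ford', 'USA');
    INSERT INTO cars VALUES
        (1, 1, 'Corolla', 2020),
        (2, 1, 'Camry',   2019),
        (3, 2, 'Focus',   2018);
""")

# Joining brings the spread-out data back together.
rows = conn.execute("""
    SELECT cars.model, makes.name, makes.country
    FROM cars
    JOIN makes ON cars.make_id = makes.make_id
    ORDER BY cars.car_id
""").fetchall()
conn.close()

print(rows)
```

If a make's country were ever recorded incorrectly, it would need fixing in a single row of `makes`, at the cost of a join whenever the full picture is needed.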
Databases built around the relational data model are relational databases. Once you’ve put data in your databases, you’ll want a way to retrieve it. The language that you can use to specify the data that you want from a database is called a query language. SQL is the most widely used query language for relational databases.
NoSQL
The relational data model has generalised to a lot of use cases, from e-commerce to finance to social networks. However, for certain use cases, this model can be restrictive. For example, it demands that your data follow a strict schema, and schema management can be painful. It can also be difficult to write and execute SQL queries for specialised applications.
Two major types of non-relational models are the document model and the graph model. The document model targets use cases where data comes in self-contained documents and relationships between one document and another are rare. The graph model goes in the opposite direction, targeting use cases where relationships between data items are common and important.
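To make the contrast concrete, here is a rough sketch using plain Python structures rather than a real document or graph database. The document, the `follows` graph and the `reachable` helper are all hypothetical and exist only for illustration.

```python
from collections import deque

# Document model: everything about one item lives in a single
# self-contained document, so reading it needs no other record.
book_document = {
    "title": "Example Book",
    "author": "A. Writer",
    "chapters": [
        {"number": 1, "title": "Intro"},
        {"number": 2, "title": "Data"},
    ],
}
chapter_titles = [c["title"] for c in book_document["chapters"]]

# Graph model: the relationships (edges) are the main content.
# Here each key follows the users in its list.
follows = {"ana": ["bo"], "bo": ["cy"], "cy": [], "di": ["ana"]}


def reachable(graph, start):
    """Return every node reachable from start by following edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen


print(chapter_titles)
print(sorted(reachable(follows, "ana")))
```

A question like "who can ana reach through any chain of follows?" is a simple traversal in the graph representation, whereas the document representation is most convenient when each lookup only needs one document.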