Beyond a certain point, it may not be feasible to store the entire database on one machine, and we need to partition the data. Sharding is the technique of horizontally partitioning data (splitting it by rows) and distributing it across multiple servers.
How do we shard the data?
- Hash-based sharding
- In this method, we determine the shard ID by applying a hash function to a specific attribute (key) of the entity we’re storing.
- For instance, if we have 10 database servers and numeric IDs that increment with each new record, we can use a hash function such as ‘ID % 10’ to determine which of the ten servers a record should be stored on or retrieved from.
- This approach aims to evenly distribute data across servers.
- However, a significant drawback is that it locks in the number of database servers: adding new servers changes the modulus of the hash function, forcing a redistribution of existing data and causing service downtime. Consistent Hashing can be employed to address this.
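As a sketch, hash-based sharding and its redistribution drawback can be shown in a few lines of Python (the shard counts and key names here are illustrative, not part of any particular system):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard by hashing it and taking the modulus."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# With 10 shards, every key deterministically lands on one of shards 0-9.
shard = shard_for("user:42", 10)

# The drawback: growing from 10 to 11 shards remaps the vast majority of keys,
# since 'hash % 10' and 'hash % 11' rarely agree.
moved = sum(
    1 for i in range(1000)
    if shard_for(f"user:{i}", 10) != shard_for(f"user:{i}", 11)
)
```

Roughly nine out of ten keys change shards after adding just one server, which is why naive modulo hashing forces a large data migration.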
- Geo sharding
- In this approach, we split and store database information according to geographical location. When inserting a new record, we use location as our shard key.
- For example, we might decide that all users residing in the subcontinent will be placed in a shard designated for South Asian countries.
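A minimal geo-sharding router might look like the following; the country codes and shard names are hypothetical, and a real deployment would derive the region from the user's stored profile or IP geolocation:

```python
# Hypothetical mapping from ISO country code to regional shard.
REGION_SHARDS = {
    "IN": "south-asia", "PK": "south-asia", "BD": "south-asia",
    "US": "north-america", "CA": "north-america",
    "DE": "europe", "FR": "europe",
}

def geo_shard(country_code: str) -> str:
    """Route a record to the shard for its geographic region."""
    return REGION_SHARDS.get(country_code, "default")
```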
- Round-robin sharding
- This is a straightforward strategy that ensures an even distribution of data: with ‘n’ shards, the ‘i-th’ tuple is assigned to shard ‘i % n’.
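The i % n assignment above can be sketched directly; the shard sizes it produces never differ by more than one tuple:

```python
from collections import Counter

def round_robin_assignments(num_tuples: int, num_shards: int) -> list:
    """Assign the i-th tuple to shard i % n."""
    return [i % num_shards for i in range(num_tuples)]

assignments = round_robin_assignments(7, 3)   # [0, 1, 2, 0, 1, 2, 0]
counts = Counter(assignments)                 # shard sizes differ by at most one
```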
- Range-based sharding
- It splits database rows based on ranges of the shard key’s value, with each range of values mapped to a particular shard.
- For example, we may partition the data by the first letter of the customer’s name: all customers whose names start with A go to shard A, and so on.
- It is easy to implement.
- Depending on how the values are distributed, range-based sharding can overload a single physical node. In the example above, if many customers have names starting with A, shard A becomes a hotspot.
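A range-based router is essentially a sorted list of boundaries plus a binary search; the boundary letters and shard names below are illustrative:

```python
import bisect

# Hypothetical range boundaries on the first letter of the customer's name:
# A-F -> shard-1, G-M -> shard-2, N-S -> shard-3, T-Z -> shard-4.
BOUNDS = ["G", "N", "T"]
SHARDS = ["shard-1", "shard-2", "shard-3", "shard-4"]

def range_shard(name: str) -> str:
    """Binary-search the boundaries to find the range a name falls into."""
    return SHARDS[bisect.bisect_right(BOUNDS, name[0].upper())]
```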
- Directory sharding
- Uses a lookup table to match database information to the corresponding physical shard. The lookup table works like a spreadsheet that maps each value of a database column (the shard key) to a shard.
- It is flexible. Each shard is a meaningful representation of the database and is not limited by ranges.
- Directory sharding is heavily dependent on the lookup table, which becomes a single point of failure: if the table is lost or corrupted, records can no longer be routed to their shards.
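In its simplest form the lookup table is just a dictionary consulted on every request; the shard keys and shard names here are hypothetical:

```python
# Hypothetical lookup table mapping a shard key (here, a customer tier)
# to the physical shard that holds those rows.
LOOKUP_TABLE = {
    "free": "shard-a",
    "pro": "shard-b",
    "enterprise": "shard-c",
}

def directory_shard(shard_key: str) -> str:
    """Resolve the physical shard via the lookup table.

    Every query depends on this table, which is what makes it
    a single point of failure.
    """
    return LOOKUP_TABLE[shard_key]
```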
- Composite sharding
- Under this scheme, we combine elements of the aforementioned sharding methods to create a new approach.
- For example, we might first employ geo sharding and then apply hash-based sharding.
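The geo-then-hash example can be sketched as a two-step lookup; the regions and per-region shard counts are illustrative:

```python
import hashlib

# Hypothetical number of shards provisioned per region.
REGION_SHARD_COUNTS = {"south-asia": 4, "europe": 2}

def composite_shard(region: str, user_id: int) -> str:
    """Geo sharding first (pick the region), then hash-based sharding within it."""
    n = REGION_SHARD_COUNTS.get(region, 1)
    idx = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16) % n
    return f"{region}-{idx}"
```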
Common challenges with sharding:
- Rebalancing data
- Some of the shards may become unbalanced due to the uneven distribution of data.
- There could be a lot of load on a particular shard. For example, there are too many requests being handled by the DB shard dedicated to California.
- To fix this, we may need to rebalance the data in the existing shards or add new shards. This would mean that the original sharding logic needs to be updated, which would likely lead to some downtime.
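Consistent Hashing, mentioned earlier as the remedy for redistribution, limits how much data rebalancing moves. A minimal ring sketch (shard names are illustrative, and production implementations add virtual nodes for better balance) shows that adding a shard only reassigns the keys the new shard takes over:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: each shard owns the arc before its point."""

    def __init__(self, shards):
        self.ring = sorted((self._hash(s), s) for s in shards)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def shard_for(self, key: str) -> str:
        """Walk clockwise from the key's hash to the next shard point."""
        hashes = [h for h, _ in self.ring]
        idx = bisect.bisect(hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["shard-0", "shard-1", "shard-2"])
bigger = ConsistentHashRing(["shard-0", "shard-1", "shard-2", "shard-3"])
```

Unlike plain modulo hashing, every key in the bigger ring either stays on its original shard or moves to the newly added one; no key shuffles between the old shards.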
- Complexity
- Database sharding creates operational complexity. Instead of managing a single database, developers have to manage multiple database nodes. To retrieve information, they must query several shards and combine the results, which can complicate analytics.
- Most database management systems do not have built-in sharding features. This means that database designers and software developers must manually split, distribute, and manage the database.
- Infrastructure costs
- Adding more machines and maintaining them becomes increasingly expensive as we scale an on-premises data centre.