Database Sharding

Company

PlanetScale

Date Published

Jan. 9, 2025

Author

Word count

3191

Language

English

Hacker News points

URL

planetscale.com/blog/database-sharding

Summary

Sharding is the process of scaling a database by spreading its data across multiple servers, called shards. It lets large organizations manage petabyte-scale data using popular technologies such as MySQL and Vitess. A sharded database cluster consists of multiple separate database servers, each holding a portion of the total data. Without help, the code running on the application server has to be aware of every shard and keep a connection open to each, which becomes complex when there are hundreds of them. An intermediary server called a proxy helps manage this complexity. Sharding strategies include range sharding, hash sharding, and custom sharding functions. Range sharding involves choosing a column to be the shard key and assigning each shard a contiguous range of that key's values. Hash sharding, which is also popular, instead generates a cryptographic hash of the shard key for each row being inserted and uses the hash to pick the shard, which tends to distribute data evenly. The shard key column needs careful selection, though: a volatile value, such as a user's step count that changes over time, is a poor choice because every change can move the row to a different shard. Cross-shard queries occur when more than one shard is needed to fulfill a single query, which adds network and CPU overhead. Sharded clusters typically also replicate each shard's data across servers, which improves durability and availability, and backups take less time when the data is spread across multiple shards.
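The two main strategies named in the summary can be illustrated with a minimal Python sketch. It is not taken from the article or from Vitess: the shard count, the range boundaries, the choice of SHA-256, and the function names `range_shard`, `hash_shard`, and `shards_touched` are all assumptions made for illustration.

```python
import hashlib
from bisect import bisect_right

NUM_SHARDS = 4  # assumed cluster size, for illustration only

# Assumed range boundaries: shard 0 holds keys < 25, shard 1 holds 25..49,
# shard 2 holds 50..74, and shard 3 holds everything >= 75.
RANGE_BOUNDS = [25, 50, 75]


def range_shard(key, bounds=RANGE_BOUNDS):
    """Range sharding: pick the shard whose key range contains `key`."""
    return bisect_right(bounds, key)


def hash_shard(key, num_shards=NUM_SHARDS):
    """Hash sharding: hash the shard key, then map the hash to a shard."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


def shards_touched(keys, shard_fn):
    """Return the set of shards a query over `keys` would have to contact."""
    return {shard_fn(k) for k in keys}


if __name__ == "__main__":
    # A lookup by a single shard key value touches exactly one shard.
    print(shards_touched([42], hash_shard))
    # A query spanning many keys usually becomes a cross-shard query.
    print(shards_touched(range(100), hash_shard))
    # A volatile key moves rows: a step count rising from 20 to 30
    # crosses a range boundary, so the row changes shards.
    print(range_shard(20), range_shard(30))
```

In this sketch, sequential IDs all land on the same shard under range sharding until a boundary is crossed, while hashing scatters them across shards. That is why hashing tends to balance load, at the cost of turning range scans into cross-shard queries.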