Company
Date Published
June 10, 2021
Author
Michael Troute
Word count
2070
Language
English
Hacker News points
137

Summary

Amazon Aurora MySQL is a primary relational database for Plaid. In 2019, they noticed performance degradation on one of their Aurora clusters due to an errant long-running transaction on a read replica. This issue seemed to violate the assumption that read replica load should not impact writer performance. Digging deeper into the design and documentation of Aurora DB, Plaid identified the cause of the degradation and best practices to avoid it. In vanilla MySQL, transactions are isolated using multiversion concurrency control (MVCC), which allows each operation to behave as if it is the only operation running on the database. This is achieved through undo logs that can recreate prior states of rows by undoing updates. Long-running transactions can cause performance degradation for concurrent transactions, especially when querying heavily indexed tables with frequent changes. Aurora MySQL differs from vanilla MySQL in its persistence and replication models. In Aurora, compute (query execution) and storage (durably persisting data to disk) happen on different instances. Persistence and replication are both accomplished using the redo log. Reader and writer instances share a set of undo logs, which can lead to performance degradation when long-running read replica transactions occur. Mitigation strategies for this issue include reducing transaction isolation levels for replica instances, monitoring rollback segment history list length, performing potentially long-running reader queries on binlog replicas, exporting snapshots to S3 as Apache Parquet files, and using a clone to handle long-running reads.