Improving database resilience with observability and chaos testing

Company

New Relic

Date Published

March 21, 2024

Author

Bryant Vinisky, Lead Software Engineer

Word count

1918

Language

English

Hacker News points

None

URL

newrelic.com/blog/how-to-relic/improving-database-resilience-with-observability-and-chaos-testing

Summary

This article explains how chaos engineering helps build resilient systems, focusing on distributed databases such as Amazon Aurora. It lists the benefits of chaos testing at the database layer: validating failover and application robustness, mitigating potential outages and protecting data, confirming that observability and alerting work, deepening system understanding and improving documentation, and optimizing capacity and performance. It then walks through setting up observability with New Relic infrastructure agents, APM, and CloudWatch Metric Streams, executing an Aurora failover, and identifying and troubleshooting driver issues. It also covers proper driver configuration and best practices for handling database connection failures during failovers. The article concludes by stressing that chaos experiments depend on effective observability and monitoring, and points to further reading on Aurora reliability best practices and chaos engineering.
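
The summary mentions handling database connection failures during failovers. As a rough illustration only, the sketch below retries an operation with exponential backoff when the connection drops. It is not taken from the article: the function name `run_with_reconnect`, its parameters, and the use of Python's built-in `ConnectionError` are assumptions made for this example. Real database drivers raise their own exception types and often need a fresh connection before a retry.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_reconnect(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on ConnectionError with exponential backoff.

    Hypothetical helper: `operation` stands in for any call that opens a
    connection and runs a query. A real driver would raise its own error type.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            # Give up after the final attempt and surface the error.
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt,
            # giving the failover time to promote a new writer.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be checked in a test without waiting. In a chaos experiment, the retry count and total delay would be tuned against the failover duration actually observed.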