Engineering a fault tolerant distributed system
The text discusses the design and engineering of fault-tolerant systems that can detect and remediate failures at scale. It defines dependability as a combination of availability and reliability: availability means the service is available for use when required, and reliability means the service works as expected. A key aspect of fault tolerance is redundancy, which means provisioning more capacity than is required to deliver the service. The text also covers stateless and stateful services, architectural approaches to achieving reliability, consensus formation in globally distributed systems, the idea that health is not binary, and the impacts of resource availability and resource scalability on fault tolerance. It concludes that fault tolerance is an approach to building systems that can withstand and mitigate adverse events and operating conditions so that they dependably continue to deliver the level of service users expect.
Company
Ably
Date published
Feb. 15, 2021
Author(s)
Paddy Byers
Word count
3669
Hacker News points
None found.
Language
English