Company
Date Published
Author
Neil Manvar
Word count
1236
Language
English
Hacker News points
None

Summary

Uptime is defined as the time when an application or service is operational, with operational meaning varying across different contexts. Measuring uptime typically uses nines, where each nine represents a level of reliability, with one nine equating to 36.5 days of downtime per year and five nines representing less than 6 minutes of downtime per year. Achieving higher nines requires more validation and automation, as well as a carefully cultivated engineering culture that encourages high quality and accountability. Balancing error prevention and awareness is crucial for improving service uptime, with code validation and checks contributing to cleaner code that prevents errors in production. Application monitoring provides real-time insights into a service's health, alerting developer teams to issues immediately and minimizing the mean time to recovery.