Mean Time To Recovery (MTTR) is a key metric for how well an organization handles system outages, where the goal is to minimize user impact and resolve issues quickly. A problem cannot be resolved until it has been detected, and the time that takes is measured by the Mean Time To Discovery (MTTD). Alerts are crucial for detecting problems, and services such as Lambda, SNS, and SQS report important metrics to CloudWatch out of the box. For example, alerts should be set up for regional concurrency, throttles, error rate, dead letter errors, destination delivery failures, iterator age, and other relevant metrics. Third-party tools like Lumigo can also add value by providing built-in alerts with sensible defaults and by integrating with popular messaging platforms, so that alerts arrive through the team's preferred channels. API Gateway reports per-method metrics only when detailed CloudWatch metrics are enabled. Latency alerts should use percentiles instead of averages and should measure latency as close to the caller as possible. For SQS, alerts should be set up against the ApproximateAgeOfOldestMessage metric. For Step Functions, alerts should be set up against the ExecutionThrottled, ExecutionsAborted, ExecutionsFailed, and ExecutionsTimedOut metrics. An open-source project called cloudwatch-alarms-macro can codify these settings: users define organization-wide defaults, and alerts are generated for the resources in their CloudFormation stack.
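To make the recommendations above concrete, here is a minimal sketch of how a few of these alerts could be described as parameter dictionaries in the shape that the CloudWatch PutMetricAlarm API accepts. The helper names (`sqs_age_alarm`, `api_latency_alarm`, `step_function_alarms`), the alarm naming scheme, and every threshold and period are illustrative assumptions, not values taken from the article or from cloudwatch-alarms-macro. The sketch only builds the dictionaries and does not call AWS.

```python
from typing import Any, Dict, List


def sqs_age_alarm(queue_name: str, max_age_seconds: int = 300) -> Dict[str, Any]:
    """Alarm when the oldest message in a queue exceeds max_age_seconds.

    The 300-second default is an assumed placeholder; tune it per queue.
    """
    return {
        "AlarmName": f"{queue_name}-oldest-message-age",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateAgeOfOldestMessage",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": float(max_age_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def api_latency_alarm(
    api_name: str,
    stage: str,
    threshold_ms: float = 1000.0,
    percentile: str = "p99",
) -> Dict[str, Any]:
    """Alarm on a latency percentile rather than the average.

    A percentile goes in ExtendedStatistic; Statistic is left out because
    an alarm specifies one or the other, not both.
    """
    return {
        "AlarmName": f"{api_name}-{stage}-latency-{percentile}",
        "Namespace": "AWS/ApiGateway",
        "MetricName": "Latency",
        "Dimensions": [
            {"Name": "ApiName", "Value": api_name},
            {"Name": "Stage", "Value": stage},
        ],
        "ExtendedStatistic": percentile,
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# The four Step Functions metrics named in the text.
STEP_FUNCTION_METRICS = [
    "ExecutionThrottled",
    "ExecutionsAborted",
    "ExecutionsFailed",
    "ExecutionsTimedOut",
]


def step_function_alarms(state_machine_arn: str) -> List[Dict[str, Any]]:
    """One alarm per metric, firing on any occurrence within a minute."""
    # Use the last ARN segment (the state machine name) in the alarm name.
    name = state_machine_arn.split(":")[-1]
    return [
        {
            "AlarmName": f"{name}-{metric}",
            "Namespace": "AWS/States",
            "MetricName": metric,
            "Dimensions": [
                {"Name": "StateMachineArn", "Value": state_machine_arn}
            ],
            "Statistic": "Sum",
            "Period": 60,
            "EvaluationPeriods": 1,
            "Threshold": 0.0,
            "ComparisonOperator": "GreaterThanThreshold",
            "TreatMissingData": "notBreaching",
        }
        for metric in STEP_FUNCTION_METRICS
    ]
```

Assuming boto3 is installed and AWS credentials are configured, each dictionary could then be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)`. Keeping the definitions as plain data is one way to approximate the "organization defaults" idea: the defaults live in one place and can be reviewed or overridden before anything is created.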