Incident report for August 16, 2024 - Function execution outage

What's this blog post about?

On August 16, 2024, Inngest experienced an outage during which functions could not execute because the SDK Gateway was unable to handle requests. A fix was applied on each instance to bring it back online and prevent future failures of this type. The root cause was traced to the SDK Gateway service's logs: under increased load, and with logrotate running too infrequently, the logs filled the available disk space. Resolving the issue involved manually clearing disk space, rotating internal service discovery IPs, scaling down executor services, issuing new static IP addresses, and adding SDK Gateways for extra capacity and redundancy. The impact was 46 minutes of function execution downtime and 4 hours and 54 minutes of degraded performance on Vercel. Corrective actions included improving alerting via Datadog, improving log rotation in the instance AMI, increasing the number of SDK Gateway replicas, staggering service rotations, rehearsing SDK Gateway recovery with engineers, and purchasing a /24 block of public IP addresses for future use.
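
A disk that fills up with logs, as in the root cause summarized above, is the kind of condition a simple threshold check can surface before it becomes an outage. The following is a minimal sketch in Python using only the standard library. It is not Inngest's tooling or their Datadog configuration; the function names and the 80% and 90% thresholds are assumptions chosen for illustration.

```python
import shutil


def disk_usage_fraction(path: str = "/") -> float:
    """Return the fraction (0.0 to 1.0) of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def classify_usage(fraction: float, warn_at: float = 0.80, critical_at: float = 0.90) -> str:
    """Map a usage fraction to a severity label.

    The default thresholds are illustrative assumptions, not values from the report.
    """
    if fraction >= critical_at:
        return "critical"
    if fraction >= warn_at:
        return "warn"
    return "ok"


if __name__ == "__main__":
    fraction = disk_usage_fraction("/")
    print(f"{fraction:.1%} used: {classify_usage(fraction)}")
```

Run periodically (for example from cron) and wired to a paging system, a check like this would flag a filling disk well before services fail. In practice a monitoring agent usually provides the same signal without custom code.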

Company
Inngest

Date published
Aug. 16, 2024

Author(s)
Dan Farrelly

Word count
1078

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.