Incident report for August 16, 2024 - Function execution outage
On August 16, 2024, Inngest experienced an outage in which functions could not execute because the SDK Gateway was unable to handle requests. Each affected instance was repaired to bring it back online and to prevent failures of this type from recurring. The root cause was traced to the SDK Gateway service's log, which exhausted disk space because of increased load and infrequent logrotate runs. Resolving the issue involved manually clearing disk space, rotating internal service discovery IPs, scaling down executor services, issuing new static IP addresses, and adding SDK Gateways for extra capacity and redundancy. The impact was 46 minutes of function execution downtime and 4 hours and 54 minutes of degraded performance on Vercel. Corrective actions included improving alerting via Datadog, enhancing log rotation in the instance AMI, increasing SDK Gateway replicas, staggering service rotations, rehearsing SDK Gateway recovery with engineers, and purchasing a /24 block of public IP addresses for future use.
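To illustrate the kind of disk-space alerting the corrective actions point toward, here is a minimal sketch in Python. It is not Inngest's Datadog monitor. The function names and the 80% and 90% thresholds are assumptions chosen for the example. It only shows how disk usage can be read and classified before a log fills the volume.

```python
import shutil


def disk_usage_fraction(path: str = "/") -> float:
    """Return the fraction (0.0 to 1.0) of the volume holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def classify_disk_usage(
    fraction: float,
    warn_at: float = 0.80,      # hypothetical warning threshold
    critical_at: float = 0.90,  # hypothetical critical threshold
) -> str:
    """Map a usage fraction to an alert level: "ok", "warning", or "critical"."""
    if fraction >= critical_at:
        return "critical"
    if fraction >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    fraction = disk_usage_fraction("/")
    print(f"/ is {fraction:.0%} full: {classify_disk_usage(fraction)}")
```

A check like this would typically run on a schedule on each instance, with the "warning" level leaving enough headroom to rotate or truncate logs before requests start failing.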
Company
Inngest
Date published
Aug. 16, 2024
Author(s)
Dan Farrelly
Word count
1078
Language
English
Hacker News points
None found.