October 2024 queue retrospective
In October 2024, Inngest made multiple infrastructure upgrades to its queueing system. Several of these changes led to incidents that degraded performance but caused no data loss. The incidents were: degraded performance on October 4th, when a new design for individual queue processing became less efficient under high load with extreme custom key cardinality; execution and cron delays on October 9th, following an improvement and refactor of the cron service; function run status update issues on October 11th, caused by unreliable data publishing within NATS; delayed execution for some function runs on October 18th, caused by increased CPU load on the primary queue shard; sporadic delayed event matching on October 22nd; processing issues with step.waitForEvent and cancelOn operations on October 24th, caused by unoptimized expressions being processed at high volumes; and degraded UI and database performance on October 30th, caused by a disk issue in the Postgres RDS instance. Following these incidents, improvements were made in several areas: increasing capacity for event matching, optimizing expression handling, adjusting worker configurations, implementing dynamic scaling based on resource utilization metrics, moving reads to a read replica, improving local caching for performance, and applying maintenance to the primary Postgres instance.
Company
Inngest
Date published
Nov. 7, 2024
Author(s)
-
Word count
1118
Language
English
Hacker News points
None found.