Cloudflare's 1.1.1.1 DNS resolver service experienced an outage caused by a parsing error: the resolver could not load the new root zone file once it began including the ZONEMD record, a record type used to verify the integrity and authenticity of zone data. Because the new file failed to load, the resolver continued to use an older copy of the root zone, and failures began once the DNSSEC signatures in that copy expired. The incident affected approximately 2% of all DNS queries handled by Cloudflare during that period. It was resolved by disabling the static_zone feature in the resolver.
Recommendations:
- Regularly test and update the libraries that critical systems depend on, so that they can handle changes in input formats such as new DNS record types.
- Implement a mechanism to detect when stale data is being served, especially for critical systems like DNS.
- Regularly review existing architectures, processes, and test coverage to identify potential vulnerabilities and areas for improvement.
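
The stale-data recommendation can be illustrated with a small freshness check. The sketch below is not Cloudflare's implementation. It assumes the locally cached zone is available as plain text with one record per line, and that RRSIG expiration times use the `YYYYMMDDHHmmSS` presentation format. The function names, the two-day warning margin, and the sample records are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: RRSIG times appear in the YYYYMMDDHHmmSS presentation format.
RRSIG_TIME_FORMAT = "%Y%m%d%H%M%S"

# Illustrative records only, not real root zone data.
SAMPLE_ZONE = [
    "; hypothetical excerpt of a cached zone",
    ". 86400 IN SOA a.example. b.example. 2023092100 1800 900 604800 86400",
    ". 86400 IN RRSIG SOA 8 0 86400 20231004050000 20230921040000 12345 . c2ln",
    ". 518400 IN RRSIG NS 8 0 518400 20231005050000 20230921040000 12345 . c2ln",
]


def rrsig_expirations(zone_lines):
    """Yield the expiration time of every RRSIG record found in the lines."""
    for line in zone_lines:
        if line.lstrip().startswith(";"):
            continue  # skip comment lines
        fields = line.split()
        if "RRSIG" not in fields:
            continue
        idx = fields.index("RRSIG")
        # RDATA order: type covered, algorithm, labels, original TTL, expiration, ...
        if len(fields) <= idx + 5:
            continue  # malformed or truncated record; ignore in this sketch
        expiry = datetime.strptime(fields[idx + 5], RRSIG_TIME_FORMAT)
        yield expiry.replace(tzinfo=timezone.utc)


def zone_freshness(zone_lines, now, warn_margin=timedelta(days=2)):
    """Classify a cached zone as 'fresh', 'expiring', 'expired', or 'unknown'."""
    expirations = list(rrsig_expirations(zone_lines))
    if not expirations:
        return "unknown"  # no signatures found, so freshness cannot be judged
    earliest = min(expirations)
    if earliest <= now:
        return "expired"
    if earliest - now <= warn_margin:
        return "expiring"
    return "fresh"


if __name__ == "__main__":
    print(zone_freshness(SAMPLE_ZONE, datetime.now(timezone.utc)))
```

A monitoring job could run a check like this against the cached copy and raise an alert on `expiring`, which leaves time to investigate why a newer zone has not been loaded before any signature actually lapses.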