
Secondary DNS - Deep Dive

What's this blog post about?

Secondary DNS is a key part of keeping the Domain Name System (DNS) stable and reliable. Secondary servers hold a synchronized copy of the zones served by a primary DNS server and keep answering queries when the primary cannot, so users can still reach websites during primary server downtime. Zones are transferred in one direction only, from the primary to the secondary server(s), and each zone update increases the Start of Authority (SOA) serial number. The SOA record is central to Secondary DNS: it carries the serial number, refresh rate, retry interval, expiration time, and minimum TTL, which primary and secondary servers use to keep zones synchronized.

Zone transfers between primary and secondary servers use two standard protocols, Authoritative Zone Transfer (AXFR) and Incremental Zone Transfer (IXFR). AXFR runs over a TCP connection and transfers the entire zone in one connection, while IXFR sends only the records that have changed since the version of the zone the secondary currently holds.

At Cloudflare, Secondary DNS is implemented as a set of microservices running on Kubernetes for better scalability and security. Migrating from Marathon-based services to Kubernetes raised challenges such as building a distributed, reliable system while protecting individual data centers from denial-of-service attacks. To address these, Cloudflare proxied its egress traffic through Shadowsocks and used Spectrum to reverse-proxy TCP/UDP traffic, filter out malicious traffic, and ensure optimal routing. Performance mattered as well: the Secondary DNS pipeline runs through several stages (primary server updates, zone transfers, zone building, and propagation to the edge), and the end-to-end latency across these stages is less than 5 seconds on average, so zone changes on the primary reach the edge quickly.
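
The role of the SOA fields in synchronization can be illustrated with a minimal Python sketch. It is not Cloudflare's implementation: the names `SOA`, `serial_is_newer`, and `secondary_action` are hypothetical, and the logic assumes the conventional behavior in which a secondary compares serial numbers using the wrap-around arithmetic of RFC 1982 and gives up on a zone once the expiration time has passed without contact with the primary.

```python
from dataclasses import dataclass
from typing import Optional

SERIAL_MODULUS = 1 << 32   # SOA serials are unsigned 32-bit integers
HALF_RANGE = 1 << 31


@dataclass(frozen=True)
class SOA:
    """The SOA timing fields named in the summary (all intervals in seconds)."""
    serial: int
    refresh: int   # how often the secondary checks the primary
    retry: int     # how long to wait after a failed check
    expire: int    # how long the zone stays valid without contact
    minimum: int   # minimum TTL


def serial_is_newer(candidate: int, current: int) -> bool:
    """Return True if `candidate` is newer than `current`.

    Uses wrap-around comparison (RFC 1982), so a serial that rolled over
    from 4294967295 to a small number still counts as newer.
    """
    diff = (candidate - current) % SERIAL_MODULUS
    return 0 < diff < HALF_RANGE


def secondary_action(local: SOA,
                     primary_serial: Optional[int],
                     seconds_since_sync: int) -> str:
    """Decide what a secondary does after polling the primary's SOA.

    `primary_serial` is None when the primary did not answer (assumption:
    the caller has already performed the SOA query).
    """
    if primary_serial is None:
        # Primary unreachable: keep serving until the expire timer runs out.
        if seconds_since_sync >= local.expire:
            return "expire"
        return "retry"
    if serial_is_newer(primary_serial, local.serial):
        # A transfer is needed; IXFR or AXFR is chosen elsewhere.
        return "transfer"
    return "up_to_date"


if __name__ == "__main__":
    zone = SOA(serial=2020091501, refresh=3600, retry=600,
               expire=604800, minimum=300)
    print(secondary_action(zone, 2020091502, 120))
    print(secondary_action(zone, None, 120))
```

In this sketch a `"transfer"` result would be followed by an IXFR request when the secondary already holds a version of the zone, falling back to AXFR otherwise; that choice is left out to keep the example short.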

Company
Cloudflare

Date published
Sept. 15, 2020

Author(s)
Alex Fattouche

Word count
2993

Hacker News points
4

Language
English


By Matt Makai. 2021-2024.