Cloudflare has revealed how it maintains the millions of servers it operates around the world. In a Tuesday post titled “Autonomous hardware diagnostics and recovery at scale”, the company explains that it has built fault-tolerant infrastructure that can continue to operate with "little to no impact" to its services.
But as explained by infrastructure engineering tech lead Jet Mariscal and systems engineers Aakash Shah and Yilin Xiong, when servers did go down, the Data Center team relied on manual processes to identify the dead boxes. These processes could take "hours for just one server and could easily eat up an engineer's entire day."
Of course, that approach cannot work at hyperscale. Dead servers would sometimes be left powered on, costing Cloudflare extra money without producing anything useful.
That's where Phoenix comes in – a Cloudflare tool built to detect broken servers and automatically launch the workflows needed to fix them.
Phoenix performs a “discovery” run every thirty minutes, during which it probes up to two data centers that are known to host broken boxes. This steady cadence lets Phoenix work through Cloudflare's network and find dead servers without waiting for an engineer to spot them. If it detects machines that are already listed for repairs, it "takes care to ensure that the recovery phase is performed immediately."
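
To make that discovery pass concrete, here is a minimal Python sketch of what one run might look like. It is not Cloudflare's code: the `Server` record, the `listed_for_repair` flag, the function names, and the rule of visiting the data centers with the most broken machines first are all assumptions made for illustration. Only the thirty-minute interval and the two-data-center cap come from the post.

```python
from dataclasses import dataclass

DISCOVERY_INTERVAL_MINUTES = 30  # cadence stated in the post
MAX_DATACENTERS_PER_RUN = 2      # cap stated in the post


@dataclass(frozen=True)
class Server:
    """Hypothetical inventory record for a server reported as broken."""
    name: str
    datacenter: str
    listed_for_repair: bool  # assumed flag: already queued for repair?


def pick_datacenters(broken, limit=MAX_DATACENTERS_PER_RUN):
    """Choose up to `limit` data centers that host broken servers.

    Assumption: sites with the most broken machines go first, with ties
    broken alphabetically. The post does not say how Phoenix chooses.
    """
    counts = {}
    for server in broken:
        counts[server.datacenter] = counts.get(server.datacenter, 0) + 1
    ranked = sorted(counts, key=lambda dc: (-counts[dc], dc))
    return ranked[:limit]


def discovery_run(broken):
    """One discovery pass over the selected data centers.

    Returns (recover_now, needs_diagnosis): servers already listed for
    repair go straight to recovery, the rest are queued for diagnosis.
    Servers in data centers not selected wait for a later run.
    """
    targets = set(pick_datacenters(broken))
    recover_now, needs_diagnosis = [], []
    for server in broken:
        if server.datacenter not in targets:
            continue
        if server.listed_for_repair:
            recover_now.append(server.name)
        else:
            needs_diagnosis.append(server.name)
    return recover_now, needs_diagnosis
```

In a real deployment a scheduler would call `discovery_run` once every `DISCOVERY_INTERVAL_MINUTES`. Because each run is capped at two data centers, the time needed to cover the whole fleet grows with the number of sites hosting broken machines.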