Cloudflare has revealed how it maintains the millions of servers it operates around the world. In a Tuesday post titled "Autonomous hardware diagnostics and recovery at scale", the company explains that it has built fault-tolerant infrastructure that can continue to operate with "little to no impact" to its services.
But as explained by infrastructure engineering tech lead Jet Marsical and systems engineers Aakash Shah and Yilin Xiong, when servers did go down, the Data Center team relied on manual procedures to identify dead boxes. These processes could take "hours for just one server and could easily eat up an engineer's entire day."
Of course, that is not an approach that can work at hyperscale. Dead servers would sometimes stay powered on, costing Cloudflare extra money without producing anything useful.
That's where Phoenix comes in – a Cloudflare tool built to detect broken servers and automatically launch the workflows needed to fix them.
Phoenix performs a "discovery" run every thirty minutes, during which it probes up to two data centers known to host broken boxes. This discovery rate means that Phoenix can find dead servers on Cloudflare's network almost instantly. If it detects machines that are already listed for repairs, it "takes care to ensure that the recovery phase is performed immediately."
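To make the cadence concrete, here is a minimal Python sketch of a discovery loop of that shape. It is an illustration only, not Cloudflare's code: the `Server` and `DataCenter` types, the `discovery_run` function, the round-robin ordering of data centers, and the handling of newly found servers are all assumptions. Only the thirty-minute interval, the two-data-center limit, and the immediate recovery of already-listed machines come from the description above.

```python
from dataclasses import dataclass, field

DISCOVERY_INTERVAL_MINUTES = 30  # cadence described in the article
MAX_DATA_CENTERS_PER_RUN = 2     # limit described in the article


@dataclass
class Server:  # hypothetical type, for illustration only
    name: str
    healthy: bool
    listed_for_repair: bool = False


@dataclass
class DataCenter:  # hypothetical type, for illustration only
    name: str
    servers: list[Server] = field(default_factory=list)


def discovery_run(queue: list[DataCenter]) -> tuple[list[str], list[str]]:
    """Probe the first MAX_DATA_CENTERS_PER_RUN data centers in the queue.

    Returns (recover_now, newly_found): broken servers that were already
    listed for repair, and broken servers seen for the first time.
    Probed data centers move to the back of the queue (assumed round-robin).
    """
    batch = queue[:MAX_DATA_CENTERS_PER_RUN]
    del queue[:MAX_DATA_CENTERS_PER_RUN]

    recover_now: list[str] = []
    newly_found: list[str] = []
    for dc in batch:
        for server in dc.servers:
            if server.healthy:
                continue
            if server.listed_for_repair:
                # Already known to be broken: hand straight to recovery.
                recover_now.append(server.name)
            else:
                # Assumption: a newly found server is listed for repair.
                server.listed_for_repair = True
                newly_found.append(server.name)

    queue.extend(batch)
    return recover_now, newly_found


def data_centers_probed_per_day() -> int:
    """Upper bound on data centers probed in 24 hours at this cadence."""
    runs_per_day = (24 * 60) // DISCOVERY_INTERVAL_MINUTES
    return runs_per_day * MAX_DATA_CENTERS_PER_RUN
```

At one run every thirty minutes and at most two data centers per run, a loop like this would probe at most 48 × 2 = 96 data centers in a day. A scheduler would call `discovery_run` once per interval and pass the `recover_now` list to whatever recovery workflow is in place.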