A major Cloudflare outage late Wednesday was caused by a technician unplugging the cables in a patch panel that provided “all external connectivity to other Cloudflare data centers” while decommissioning hardware in an unused rack.
While core services such as the Cloudflare network and the company’s security services kept running, for over four hours the error left customers unable to “create or update” Cloudflare Workers (the company’s serverless computing platform), log into their dashboard, use the API, or make configuration changes such as updating DNS records.
CEO Matthew Prince described the series of errors as “painful” and admitted it should “never have happened”. (The company is well known and generally appreciated for providing sometimes wince-inducingly frank post-mortems of issues.)
This was painful today. Never should have happened. Great to already see the work to ensure it never will again. We make mistakes — which kills me — but proud we rarely make them twice. https://t.co/pwxbk5plyb
— Matthew Prince 🌥 (@eastdakota) April 16, 2020
Cloudflare CTO John Graham-Cumming admitted to fairly substantial design, documentation and process failures, in a report that may worry customers.
He wrote: “While the external connectivity used diverse providers and led to diverse data centers, we had all the connections going through only one patch panel, creating a single physical point of failure”. He also acknowledged that poor cable labelling had slowed the fix, adding: “we should take steps to ensure the various cables and panels are labeled for quick identification by anyone working to remediate the problem. This should expedite our ability to access the needed documentation.”
How did it happen to start with? “While sending our technicians instructions to retire hardware, we should call out clearly the cabling that should not be touched…”
Cloudflare is not alone in suffering recent data centre borkage.
Google Cloud recently admitted that “evidence of packet loss, isolated to a single rack of machines” initially seemed to be a mystery, with technicians uncovering “kernel messages in the GFE machines’ base system log” that indicated strange CPU throttling.
A closer physical investigation revealed the answer: the plastic casters at the rear of the rack had failed, and the machines were “overheating as a consequence of being tilted”.