
Cloudflare says botched software rollout caused global web outages

A server outage estimated to have taken down ten per cent of the internet earlier this week was caused by a botched software update, rather than a denial of service attack.

Many observers had assumed Tuesday’s Cloudflare outage was the result of a DoS attack, with some even speculating it could have been launched by the Chinese government in a bid to target protesters in Hong Kong.

But in a statement issued online by Cloudflare, a networking and security company, chief technology officer John Graham-Cumming put paid to the speculation.

“This was not an attack (as some have speculated) and we are incredibly sorry that this incident occurred,” he wrote. “Internal teams are meeting as I write performing a full post-mortem to understand how this occurred and how we prevent this from ever occurring again.”

The outage began at 14:42 BST on 2 July and lasted for less than an hour. But in that short time, millions of internet users around the world were greeted with 502 “Bad Gateway” errors as they tried to access websites.

The outage followed a massive spike in CPU utilisation across Cloudflare’s network caused by “a bad software deploy”, specifically a “single misconfigured rule within the Cloudflare Web Application Firewall (WAF) during a routine deployment of new [managed rules],” according to the company.

Graham-Cumming explained: “The intent of these new rules was to improve the blocking of inline JavaScript that is used in attacks. These rules were being deployed in a simulated mode where issues are identified and logged by the new rule but no customer traffic is actually blocked so that we can measure false positive rates and ensure that the new rules do not cause problems when they are deployed into full production.

“Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82%.”
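To illustrate how a single regular expression can exhaust a CPU, the sketch below counts the steps a naive backtracking matcher takes on a deliberately pathological pattern. It is a hypothetical example only: the pattern `(a+)+$`, the function `count_backtracking_steps` and the test inputs are assumptions chosen for illustration, not the rule Cloudflare deployed, which the statement quoted here does not reproduce.

```python
def count_backtracking_steps(text):
    """Match `text` against the illustrative pattern (a+)+$ by naive backtracking.

    Returns (matched, steps), where steps counts every position the
    matcher attempts. The pattern is an assumption for demonstration,
    not the rule from the Cloudflare incident.
    """
    steps = 0

    def match_groups(pos, groups_done):
        nonlocal steps
        steps += 1
        # Success: at least one group has matched and the input is exhausted.
        if groups_done and pos == len(text):
            return True
        # Find how far a run of 'a' characters extends from pos.
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Greedy: try the longest group first, then back off one character at a time.
        for stop in range(end, pos, -1):
            if match_groups(stop, True):
                return True
        return False

    return match_groups(0, False), steps


if __name__ == "__main__":
    # A trailing 'b' forces the matcher to try every way of splitting the run of 'a's.
    for n in (5, 10, 15):
        matched, steps = count_backtracking_steps("a" * n + "b")
        print(n, matched, steps)
```

In this sketch, each extra `a` before the non-matching `b` doubles the number of steps, while an input that matches outright finishes almost immediately. A rule running in a log-only mode would presumably still have to evaluate its pattern against each request, which would explain how a rule that blocked no traffic could nonetheless saturate the processors.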

Once the new rules had been rolled back, CPU usage returned to normal and traffic had resumed by 15:09 BST. But in the meantime a number of sites that depended either on Cloudflare or on other suppliers faced major issues. CoinDesk, for example, had to clarify that the price of Bitcoin had not fallen to $26 after it displayed “bad data from our providers” as a result of the outage.