On September 10, 2012, GoDaddy experienced a massive outage that left thousands of customers without their websites. There has been a lot of speculation about the cause, but the GoDaddy CIO has now broken the company’s silence in a detailed 1,500-word blog post.
Chief Information Officer Auguste Goldman revealed that the outage was caused by a “perfect storm” of issues that came together to bring down the network. The post dismissed outright the idea that the company had been hacked, as a member of Anonymous had claimed.
The post described GoDaddy as handling an average of 10 billion DNS queries a day across 41 million DNS zones. These DNS requests are pushed through the company’s anycast BGP routing system. The system is designed to keep latency low even if some of the hardware fails, but on that day the event pushed the routers beyond their normal limits.
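To put those figures in perspective, the short Python sketch below works out the averages they imply. It uses only the two numbers quoted from the post. The variable names are illustrative, and the results are plain averages that say nothing about peak load or how traffic is spread across zones.

```python
# Back-of-the-envelope averages from the two figures quoted in the post.
# Variable names are illustrative; only the two input numbers come from the post.
QUERIES_PER_DAY = 10_000_000_000  # "10 billion" DNS queries per day
DNS_ZONES = 41_000_000            # "41 million" DNS zones
SECONDS_PER_DAY = 24 * 60 * 60

queries_per_second = QUERIES_PER_DAY / SECONDS_PER_DAY
queries_per_zone_per_day = QUERIES_PER_DAY / DNS_ZONES

print(f"{queries_per_second:,.0f} queries per second on average")
print(f"{queries_per_zone_per_day:,.0f} queries per zone per day on average")
```

That works out to roughly 116,000 queries every second, or about 244 queries per zone per day.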
Goldman summarized the outage as the result of three contributing factors: router hardware failure modes, router memory exhaustion, and containment. Within minutes, the routing hardware failure had set off a chain reaction that led to massive outages across the system.
GoDaddy restored its services by filtering the routing information, restoring the routing table, and completely rebooting the routers that were struggling to cope. The main difficulty, according to Goldman, was dealing with the timed-out DNS traffic still on the network.
Goldman said the post was meant to fulfill the company’s goal of providing transparency “and detail the specific elements we have implemented to prevent another such occurrence.”
What remains to be seen is how the outage will affect the web hosting company’s image going forward.