The Facebook Outage Wasn’t a DDoS Attack, but It Shines a Light on Digital Resilience Planning

Yesterday’s far-from-subtle, multi-property outage at Facebook, Instagram and WhatsApp, and its cascading effect on other properties and plug-ins, reminded us that digital resilience is core to a successful online presence.

The outage lasted around six hours on Monday, 4 October, and was also reported to have affected internal systems that appeared to depend on the resources that were down, including access to Facebook’s physical properties. Adam Mosseri of Instagram likened the inability to operate to a “snow day” for Facebook employees, as they effectively couldn’t work.

The business impact was clear and showcased the effects of a lack of resilience. A few items reported included:

Outages can be driven by many things. One of the initial conversations within A10 was whether this was a DDoS attack, and external parties asked the same question. The site was down with no response from the servers, not even a failure page, so it might have been. However, the A10 Security Research team saw no unusual activity in our honeypots or other monitoring systems, but did note the DNS and BGP issues. This pointed to core infrastructure issues as the cause of the outage, which Facebook confirmed late yesterday:

“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”

If you want to read more, ThousandEyes provided a technical article on the outage, covering the DNS and BGP details, while KrebsOnSecurity also offered a detailed summary.
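As a rough illustration of the triage question discussed above (is the name failing to resolve, or are the servers simply not answering?), the following Python sketch separates the two symptoms. It is a minimal heuristic under stated assumptions, not the method used by A10 or Facebook: the function name `classify_outage` and its labels are hypothetical, and the `resolve` and `connect` parameters exist only so the check can be exercised without touching the network.

```python
import socket


def classify_outage(hostname, port=443, timeout=3.0,
                    resolve=socket.getaddrinfo,
                    connect=socket.create_connection):
    """Return a coarse, hypothetical label for why a service looks down.

    "dns_failure"  -> the hostname could not be resolved
    "unreachable"  -> the name resolved, but no address accepted a TCP connection
    "reachable"    -> at least one resolved address accepted a TCP connection
    """
    try:
        # Step 1: can the name be resolved at all?
        addresses = resolve(hostname, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return "dns_failure"
    if not addresses:
        return "dns_failure"

    # Step 2: try each resolved address until one accepts a connection.
    for _family, _socktype, _proto, _canonname, sockaddr in addresses:
        try:
            conn = connect(sockaddr[:2], timeout=timeout)
        except OSError:
            # Timeouts and refused connections both end up here.
            continue
        conn.close()
        return "reachable"
    return "unreachable"


if __name__ == "__main__":
    # Example invocation; the hostname is only a placeholder.
    print(classify_outage("example.com"))
```

A `dns_failure` result would be consistent with the DNS issues noted above, but none of these labels proves or rules out an attack on its own. They only narrow down where to look next.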

Outages will happen, no matter how much we plan. It’s a fact of life IT professionals will always have to deal with. The challenge we face is how to mitigate that risk as much as possible and how we respond in times of crisis. While not all are specific to the Facebook outage, some best practices include:

The emphasis on digital resilience, in both technology and planning, is becoming a bigger issue, and examples like the Facebook outage amplify it. The outage serves as a visceral reminder of the impact of downtime.

I am sure the internal Facebook team tasked with fixing this outage was not having a “snow day” yesterday.

Paul Nicholson
October 5, 2021

About Paul Nicholson

Paul Nicholson brings 24 years of experience working with Internet and security companies in the U.S. and U.K. In his current position, Nicholson is responsible for global product marketing and strategy at San Jose, Calif.-based application networking and security leader A10 Networks. Prior to A10 Networks, Nicholson held various technical and management positions at Intel, Pandesic (the Internet company from Intel and SAP), Secure Computing, and various security start-ups.