Yesterday’s far-from-subtle outage across Facebook, Instagram and WhatsApp, and its cascading effect on other properties and plug-ins, reminded us that digital resilience is core to a successful online presence. It is also a reminder for any organization, including service provider networks and the colocation, data center and hosting providers that increasingly house critical applications and infrastructure.
The outage lasted around six hours on Monday, 4 October, and was also reported to have affected internal systems that appeared to depend on the resources that were down, including access to Facebook’s physical properties. Adam Mosseri of Instagram likened the inability to operate to a “snow day” for Facebook employees, as they effectively couldn’t work.
The business impact was clear and showcased the effects of a lack of resilience. A few items reported included:
Outages can be driven by many things. One of the initial conversations within A10 was whether this was a DDoS attack, a question external parties asked as well. The site was down with no response from the servers, not even a fail page, so it might have been. However, the A10 Security Research team saw no unusual activity in our honeypots or other monitoring systems, though it did note the DNS and BGP issues. This pointed to core infrastructure issues as the cause of the outage, which was confirmed late yesterday. Facebook said yesterday:
“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.”
Outages will happen, no matter how much we plan. It’s a fact of life that IT professionals will always have to deal with. The challenge is how to mitigate that risk as much as possible and how to respond in times of crisis. While not all are specific to the Facebook outage, some best practices include:
The emphasis on digital resilience, in both technology and planning, is growing, and examples like the Facebook outage amplify it. The outage serves as a visceral reminder of the impact of downtime.
I am sure the internal Facebook team tasked with fixing this outage was not having a “snow day” yesterday.