Configuration Change, Not Lizard Squad, Takes Down Facebook
Facebook suffered a 45-minute outage late Monday (this morning in European and Asian time zones) that took both Facebook and Instagram offline and left Tinder users logged out. Both Facebook's mobile application and its website were inaccessible, and users posted images of a Facebook error message. Users around the world posted complaints on Twitter, according to TechCrunch. Initially the hacking group Lizard Squad tried to take responsibility (as reported in Forbes), having taken down PSN and Xbox Live over Christmas and carried out a number of other recent attacks. Facebook, however, announced that the outage "occurred after we introduced a change that affected our configuration systems."
Gigaom explained that "Downtime was caused by an internal boo-boo, not a hack." The Wall Street Journal elaborated, explaining that "Facebook says 45-Minute disruption was due to Configuration Change." The Huffington Post consoled us all, saying "Facebook is back up now. Life has resumed."
When a service with over a billion users is not working, users vent their rage and publicize the difficulties with the service.
Mishandled Changes are Painful
In complex distributed systems like Facebook's IT infrastructure, change happens constantly. Patches are installed, infrastructure is upgraded, automated scripts change files and system environments, and configurations are even changed manually. These activities are usually performed properly, but sometimes they don't go through as expected. When they don't, the cause may only become apparent when a failure occurs. Change and configuration management is key to the entire IT operations process. Every change to the infrastructure, whether the deployment of new hardware or applications or some other change, affects the stability of IT environments. An organization that can confidently manage change on a continuous basis gains the visibility it needs to help ensure that its infrastructure is safe and that its operations can run faster and smarter.
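One way to get that visibility is to compare a known-good configuration snapshot against the current one and flag anything that was not part of an approved change. The following is a minimal sketch of that idea, not a description of how Facebook or any particular product works. The function names, the configuration keys, and the notion of an "approved" key set are all hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List, Set, Tuple

# (key, kind of change, old value, new value)
Change = Tuple[str, str, Any, Any]


def diff_config(before: Dict[str, Any], after: Dict[str, Any]) -> List[Change]:
    """Return every key that was added, removed, or modified between snapshots."""
    changes: List[Change] = []
    for key in sorted(set(before) | set(after)):
        if key not in before:
            changes.append((key, "added", None, after[key]))
        elif key not in after:
            changes.append((key, "removed", before[key], None))
        elif before[key] != after[key]:
            changes.append((key, "modified", before[key], after[key]))
    return changes


def unapproved_changes(changes: List[Change], approved_keys: Set[str]) -> List[Change]:
    """Keep only the changes whose key is not covered by an approved change request."""
    return [change for change in changes if change[0] not in approved_keys]


# Hypothetical snapshots: a known-good baseline and the current state.
baseline = {"max_connections": 100, "cache_ttl": 300, "feature_x": False}
current = {"max_connections": 100, "cache_ttl": 30, "feature_x": True, "debug": True}

# Hypothetical change request that only covers enabling feature_x.
approved = {"feature_x"}

for key, kind, old, new in unapproved_changes(diff_config(baseline, current), approved):
    print(f"UNAPPROVED {kind}: {key} {old!r} -> {new!r}")
```

Running a comparison like this on every deployment, rather than after a failure, is what lets a seemingly minor change surface before it causes an outage.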
What Caused the Outage?
Facebook's outage, which also took down the linked services Instagram and Tinder, was due to a technical issue caused by the company itself and not by external factors.
Facebook attributed the outage to its own change. "This was not the result of a third-party attack but instead occurred after we introduced a change that affected our configuration systems," the Facebook statement said. "Both services are back to 100 percent for everyone." CNET added that "Today's outage appears to be one of the worst in four years, after Facebook was broken for two and a half hours back in September 2010. The fact that Facebook's internal snafu spread to other sites and apps highlights a possible danger of Mark Zuckerberg's goal of placing the social network at the heart of the 'social graph', as it means problems like these can quickly ripple outwards." Problems during configuration changes are often cited as reasons for outages: seemingly minor IT changes, both authorized and unauthorized, can impact performance, as today's outage illustrates.
IT Operations Analytics
IT Operations teams in general, not only at Facebook, face new levels of challenge in change and configuration management because of the complexity and dynamism required in today's operations. As the Facebook outage shows, organizations suffer tremendous public relations difficulties as well as cash flow impact, productivity losses, ripple effects on connected systems, and ultimately a loss of confidence in IT. With the explosion of operational data, there has been significant demand for IT Operations Analytics systems that provide immediate insight into operations performance. Gartner analyst Will Cappelli recognized IT Operations Analytics as an area on the rise, saying, "We have taken a position that by 2018, 25% of Global 2000 will have deployed an IT Operations Analytics platform taking data feeds from variety of performance and availability systems, that's up from about 2% today."