The Facebook Outage Heard Around the World - What IT Ops Can Learn
Facebook blames a server configuration change for its massive outage last week. Why it lasted so long and what IT Ops and Cloud Ops can learn.
- - - -
If you live on Earth, you could hardly have missed the recent Facebook outage.
Last week, Facebook released a statement on Twitter attributing the major outage it experienced last Wednesday to a server configuration change.
The vague explanation, with no further hints at a more detailed report to come, has left users, businesses, investors, and advertisers confused and disappointed.
The outage, which affected millions of users across Facebook, Instagram, Messenger, WhatsApp, and Oculus, lasted for approximately 14 hours in the United States and even longer in some areas around the globe.
In other words, tens of millions of dollars were lost, and some IT Ops executives were not having the best time of their lives.
The outages began Wednesday afternoon ET and appeared to affect people in multiple regions, including the United States, Central and South America, and Europe.
DownDetector, a website that offers real-time overviews of outages across a variety of industries, first started receiving reports around 12:01 PM ET on Wednesday. The site received thousands of reports of issues each hour from users around the world, peaking at over 12,000 around 9:00 PM ET.
IT and Cloud Ops Worst Nightmare
“By duration, this is by far the largest outage we have seen since the launch of DownDetector in 2012,” said Tom Sanders, co-founder of DownDetector. “Our systems processed about 7.5 million problem reports from end users over the course of this incident. Never before have we seen such a large-scale outage.”
Many Facebook users got an error message during the outage that said: “Sorry, something went wrong. We're working on getting this fixed as fast as we can,” while users on Instagram were unable to post photos at all.
With Facebook itself counting more than 2.3 billion users, and Instagram over 1 billion, it was only a matter of time before the comments started rolling in. People went on Twitter to vent their frustration about the problems. The hashtags #FacebookDown and #InstagramDown were trending on Twitter for much of the day, and the event was even added to Twitter Moments later in the day.
It wasn’t long before the memes also started flooding feeds, with Twitter users joking that the outage across Facebook's suite of social media channels was finally giving competitors a chance.
Don’t Call It a Comeback
At 12:41 AM ET on Thursday, Instagram finally confirmed via Twitter that it was back online.
However, some users continued to report problems with uploading and commenting throughout the day on Thursday, particularly those located in Asia.
What Went Wrong, and Why Did It Take So Long to Recover?
As mentioned in its tweet, Facebook blamed the massive outage on a server configuration change. After the dust had settled, company representatives started making statements. One spokesperson told Sky News: “Yesterday, we made a server configuration change that triggered a cascading series of issues. As a result, many people had difficulty accessing our apps and services. We have resolved the issues, and our systems have been recovering over the last few hours. We are very sorry for the inconvenience and we appreciate everyone's patience.”
Since 14 (!) hours passed before Facebook got on top of the issue and fixed the outage, we can only assume that Facebook did not immediately know which change had created this mess.
They needed to understand the change path in order to track down the exact root cause and recover. It looks like that took a lot of digging.
Changes Are an Inevitable Evil for IT and Cloud Ops
Facebook's description is brief, and this sort of issue obviously causes performance and availability incidents. But what exactly goes wrong when misconfiguration or “undesired changes” occur?
To start with, undesired changes can affect many parts of the IT environment: configuration, data, capacity, code, and workload. Most organizations recognize that undesired changes are happening not because they are notified of a problem ahead of time, but because someone notices the symptoms: higher CPU usage, slower transactions, an extremely slow Java query, or, in Facebook’s case, a horrible, brand- and revenue-killing outage.
While these symptoms can point IT operations in the general direction of the problem, they do not in themselves indicate its actual root cause.
Keep Your Eyes on Changes
Any change that happens in the IT environment must be monitored and analyzed for risk at the most granular level, as even minor changes can turn into high-impact incidents, as in the Facebook example. Yet while unknown changes are the root cause of a majority of stability issues, IT departments still struggle to correctly identify what actually changed.
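To make "identifying what actually changed" concrete, here is a minimal sketch in Python that compares two snapshots of an environment's configuration and lists every added, removed, or modified setting. The function name diff_snapshots, the flat key names, and the sample values are all hypothetical and chosen only for illustration; this is not how Facebook or any particular vendor tracks changes.

```python
from typing import Any, Dict, List, Tuple

# One change record: (kind, key, old_value, new_value)
Change = Tuple[str, str, Any, Any]


def diff_snapshots(before: Dict[str, Any], after: Dict[str, Any]) -> List[Change]:
    """Compare two flat configuration snapshots and list what changed."""
    changes: List[Change] = []
    # Walk the union of keys so additions and removals are both caught.
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            changes.append(("added", key, None, after[key]))
        elif key not in after:
            changes.append(("removed", key, before[key], None))
        elif before[key] != after[key]:
            changes.append(("modified", key, before[key], after[key]))
    return changes


if __name__ == "__main__":
    # Hypothetical snapshots taken before and after a deployment.
    before = {"db.pool_size": 20, "cache.ttl": 300, "web.workers": 4}
    after = {"db.pool_size": 5, "web.workers": 4, "feature.new_router": True}
    for kind, key, old, new in diff_snapshots(before, after):
        print(f"{kind:9} {key}: {old!r} -> {new!r}")
```

A list like this answers "what changed?" directly, instead of inferring it from symptoms such as slow transactions. Deciding which of the listed changes is risky is a separate problem that this sketch does not attempt.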
Wait, Why Don’t More Enterprises Use Advanced Solutions to Track IT Changes?
Collecting information about changes in the environment state is not an easy task, particularly for configuration at the granular level and in near real time. A huge amount of data is stored in different formats across different sources (files, databases, the Windows registry, APIs, system utilities, etc.), and the volume of changes to the environment state is enormous.
As a result, both IT departments and vendors have focused on the far more accessible area of symptoms instead of keeping their eyes on changes. Simply put, they monitor performance because it is easier, and they leave the most critical part neglected.
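Part of the difficulty described above is that the same kind of setting can live in very different formats. The following sketch, written in Python with only the standard library, shows one possible way to normalize two such formats (JSON and INI) into a single flat key-to-value map so that they can be compared uniformly. The helper names flatten, from_json, and from_ini are hypothetical, and real environments involve many more sources than these two.

```python
import configparser
import json
from typing import Any, Dict


def flatten(obj: Any, prefix: str = "") -> Dict[str, Any]:
    """Turn nested dictionaries into a flat map with dotted keys."""
    if isinstance(obj, dict):
        flat: Dict[str, Any] = {}
        for key, value in obj.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            flat.update(flatten(value, child))
        return flat
    # Leaf value (number, string, list, ...): store it under its full path.
    return {prefix: obj}


def from_json(text: str) -> Dict[str, Any]:
    """Normalize a JSON document into the flat form."""
    return flatten(json.loads(text))


def from_ini(text: str) -> Dict[str, Any]:
    """Normalize an INI document into the flat form (values stay strings)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        f"{section}.{key}": value
        for section in parser.sections()
        for key, value in parser[section].items()
    }


if __name__ == "__main__":
    print(from_json('{"db": {"pool": {"size": 20}}, "debug": false}'))
    print(from_ini("[db]\npool_size = 20\n"))
```

Even in this small example the two sources disagree on types (the INI value is the string "20", the JSON value is the number 20), which hints at why collecting granular change data across a whole environment is harder than monitoring performance.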
I am not sure why Evolven is the only vendor in the industry (at least the only one that I know of) that focuses on Change Analytics, with a technology that continuously monitors all changes carried out in the IT environment and analyzes every granular change for risk, whether it is automated or manual, planned or unplanned, authorized or unauthorized. Think about it: it makes so much sense, doesn't it?
How Is It Done?
Using patented analytics and machine learning, Evolven helps IT Operations, Cloud Ops, and DevOps teams analyze changes and get visibility into their risk, and hence experience fewer incidents in the first place, improve productivity, and greatly reduce outages and downtime. This type of technology, which complements APM systems, collects the most granular change information across the entire end-to-end environment and applies analytics to surface the highest-risk changes and inconsistencies.
The Bottom Line
Unfortunately for Facebook, it seems that no one on board was using a technology that could have noticed an undesired change right away. As a result, fixing the outage took hours rather than minutes.
Whether an organization is a goliath of social media or simply a startup responsible for a service, keeping servers running without incidents is vital for sales and credibility. If a company like Facebook can experience such a massive issue from something like a configuration change, then it is apparent that it could happen to almost anyone.
To keep these issues from happening again, Facebook needs to invest in change monitoring technology that can catch undesired changes before they turn into incidents.