‘What Change Caused this Mess?’ 8 Lessons IT Ops Must Learn from the Facebook Outage
Even though it’s 2019, enterprise IT teams are still struggling with massive outages and other performance incidents that have far-reaching impacts across multiple aspects of the business, not to mention the bottom line.
Take, for example, Facebook’s latest massive outage, which affected millions of people worldwide, lasted for over 14 hours and is believed to have cost Facebook tens of millions of dollars!
Blamed on a “server configuration change,” this human error not only knocked out the social media giant’s main platforms like Facebook, Instagram, and WhatsApp but also further harmed the company in terms of credibility and reliability, as seen in endless media mentions.
These types of outages are considered IT and Cloud Ops’ worst nightmares, not only damaging performance and resulting in a massive loss of revenue (ad-related revenue in the case of Facebook), but also badly impacting the reputation and credibility of the brand.
I am sharing here my list of 8 top lessons that IT Operations can learn from the recent Facebook outage:
1. It happens to the best
These types of incidents have been happening for the past 20 years, are considered very common (though painful), and the industry is still struggling with too many of them.
Amazon, Microsoft, Facebook, Google, Delta Airlines, HSBC and many other giants make up only a partial list of the enterprises that “enjoyed” being mentioned in the news at one point or another because of an enormous outage that was probably caused by some kind of granular change in their network, data applications or the like.
These enterprises are known for the reliability of their systems, the sophistication of their platforms, and the high level of automation in their technical processes. If it happened to them, no one is safe.
2. Outages today have a far-reaching impact
Major outages are unfortunately not rare, and they tend to cause huge collateral damage. Whether through damage to brand reputation, business losses or compliance-related costs, organizations simply can’t afford to let outages keep happening and keep being managed the way they are today (although many of them don’t fully understand the costs and consequences of outages).
As companies and systems continue to become more reliant on IT, it becomes even more crucial to try and keep things up and running with 100% uptime, and if something goes wrong, fix it instantly.
3. The speed of transformation in IT leads to vulnerability to changes
Today’s application complexity, cloud migrations, the move to microservices and containers, and the accelerated rate of change all come together to create a huge challenge for organizations.
Add the common shift to DevOps and agile processes to the blend (a shift that creates continuous pressures to accelerate release management), and you’ve got yourself an almost impossible situation to control.
‘What change triggered the problem?’
Changes move through their life cycle concurrently, relying on loose coupling between the system’s components. However, the complexity of modern systems still leaves dependencies that result in change conflicts, cascading failures and failures with global impact. On their own, without external tools, a developer or an operator simply cannot foresee the full extent of the impact that their changes can produce.
The safety nets provided by modern systems (self-healing, rollback on failure, high availability, etc.) fail to provide a sufficient answer. Outages like the one Facebook suffered are a reminder that there is not yet a zero-error zone.
4. The answer is NOT optimized ‘change management processes’ or ‘change automation’
Common attempts to address major issues caused by a change (or changes) in IT environments focus on optimizing processes - such as improving reviews prior to approval, allowing zero manual changes, defining stricter unit testing, etc.
These are typical behaviors that I see in enterprises, following major outage incidents.
Is that good? Sure. Is that a solution? Absolutely not.
Automated processes require automated controls to keep up with the pace of changes, and the complexity of environments makes it hard, if not impossible, to predict the impact of changes.
For instance (this is a true story that we encountered not too long ago): A change can be made according to all predefined protocols and after factoring in all the parameters that can be foreseen. Like God intended :)
But it can still mess with a critical performance parameter, in a way that no one could predict.
When that’s the case, better processes will bring no value… (I’ll explain more below)
5. Firefighting is much more costly than prevention
Most of the time, ‘bad’ changes do not show their impact as soon as they are made. Rather, it takes the systems some time to hit an erroneous condition set up by a change. When that happens, you no longer know which change triggered the issue you are dealing with.
‘How come you don’t track and analyze all changes?’
Focusing on tracking changes as the triggers of future issues seems to be much more effective than trying to recognize an issue brewing in endless performance metrics. When performance is dropping, something must have changed and started it, and there is very little time to fix the situation, as the issue has already started to manifest itself.
For example (another real-life one), you might see a symptom such as growing CPU consumption, which means that an issue was triggered by some change. Right then you have no idea what’s going on.
Depending on the speed of the response, it could either be intercepted soon after it started, or later on.
However, if the problem was caused (for instance) by a change in the configuration of the maximum number of working threads in the application server, many hours will probably pass before the workload grows and hits a limit, over-utilizing the CPU.
By then, you just can’t tell what went wrong and what happened. No clue.
If you could only attribute the issue you are facing to the change that was initially made and triggered the fault, a future CPU issue would be dealt with as soon as it happens, right?
In other words, if you had the right technology by your side and could track changes in your IT and cloud environments, and even assess the risk behind any change, you could prevent harmful changes from occurring in the first place and, no less important, quickly trace an issue back to the change that had already triggered it.
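To make the idea of attributing a symptom to an earlier change a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not a description of how any particular product works: the `Change` record, the `candidate_changes` function, the 24-hour lookback window, and all sample dates and values are hypothetical assumptions. It simply lists the changes recorded shortly before an incident, putting changes to the affected component first.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """A recorded change (hypothetical schema, for illustration only)."""
    when: datetime
    component: str
    parameter: str
    old: str
    new: str


def candidate_changes(changes, incident_time, component,
                      lookback=timedelta(hours=24)):
    """Return changes made within `lookback` before the incident.

    Changes to the affected component come first; within each group,
    the most recent change comes first.
    """
    window_start = incident_time - lookback
    in_window = [c for c in changes
                 if window_start <= c.when <= incident_time]
    return sorted(
        in_window,
        key=lambda c: (c.component != component, incident_time - c.when),
    )


# Hypothetical change log: a thread-limit change made hours before the
# CPU symptom appears, plus two unrelated changes.
changes = [
    Change(datetime(2019, 1, 15, 9, 0), "app-server",
           "max_worker_threads", "200", "2000"),
    Change(datetime(2019, 1, 15, 15, 30), "firewall",
           "port_8443", "open", "closed"),
    Change(datetime(2019, 1, 12, 11, 0), "app-server",
           "log_level", "INFO", "DEBUG"),
]

cpu_alert_time = datetime(2019, 1, 15, 18, 0)
suspects = candidate_changes(changes, cpu_alert_time, "app-server")
for c in suspects:
    print(c.when, c.component, c.parameter, c.old, "->", c.new)
```

With a log like this, the thread-limit change made nine hours before the alert is the first suspect, even though a more recent (but unrelated) firewall change exists. Real environments would need far richer context than a time window and a component name.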
6. Small changes can lead to big issues
In the end, many of the key issues can be narrowed down to small changes of a configuration parameter, database query, firewall policy, routing table or the like. These small and simple changes (I call them granular changes) can create cascading effects that:
a) Make the root cause detection of a failure VERY hard;
b) Lead to a very hard-to-recover-from impact.
Here’s a recent use case taken from one of our customers (a financial enterprise):
- The customer misconfigured routing. A complete geo area in the company's WAN became inaccessible;
- The business data that was supposed to be uploaded to an application repository in another region started to accumulate in the inaccessible area;
- Once the routing was resolved, the data was shipped to the target server, overloading it and triggering recovery;
- Other regions started to accumulate data during the recovery and the process repeated on a greater scale when the server was recovered;
- A key system went down. The resolution required a manual upload of the accumulated data from each region. The system stayed offline for hours until all the data was synchronized.
7. Granular (actual) changes are hard to find
Here are some real-life examples of harmful granular changes that were detected using Evolven:
- Connection timeout of an application server was changed from 5 minutes to 60 seconds
- Length of a text field in a database table was reduced
- Index on a database table was dropped
- Firewall port was closed
- A new Windows service was installed
- Number of CPU cores allocated for a virtual machine was reduced
These examples perfectly demonstrate the level of granularity that can turn into huge IT incidents. Tracking these granular changes and identifying the exact ones that triggered the incidents is a formidable task.
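As a rough illustration of what detecting granular changes involves at the simplest level, the sketch below compares two configuration snapshots and reports every setting that was added, removed or modified. This is a simplified assumption of mine and not how Evolven is implemented: the `diff_snapshots` function, the flat key names and the sample values (other than the 5 minutes to 60 seconds timeout) are hypothetical, and the code assumes that no real setting has the value `None`.

```python
def diff_snapshots(before, after):
    """Return (key, old, new) for every setting that differs.

    A value of None means the setting was absent in that snapshot
    (this sketch assumes real settings are never None).
    """
    diffs = []
    for key in sorted(before.keys() | after.keys()):
        old = before.get(key)
        new = after.get(key)
        if old != new:
            diffs.append((key, old, new))
    return diffs


# Hypothetical snapshots of one environment, taken before and after.
before = {
    "appserver.connection_timeout_s": 300,
    "vm.cpu_cores": 8,
    "firewall.port_8443": "open",
    "db.orders.index_customer_id": "present",
}
after = {
    "appserver.connection_timeout_s": 60,
    "vm.cpu_cores": 4,
    "firewall.port_8443": "closed",
    "db.orders.index_customer_id": "present",
    "windows.service.new_agent": "installed",
}

granular = diff_snapshots(before, after)
for key, old, new in granular:
    print(f"{key}: {old!r} -> {new!r}")
```

The hard part in practice is not the comparison itself but collecting complete snapshots across thousands of components and telling the few risky differences apart from the routine ones.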
It took Facebook 14 hours to troubleshoot their misconfiguration. Let’s not forget we are talking about Facebook. If they needed 14 hours, I am not sure how much time it would have taken a “regular” enterprise.
EMA research found that “the industry average time to resolve any performance issues is just under six hours, with 20% of IT organizations spending more than 11 hours on each.”
8. When you focus on the symptoms rather than on the real root cause, you are making a big mistake
According to Gartner, about 85% of performance incidents can be traced back to changes.
A key problem is the industry’s myopic focus on monitoring only the health of IT and cloud environments (e.g. higher CPU usage, slower transactions, an extremely slow Java query, or in Facebook’s case - a massive outage).
While monitoring these indicators is very important, let’s not forget that they usually represent symptoms of performance incidents and not their true root cause.
And while these indicators can guide IT operations towards the general direction of where the problem originates, they cannot independently indicate the actual root cause.
We call it "Symptompia".
What many IT teams are completely missing is the opportunity to gain visibility into the true root causes of performance incidents, which are by definition changes that were made prior to the event and triggered these incidents.
For an effective causal analysis aimed at detecting unauthorized, unknown or inconsistent changes, IT teams must use technologies that track, analyze and score the actual changes being done by people or machines.
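To show what scoring a change could look like in its most basic form, here is a small rule-based sketch. Everything in it is a hypothetical assumption for illustration: the `score_change` function, the dictionary fields, the ticket IDs and the weights are invented, and a real solution would rely on much richer analytics than three hand-written rules. It flags a change as unauthorized when it has no approved ticket and as inconsistent when its new value differs from the value most peer hosts use.

```python
from collections import Counter


def score_change(change, approved_tickets, peer_values):
    """Return (score, reasons) for a change using hypothetical rules."""
    score, reasons = 0, []

    # Unauthorized: the change is not linked to an approved ticket.
    if change.get("ticket") not in approved_tickets:
        score += 5
        reasons.append("unauthorized: no approved ticket")

    # Inconsistent: the new value differs from what most peers use.
    if peer_values:
        most_common, _ = Counter(peer_values).most_common(1)[0]
        if change["new"] != most_common:
            score += 3
            reasons.append("inconsistent: differs from peer hosts")

    # Changes in production carry extra weight.
    if change.get("environment") == "production":
        score += 2
        reasons.append("production environment")

    return score, reasons


risky = {"host": "app-03", "parameter": "connection_timeout_s",
         "new": 60, "ticket": None, "environment": "production"}
benign = {"host": "app-04", "parameter": "connection_timeout_s",
          "new": 300, "ticket": "CHG-1001", "environment": "staging"}

approved = {"CHG-1001"}
peers = [300, 300, 300]

risky_score, risky_reasons = score_change(risky, approved, peers)
benign_score, benign_reasons = score_change(benign, approved, peers)
print(risky_score, risky_reasons)
print(benign_score, benign_reasons)
```

Even a crude score like this lets a team sort a long list of detected changes so that the unapproved, out-of-line production change is reviewed first.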
The Evolven Connection
A customer recently told me: “it’s 2019, and even with the best monitoring tools in the world we are still facing too many incidents and suffer from long troubleshooting time….It’s time to reframe the problem and take a more advanced approach”.
Evolven Change Analytics offers a unique AIOps solution and is the only technology centered on the actual root cause of performance incidents – changes that were made.
Imagine tracking all actual changes (yes, even granular ones) carried out in your enterprise cloud or other IT environment, analyzing them and tagging them by their type and risk potential. Evolven makes it possible.
Our technology uses patented machine learning analytics to correlate the changes it collects with performance monitoring events and with additional parameters that are derived from APM systems.
The result: You get visibility of the actual changes that caused the incident, so you can fix it much faster and prevent similar changes from being rolled out again.