A Year in Review: 7 Major Outages from 2012
Netflix. Facebook. Gmail. Amazon Web Services. Microsoft Azure.
Looking back at 2012, some big names in technology not only felt the pain of downtime and outages, but also faced the struggle of restoring confidence in their services and returning vitality to their reputations. As competition drives faster releases of new features, and increasingly complex, dynamic operations make infrastructure management more complicated, IT teams face a tornado of issues to stay on top of in order to maintain high performance and availability. We see repeatedly that small unintentional issues, outright mistakes, and configuration changes now have widespread, heavily reported consequences that leave customers publicly questioning the reliability of such services.
Here are 7 very public events caused by infrastructure or application failures that either weren't prevented or weren't caught before they spiraled out of control.
1. Human Error at Amazon Web Services Makes Netflix Unavailable on Christmas.
In a forthright and direct explanation, Amazon conveyed that the data had been "deleted by a maintenance process that was inadvertently run against the production ELB state data. This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time."
2. Critical Change Leaves Facebook Out of Reach.
Facebook went down due to a change made to its infrastructure. Sure, change happens. In IT operations, change is important: it enables continuous improvement of services. In complex, dynamic ecosystems such as Facebook's IT infrastructure, change happens a lot. On any given day, infrastructure is being upgraded, patches are being installed, automated processes are running that alter files and system environments, and configurations are being changed manually. Sometimes these activities are performed correctly and ... sometimes they're not. When they're not, the cause may be identified only when a failure occurs.
3. Gmail Crashes Following Configuration Change.
As data center infrastructure becomes more and more complex and the pace of change increases, the challenge of keeping track of changes grows. According to Google, the outage occurred when its Sync Server, which relies on a separate component to enforce quotas on per-datatype sync traffic, failed. That quota service "experienced traffic problems today due to a faulty load balancing configuration change."
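A faulty load-balancing change of this kind can often be caught with a pre-push sanity check. Here is a minimal, illustrative sketch (the backend names and config shape are assumptions, not Google's actual tooling) of rejecting a weight configuration that would starve a service of traffic:

```python
# Hypothetical sketch: sanity-check a load-balancer backend config
# before applying it, so a faulty change is caught prior to rollout.

def validate_lb_config(backends):
    """Return a list of problems; an empty list means the config is sane."""
    errors = []
    if not backends:
        errors.append("no backends defined")
        return errors
    total = sum(b.get("weight", 0) for b in backends)
    if total <= 0:
        errors.append("total weight is zero; all traffic would be dropped")
    for b in backends:
        if b.get("weight", 0) < 0:
            errors.append("backend %s has negative weight" % b["host"])
    return errors

# A healthy config and a broken one (all weights zeroed out by mistake).
good = [{"host": "quota-1", "weight": 50}, {"host": "quota-2", "weight": 50}]
bad = [{"host": "quota-1", "weight": 0}, {"host": "quota-2", "weight": 0}]

print(validate_lb_config(good))  # []
print(validate_lb_config(bad))   # ['total weight is zero; ...']
```

A gate like this doesn't prevent every bad change, but it turns a class of silent misconfigurations into loud pre-deployment failures.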
4. Microsoft Blames Azure Outage on System Configuration Mistake.
Microsoft attributed the Windows Azure outage to a system configuration mistake affecting customers in western Europe. The result was that Microsoft's public cloud application hosting and development platform was unavailable for about two and a half hours on July 26th. Microsoft didn't say how many customers were impacted.
5. Knight Capital Doesn't Properly Validate Software Release.
It's clear that Knight's software was deployed without adequate verification. With a deadline that could not be extended, Knight had to choose between two alternatives: delaying their new system until they had a high degree of confidence in its reliability (possibly resulting in a loss of business to competitors in the interim), or deploying an incompletely verified system and hoping that any bugs would be minor.
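The verification gap can be made concrete with a simple deployment gate. The sketch below is hypothetical (the check names are invented for illustration, not Knight's actual process); the point is that shipping only when every check passes forces the "delay vs. deploy" tradeoff to be an explicit decision rather than a hope:

```python
# Hypothetical sketch: a deployment gate that blocks a release
# unless every verification check passes.

def verify_release(checks):
    """checks maps a check name to a callable returning True on success.
    Returns the names of the checks that failed."""
    return [name for name, check in checks.items() if not check()]

# Simulated verification suite; one check fails.
checks = {
    "unit_tests_pass": lambda: True,
    "config_applied_on_all_servers": lambda: False,  # the kind of gap
    "staging_replay_clean": lambda: True,            # that goes unnoticed
}

failures = verify_release(checks)
if failures:
    print("BLOCK DEPLOY:", failures)  # BLOCK DEPLOY: ['config_applied_on_all_servers']
```

The gate itself is trivial; the hard part, as Knight's case shows, is having checks that actually cover every server and code path touched by the release.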
6. Merging United and Continental Computer Systems Grounds Passengers.
In one of the final steps involved in merging the two airline companies, United reported technical issues after its Apollo reservations system was switched over to Continental's Shares program. United struggled through at least three days of higher call volumes after the meshing of the systems and websites caused problems with some check-in kiosks and frequent-flier mileage balances. The glitch was another in a long string of technology problems that began in March.
7. Netflix, Reddit, Pinterest, Foursquare and Imgur Go Offline Due to Malfunctioning AWS Server.
The root cause of the problem was a latent bug in an operational data collection agent that runs on the EBS storage servers. Last week, one of the data collection servers in the affected Availability Zone had a hardware failure and was replaced. As part of replacing that server, a DNS record was updated to remove the failed server and add the replacement server. While not noticed at the time, the DNS update did not successfully propagate to all of the internal DNS servers, and as a result, a fraction of the storage servers did not get the updated server address and continued to attempt to contact the failed data collection server.
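The failure mode here, stale DNS answers after a record change, is straightforward to check for. Below is a minimal, illustrative sketch (the resolver hostnames and IPs are invented) of verifying that every internal resolver returns the new address before treating the old server as out of rotation:

```python
# Hypothetical sketch: after updating a DNS record, confirm the change
# has propagated to every internal resolver before moving on.

def check_propagation(expected_ip, answers):
    """answers maps resolver name -> the IP it currently returns.
    Returns the resolvers still serving a stale answer."""
    return {srv: ip for srv, ip in answers.items() if ip != expected_ip}

# Simulated responses from three internal DNS servers after the update.
answers = {
    "dns-a.internal": "10.0.0.9",   # new collection server
    "dns-b.internal": "10.0.0.9",
    "dns-c.internal": "10.0.0.4",   # still serving the failed server's IP
}

stale = check_propagation("10.0.0.9", answers)
print(stale)  # {'dns-c.internal': '10.0.0.4'}
```

Had a check like this gated the server replacement, the partially propagated update, and the storage servers left pointing at a dead collection server, would have surfaced long before clients noticed.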