Accidental Change at Amazon Takes Netflix Offline
After the recent Gmail outage, closely followed by downtime at Facebook that affected millions and generated a few thousand more tweets, Netflix went offline on, of all days, Christmas Eve. Just as millions of families were gathering to pick and watch a movie on the on-demand video platform, the service stopped responding.
For some, the outage lasted as long as 12 hours, which is nothing to sneeze at, especially on the biggest day of the year for people to get together and watch a nice movie, enjoying the convenience of streaming it straight to their TV. As events unfolded, Netflix pointed the finger at its cloud host, Amazon Web Services (AWS), which worked furiously to restore the service while the AWS dashboard provided updates on the progress. (For the record, the problem was found and fixed, and Netflix is back up and running.)
The Wall Street Journal depicted how users took to Twitter and Facebook to complain as the Netflix outage spread, reporting that "Kenneth McIver, a Netflix member for the past four years, was among those affected by the outage on Monday. 'I was with my family most of the day,' said the 44-year-old Atlantic Beach, S.C., resident. 'I came home to relax and watch movies; I tried several times, but I gave up.'"
Wired elaborated on the significance of the event, explaining that "Christmas Eve is a big movie-watching night, so while it's not clear how many Netflix customers were affected, Monday's outage came at a bad time. 'It's been hours & it's Christmas Eve. It's classic movie night!!!' wrote one Netflix user on Twitter."
InformationWeek added that "The outage hit Netflix viewers from Canada to Brazil. It also affected Amazon's own Amazon Prime video-streaming service and Salesforce.com's Heroku cloud platform, which served up HTTP errors and ssl:endpoint unavailability messages during the outage."
So did Amazon become the Grinch that literally stole Christmas cheer and long-awaited family time, or was this a legitimate hiccup in cloud services?
So What Caused this Outage?
Netflix went down because of a change made to the AWS infrastructure. Amazon reported during the outage that the problem stemmed from its Elastic Load Balancing service, which helps spread heavy traffic among multiple servers to prevent overload, saying, "We continue to work on resolving issues with the Elastic Load Balancing Service in the US-EAST-1 region. These issues are affecting updates to both existing and newly created ELBs."
Yet as many Netflix subscribers were left without access to their media, Amazon didn't immediately report the cause of the issue with the Elastic Load Balancing Service.
Forbes reported that "this event follows a six-and-a-half hour outage on EC2 two weeks ago. And one of the selling points of the Cloud is that there are redundancies to prevent just such occurrences. A small step backwards, perhaps, for cloud computing."
Amazon's Post Mortem: Human Error
Amazon's official statement dissected the event and filled in the missing information, giving a full picture of what sparked this event. Amazon clarified that "a portion of the ELB state data was logically deleted."
In a forthright and direct explanation, Amazon said the data was "deleted by a maintenance process that was inadvertently run against the production ELB state data. This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time."
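One common safeguard against this kind of mistake is to make destructive jobs refuse to touch a protected environment unless the operator explicitly confirms the target. The sketch below illustrates the idea only; the names `PROTECTED_ENVIRONMENTS`, `ProtectedEnvironmentError`, and `run_cleanup` are hypothetical and do not describe Amazon's actual tooling.

```python
# Hypothetical guard: names and behavior are assumptions for illustration,
# not a description of how Amazon's maintenance process works.
PROTECTED_ENVIRONMENTS = {"production"}


class ProtectedEnvironmentError(RuntimeError):
    """Raised when a destructive job targets a protected environment unconfirmed."""


def run_cleanup(environment, records, confirm=None):
    """Logically delete records, refusing protected environments unless confirmed.

    To run against a protected environment, the caller must pass
    confirm=<environment name>, which forces the target to be typed out twice.
    """
    if environment in PROTECTED_ENVIRONMENTS and confirm != environment:
        raise ProtectedEnvironmentError(
            "refusing to run cleanup against %r without confirmation" % environment
        )
    for record in records:
        record["deleted"] = True  # logical delete: mark the record, keep the data
    return len(records)
```

With a guard like this, `run_cleanup("production", records)` raises an error, while `run_cleanup("production", records, confirm="production")` proceeds.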
Unidentified Critical Changes
Not only did the accidental action by the Amazon developer working on the production environment initiate the problems, but it also took Amazon a while to pinpoint the actual root cause.
Amazon explained that at first they ran down the wrong path, saying, "Over the next couple hours, our technical teams focused on the API errors. The team was puzzled as many APIs were succeeding (customers were able to create and manage new load balancers but not manage existing load balancers) and others were failing. As this continued, some customers began to experience performance issues with their running load balancers. These issues only occurred after the ELB control plane attempted to make changes to a running load balancer...During this event, because the ELB control plane lacked some of the necessary ELB state data to successfully make these changes, load balancers that were modified were improperly configured by the control plane. This resulted in degraded performance and errors for customer applications using these modified load balancers."
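The failure mode Amazon describes, a control plane applying changes while part of its state is missing, can be illustrated with a completeness check that runs before any reconfiguration. This is a minimal sketch under stated assumptions: the field names in `REQUIRED_STATE_FIELDS` and the `apply_change` function are hypothetical and are not part of the ELB API.

```python
# Hypothetical state fields; the real ELB control plane's data model is not public here.
REQUIRED_STATE_FIELDS = ("listeners", "backends", "health_check")


def apply_change(state, change):
    """Return an updated copy of a load balancer's state.

    Refuses to apply the change if any required field is missing, so an
    incomplete record never gets pushed out as a new configuration.
    """
    missing = [field for field in REQUIRED_STATE_FIELDS if state.get(field) is None]
    if missing:
        raise ValueError(
            "refusing to reconfigure; missing state: " + ", ".join(missing)
        )
    updated = dict(state)  # copy so the caller's record is left untouched
    updated.update(change)
    return updated
```

Failing loudly here turns a silent misconfiguration into an immediate, visible error on the first attempted change.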
IT Operations Analytics
Amazon is not alone: IT operations teams today face challenges that can no longer be handled with existing scripts or manual approaches. Dealing with the complexity and dynamics of today's IT environments calls for more serious analytical power. IT Operations Analytics delivers the intelligence IT operations organizations crave, allowing them to turn piles of IT operations data into actionable information, identify when changes occur, and determine whether they pose a risk.
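At its simplest, identifying when changes occur means comparing a known-good snapshot of an environment against its current state and flagging the differences that touch critical settings. The sketch below shows that idea in miniature; `detect_changes` and the notion of a `critical_keys` set are assumptions made for illustration, not a description of any particular analytics product.

```python
def detect_changes(baseline, current, critical_keys=frozenset()):
    """Compare two configuration snapshots (plain dicts).

    Returns a list of changes sorted by key. Each change records the key,
    whether it was added, removed, or modified, and whether it touches a
    key the operator has marked as critical.
    """
    changes = []
    for key in sorted(baseline.keys() | current.keys()):
        if key not in current:
            kind = "removed"
        elif key not in baseline:
            kind = "added"
        elif baseline[key] != current[key]:
            kind = "modified"
        else:
            continue  # unchanged, nothing to report
        changes.append({"key": key, "kind": kind, "risky": key in critical_keys})
    return changes
```

Run on a schedule against production snapshots, a check like this surfaces an unexpected removal or modification shortly after it happens, rather than hours into the resulting incident.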
The recent Gartner Hype report noted, "A new generation of IT operation analytics technologies is emerging, providing IT the insight to address problems early on. In doing so, IT professionals are preventing the fire drills that result in MTTR (mean time to repair) focus and metrics. In turn, IT should have more time to prevent performance incidents from occurring at all, pursue these preventive 'fixes' in an orderly and efficient manner, and, ultimately, devote more time to optimizing the use of technology for business gain."