Configuration Change Causes Azure Cloud Outage
Coming on the heels of a number of high-profile outages that we recently reported (Comcast, Facebook, Google), Microsoft Azure cloud customers experienced an outage lasting nearly 11 hours in November 2014. For end users who simply want to enjoy a movie or share a photo, hours of downtime can feel like an eternity. The event generated press throughout the industry as well as in mainstream news channels, further raising red flags about the reliability of cloud platforms.
ZDNet reported, "The nearly 11-hour outage that hit Microsoft Azure customers earlier this week was due to a performance update Microsoft made to Azure storage services, according to company officials."
ComputerWeekly elaborated on the significance of the event, explaining that "Websites have been sent crashing as a result of problems with Microsoft's Azure cloud computing platform."
GigaOM emphasized the significance of this event and the need to understand the cause, reporting that "A lack of clarity — or even a perception of that lack — about underlying issues, is certainly not good for a company trying to woo enterprise accounts and business applications to its cloud and catch up with public cloud leader Amazon Web Services."
And the story spread over Twitter with tweets like:
Microsoft's slow response to the recent Azure outage left some users wondering if they should entrust critical busin…http://t.co/hGaz3aDYIN— ITechPreneur (@ITechPreneur) November 25, 2014
.@Azure Not making any new friends with yet another outage, this time re: VSO. Considering GitHub.— Michael Larrabee (@dotcommike) November 24, 2014
What Caused Such a Pervasive Outage?
Companies deploying customer-facing applications to the cloud have learned that they must give careful consideration to how their applications are developed, deployed, instrumented and managed in order to maintain the level of service and performance that their customers expect.
According to Microsoft, the Azure outage traces back to a configuration change: "A configuration change meant to make Blob storage (Azure's cloud storage service for unstructured data) perform better unexpectedly sent Blob front ends 'into an infinite loop.'"
Troubles like this clearly highlight the impact of downtime on companies. With the move to Cloud and SaaS delivery models, both customer-facing applications and an organization's entire IT infrastructure are at risk. Cloud-based offerings need to be managed differently, in terms of configuration management, to ensure that the needs of the customer can be met. The need for good configuration management practices does not end when services (or parts of services) are moved to the cloud.
Problems during configuration changes are often cited as reasons for cloud outages: seemingly minor IT changes, both authorized and unauthorized, can impact performance, as the Azure outage illustrates.
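One common way to catch minor or unauthorized changes is to compare a running configuration against an approved baseline. The following is a minimal Python sketch of that idea, not a description of any vendor's tooling; the function names (fingerprint, changed_keys) and the example settings are hypothetical and chosen only for illustration.

```python
import hashlib
import json

_MISSING = object()  # sentinel so "key absent" differs from "key set to None"


def fingerprint(config: dict) -> str:
    """Return a stable SHA-256 fingerprint of a configuration dictionary."""
    # Sorting keys makes the fingerprint independent of key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_keys(baseline: dict, current: dict) -> list:
    """List the settings that were added, removed, or altered since the baseline."""
    keys = set(baseline) | set(current)
    return sorted(
        k for k in keys
        if baseline.get(k, _MISSING) != current.get(k, _MISSING)
    )


# Hypothetical settings, for illustration only.
approved = {"cache_size": 512, "timeout_s": 30}
running = {"timeout_s": 30, "cache_size": 1024, "debug": True}

if fingerprint(approved) != fingerprint(running):
    print("Configuration drift detected in:", changed_keys(approved, running))
```

A check like this says nothing about whether a change is harmful; it only makes sure that no change goes unnoticed.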
Microsoft's Detailed Explanation
In an Azure Blog post, Microsoft's CVP for the Microsoft Azure Team, Jason Zander, explained that the "interruption was due to a bug that got triggered when a configuration change in the Azure Storage Front End component was made, resulting in the inability of the Blob Front-Ends to take traffic. The configuration change had been introduced as part of an Azure Storage update to improve performance as well as reducing the CPU footprint for the Azure Table Front-Ends. This change had been deployed to some production clusters for the past few weeks and was performing as expected for the Table Front-Ends. As part of a plan to improve performance of the Azure Storage Service, the decision was made to push the configuration change to the entire production service. The configuration change for the Blob Front-Ends exposed a bug in the Blob Front-Ends, which had been previously performing as expected for the Table Front-Ends. This bug resulted in the Blob Front-Ends to go into an infinite loop not allowing it to take traffic."
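The pattern described in the post, a change that behaved well on one kind of front end and was then pushed everywhere at once, is what staged rollouts with a health gate are meant to guard against. The Python sketch below illustrates the general technique under stated assumptions. It is not Microsoft's deployment system: the Cluster type, the staged_rollout function, the "fast_path" setting and the health check are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Cluster:
    name: str
    role: str  # illustrative label, e.g. "table" or "blob"
    config: Dict[str, str] = field(default_factory=dict)


def staged_rollout(
    stages: List[List[Cluster]],
    change: Dict[str, str],
    is_healthy: Callable[[Cluster], bool],
) -> List[str]:
    """Apply `change` one stage at a time and halt at the first unhealthy stage.

    The failing stage is rolled back; later stages are never touched.
    Returns the names of the clusters that kept the change.
    """
    kept: List[str] = []
    for stage in stages:
        # Remember each cluster's previous settings so the stage can be undone.
        previous = {c.name: dict(c.config) for c in stage}
        for c in stage:
            c.config.update(change)
        if not all(is_healthy(c) for c in stage):
            for c in stage:
                c.config = previous[c.name]
            break
        kept.extend(c.name for c in stage)
    return kept


# Hypothetical scenario: the change is harmless on "table" clusters
# but breaks "blob" clusters.
def healthy(c: Cluster) -> bool:
    return not (c.role == "blob" and c.config.get("fast_path") == "on")


table_1 = Cluster("table-1", "table")
blob_1 = Cluster("blob-1", "blob")
blob_2 = Cluster("blob-2", "blob")

kept = staged_rollout([[table_1], [blob_1], [blob_2]], {"fast_path": "on"}, healthy)
print(kept)
```

In this toy scenario the rollout stops at the first blob cluster, so the second one is never changed. The key design choice is that each stage includes every role the change will eventually reach, so a role-specific bug surfaces on a small slice rather than across the whole service.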
IT Operations in the Cloud
IT operations in the cloud introduce many configuration management challenges. It is reasonable to expect that the percentage of faulty changes, and the time each change takes, will decrease in the cloud; in absolute numbers, however, the same number of issues remains. Furthermore, incident response becomes a great deal more complex: infrastructure abstraction limits visibility into the cloud, which alters the very fabric of incident response. The rapid pace of change supported by cloud computing makes it a major challenge for the enterprise to drive high-powered change while remaining firmly in control.
CRN added, "The widespread nature of the latest Azure outage is a cause for concern, according to Lydia Leong, a vice president and distinguished analyst at Gartner. Leong tweeted on Tuesday evening: 'Microsoft's disastrous inability to keep Azure outages confined to a single region is a major red flag for enterprises considering Azure.'"
Recent Cloud Outages
Vendors in cloud computing have not been immune to service issues. Recently, Joyent suffered a major outage (Admin Error Brings Down Joyent's Ashburn Data Center), leaving many customers offline for hours. Other cloud-based operations (Facebook Outage Caused by Software System Update) raise concern about the cloud's reliability.
Ben Kepes wrote in Forbes, "The concern for customers here is that Microsoft's testing would appear to be a little substandard. While the 'flighting' approach would seem to be a good idea, in this case it appears to have been ineffective in identifying potential issues." (Microsoft Delivers A Post Mortem--The Reasons Behind The Global Azure-Alypse).
DataCenter Dynamics added a further angle with its report, Operator Error Made Azure Outage Worse.
InfoWorld's Caroline Craig, in her smartly titled piece, In a cloud outage, no one can hear you scream, raised another side of the story, saying, "Microsoft promised its next update ... in 60 minutes. When experiencing the failure of a critical service, an hour spent without a status update can seem like an eternity. Even worse, as Infosys reported a day before the Azure fail, 'over 80 percent of large organizations use or plan to use mission-critical applications on cloud.'"