Configuration Glitch Sets Off Google Outage
A broad outage knocked out Gmail and a slew of other Google Web applications on January 24th, leading many affected users to flood Twitter, other social media sites and discussion forums with complaints.
Mashable reported, "Judging from the response on Twitter, the outage appeared to be worldwide and affected other Google services as well. Hangouts and Google+ both stopped working; some Google Drive users were reporting connection issues, too."
ComputerWorld reported, "The company later reported on the Apps Status at around 3 p.m. that more than 10 other services were also having problems, including Calendar, Talk, Drive, Docs, Sites, Groups, Voice and Google+ Hangouts."
CNET emphasized that "Google quickly updated its apps status dashboard to reflect that Gmail was down. Officially, the company flagged the outage as a "service disruption," and not a "service outage," although that's probably little consolation to people who weren't able to access their Gmail."
There were tons of complaints on Twitter, and even Yahoo weighed in. The glitch also caused thousands of emails to be sent to one man's Hotmail account.
What Caused This Outage?
Data center infrastructures have evolved, with new technologies being added to further optimize and secure the environment. Yet the net result has been a high degree of complexity that limits IT's ability to respond to changing business requirements.
Google attributed the outage to a faulty configuration: "An internal system that generates configurations — essentially, information that tells other systems how to behave — encountered a software bug and generated an incorrect configuration. The incorrect configuration was sent to live services over the next 15 minutes, caused users' requests for their data to be ignored, and those services, in turn, generated errors."
The configuration issue that set in motion the sudden crash of multiple Google services was widely felt. Google's account continued: "Users began seeing these errors on affected services at 11:02 a.m., and at that time our internal monitoring alerted Google's Site Reliability Team. Engineers were still debugging 12 minutes later when the same system, having automatically cleared the original error, generated a new correct configuration at 11:14 a.m. and began sending it; errors subsided rapidly starting at this time. By 11:30 a.m. the correct configuration was live everywhere and almost all users' service was restored."
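The timeline in Google's statement can be turned into simple duration figures. The short sketch below only does arithmetic on the three clock times quoted above; the variable and function names are illustrative and not taken from Google's report.

```python
from datetime import datetime

FMT = "%H:%M"

# Clock times quoted in Google's statement (all a.m., same day).
errors_began = datetime.strptime("11:02", FMT)
fix_started = datetime.strptime("11:14", FMT)
fully_restored = datetime.strptime("11:30", FMT)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed between two timestamps."""
    return int((end - start).total_seconds() // 60)


# Time until the corrected configuration started going out.
time_to_fix = minutes_between(errors_began, fix_started)
# Time until the corrected configuration was live everywhere.
total_disruption = minutes_between(errors_began, fully_restored)

print(f"Correct configuration began rolling out after {time_to_fix} minutes")
print(f"Service was restored for almost all users after {total_disruption} minutes")
```

On these figures, the corrected configuration started going out 12 minutes after errors began, and the disruption lasted roughly 28 minutes from first error to full restoration.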
With downtime and outages due to configuration issues still making headlines, the chronic nature of change and configuration management challenges is more apparent than ever. Minor IT changes, both authorized and unauthorized, can slip into complex systems at any time. As this incident shows, even an infrastructure the size of Google's can be taken down by a relatively minor change. Any minute misconfiguration, or the omission of a single configuration parameter, can push a stable system into an incident state, leading to an outage that harms reputation, angers customers, and can carry serious financial consequences.
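One common way to limit the damage from a bad configuration is to validate it before release and then push it out in stages, stopping when a stage looks unhealthy. The sketch below illustrates that general idea only; it is not a description of Google's systems. The required keys, stage names, the `error_rate` callback and the 1% threshold are all hypothetical.

```python
from typing import Callable, Dict, List

Config = Dict[str, object]

# Hypothetical required parameters; a real system would define its own schema.
REQUIRED_KEYS = {"backend", "timeout_ms"}


def validate(config: Config) -> List[str]:
    """Return a list of problems; an empty list means the config passed."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(config))]
    timeout = config.get("timeout_ms")
    if timeout is not None and (not isinstance(timeout, int) or timeout <= 0):
        problems.append("timeout_ms must be a positive integer")
    return problems


def staged_rollout(
    config: Config,
    stages: List[str],
    error_rate: Callable[[str, Config], float],
    max_error_rate: float = 0.01,
) -> Dict[str, object]:
    """Apply config one stage at a time, halting at the first unhealthy stage."""
    problems = validate(config)
    if problems:
        # Reject before anything reaches a live service.
        return {"status": "rejected", "problems": problems, "applied": []}
    applied: List[str] = []
    for stage in stages:
        applied.append(stage)
        if error_rate(stage, config) > max_error_rate:
            return {
                "status": "halted",
                "problems": [f"error rate too high in {stage}"],
                "applied": applied,
            }
    return {"status": "complete", "problems": [], "applied": applied}


stages = ["canary", "region-1", "region-2"]

# A config missing a single parameter is rejected before rollout.
rejected = staged_rollout({"backend": "store-a"}, stages, lambda s, c: 0.0)

# A valid config that misbehaves in one stage stops there.
halted = staged_rollout(
    {"backend": "store-a", "timeout_ms": 500},
    stages,
    lambda s, c: 0.5 if s == "region-1" else 0.0,
)
print(rejected["status"], halted["status"], halted["applied"])
```

The point of the design is that a single omitted parameter is caught before it reaches any live service, and a configuration that passes validation but still causes errors affects only the stages reached so far.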
Today's IT Operations
For the data center, change is a constant. IT teams must support a wider variety of applications running on distinct platforms, and they face increasingly complex operations in managing today's enterprise data centers as they fight change and configuration management problems that affect performance and availability.
IT Operations Analytics
As IT operations teams face new levels of challenges, IT Operations Analytics (ITOA) platforms are emerging to enrich a wide variety of IT management use cases. An ITOA platform can feed application performance data to a central event management system to simplify monitoring and hasten root cause discovery; provide granular detail on configuration changes in IT environments and address configuration drift; and identify suspicious changes that can reveal performance risks.
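Configuration drift detection, at its simplest, means comparing a system's current settings against an approved baseline and reporting what was added, removed or changed. The following is a minimal sketch of that comparison, not the behavior of any particular ITOA product; the function name and the sample parameters are made up for illustration.

```python
from typing import Any, Dict


def detect_drift(baseline: Dict[str, Any], current: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Compare a current config against a baseline and report the differences."""
    added = {k: current[k] for k in sorted(current.keys() - baseline.keys())}
    removed = {k: baseline[k] for k in sorted(baseline.keys() - current.keys())}
    changed = {
        k: {"baseline": baseline[k], "current": current[k]}
        for k in sorted(baseline.keys() & current.keys())
        if baseline[k] != current[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical settings for a single server.
baseline = {"max_connections": 500, "tls": True, "timeout_ms": 3000}
current = {"max_connections": 500, "tls": False, "debug": True}

report = detect_drift(baseline, current)
for kind, entries in report.items():
    print(kind, entries)
```

In this example the report flags `debug` as added, `timeout_ms` as removed and `tls` as changed, which is the kind of granular detail an operator would need to decide whether a change was authorized.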