Misconfiguration Strikes Again, Setting Off Google Apps Outage
This week, Google Apps customers reported outages affecting Gmail for Google Apps, Google Drive, Documents, Spreadsheets, Presentations, and the Admin control panel/API.
ZDNet reported, "Twitter lit up with people complaining about various Google services being down. At CBS Interactive, where we run Google Apps, the search giant's services also turned up a server error."
CNET posted "Google's online e-mail, file storage, and other services ran into a spate of problems this morning, according to Google's status page -- and lots of frustrated users."
NetworkWorld emphasized that "A malfunctioning log-in system affected millions of people's ability to access a variety of Google applications on Wednesday, including Gmail and Drive."
Users flooded Twitter, Google+ and Facebook with complaints, and some raised the question: Google Apps down. How do we function?
What Caused This Outage?
As data center infrastructure grows more complex and the pace of change increases, keeping track of every configuration change becomes harder.
Google attributed the outage to "a misconfiguration of a user authentication system and caused a fraction of the login requests to be unintentionally concentrated on a relatively small number of servers. At the time the misconfiguration occurred, monitoring systems detected a load increase and alerted Google Engineering at 1:08 a.m. PT on April 17. However, the alert cleared and the authentication system operated normally under the current load conditions."
The effects of the misconfiguration didn't end at 1:08 a.m. Google went on to describe a deteriorating situation: "At 5:00 a.m. as login traffic increased, the misconfigured servers were unable to process the load. This began to cause errors for some users logging in to Google services. The request load, exacerbated by retry requests from users and automated systems such as IMAP clients, initially appeared as the cause of the login errors."
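To see why an alert can clear overnight and the same configuration still fail a few hours later, here is a minimal Python sketch. It is not Google's system: the server names, routing weights, capacity and traffic figures are all made up for illustration. It only shows that a skewed routing config can stay under a per-server limit at low traffic and exceed it once traffic rises.

```python
def overloaded_servers(weights, total_rps, capacity_rps):
    """Return the servers whose share of traffic exceeds their capacity.

    weights: mapping of server name -> routing weight (hypothetical).
    total_rps: total login requests per second arriving at the pool.
    capacity_rps: requests per second a single server can handle.
    """
    total_weight = sum(weights.values())
    overloaded = []
    for name, weight in weights.items():
        # Each server receives traffic in proportion to its weight.
        share = total_rps * weight / total_weight
        if share > capacity_rps:
            overloaded.append(name)
    return sorted(overloaded)


# Hypothetical pool of ten servers. The intended config spreads load evenly;
# the misconfigured one concentrates most requests on two servers.
even = {f"auth{i}": 1 for i in range(10)}
skewed = {f"auth{i}": (20 if i < 2 else 1) for i in range(10)}

CAPACITY = 1000  # made-up per-server limit, requests per second
NIGHT = 1500     # made-up off-peak login traffic
MORNING = 6000   # made-up peak login traffic

print(overloaded_servers(even, MORNING, CAPACITY))    # healthy config at peak
print(overloaded_servers(skewed, NIGHT, CAPACITY))    # skewed config, low traffic
print(overloaded_servers(skewed, MORNING, CAPACITY))  # skewed config at peak
```

With these assumed numbers, the skewed config reports no overloaded servers at night, which is consistent with an alert clearing, and two overloaded servers in the morning. A check that evaluates the config against projected peak traffic, rather than current traffic, would flag the problem earlier.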
Events like this highlight the pressure IT operations faces in managing configurations. Minor changes, both authorized and unauthorized, can slip into complex systems at any time. As this incident shows, even an infrastructure the size of Google's can be undone by what seems a minor change. A single misconfigured or omitted parameter can push a stable system into an incident state, resulting in an outage, a damaged reputation, angry customers, legal liabilities, and even financial losses.
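One common way to catch such changes is to compare the running configuration against an approved baseline. The following Python sketch shows the idea under simple assumptions: configurations are flat key/value dictionaries, and the parameter names are hypothetical rather than taken from any real product.

```python
def config_drift(baseline, current):
    """Compare a current config against an approved baseline.

    Returns changed values, parameters missing from the current config,
    and parameters that were added without being in the baseline.
    """
    shared = baseline.keys() & current.keys()
    changed = {
        key: (baseline[key], current[key])
        for key in shared
        if baseline[key] != current[key]
    }
    missing = sorted(baseline.keys() - current.keys())
    added = sorted(current.keys() - baseline.keys())
    return {"changed": changed, "missing": missing, "added": added}


# Hypothetical parameter names and values, for illustration only.
approved = {"pool_size": 10, "retry_limit": 3, "timeout_s": 30}
running = {"pool_size": 2, "timeout_s": 30, "debug": True}

report = config_drift(approved, running)
print(report)
```

Here the report would show pool_size changed from 10 to 2, retry_limit omitted, and debug added. Run on a schedule or before each deployment, a comparison like this surfaces both a modified value and a dropped parameter before they reach production traffic.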