
Misconfiguration Strikes Again Setting Off Google Apps Outage




This week, Google Apps customers reported outages affecting Gmail for Google Apps, Google Drive, Documents, Spreadsheets, Presentations and the Admin control panel/API.

ZDNet reported, "Twitter lit up with people complaining about various Google services being down. At CBS Interactive, where we run Google Apps, the search giant's services also turned up a server error."

CNET posted "Google's online e-mail, file storage, and other services ran into a spate of problems this morning, according to Google's status page -- and lots of frustrated users."

NetworkWorld emphasized that "A malfunctioning log-in system affected millions of people's ability to access a variety of Google applications on Wednesday, including Gmail and Drive."

There were tons of complaints on Twitter, Google+ and Facebook, and others raised the question: "Google Apps down. How do we function?"

What Caused This Outage?

As data center infrastructure becomes more complex and the pace of change increases, keeping track of every configuration change becomes harder.

Google attributed the outage to "a misconfiguration of a user authentication system" that "caused a fraction of the login requests to be unintentionally concentrated on a relatively small number of servers. At the time the misconfiguration occurred, monitoring systems detected a load increase and alerted Google Engineering at 1:08 a.m. PT on April 17. However, the alert cleared and the authentication system operated normally under the current load conditions."

The effects of the misconfiguration did not end when the 1:08 a.m. alert cleared. Google described a deteriorating situation: "At 5:00 a.m. as login traffic increased, the misconfigured servers were unable to process the load. This began to cause errors for some users logging in to Google services. The request load, exacerbated by retry requests from users and automated systems such as IMAP clients, initially appeared as the cause of the login errors."
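To make the mechanism concrete, here is a minimal sketch of how concentrating a share of login traffic on a few servers can look harmless overnight and then fail as traffic ramps up. Every number and name in it (fleet size, per-server capacity, the 20% share, `per_server_load`) is a hypothetical chosen for illustration; Google did not publish these figures.

```python
def per_server_load(total_rps, n_servers, hot_servers, hot_share_pct):
    """Return (load per hot server, load per other server) in requests/sec.

    hot_share_pct percent of all login traffic is pinned to `hot_servers`
    machines; the remainder is spread evenly over the other servers.
    """
    hot_traffic = total_rps * hot_share_pct / 100
    hot = hot_traffic / hot_servers
    other = (total_rps - hot_traffic) / (n_servers - hot_servers)
    return hot, other


def hot_servers_overloaded(total_rps, capacity_rps, **layout):
    """True when the servers receiving the concentrated traffic exceed capacity."""
    hot, _ = per_server_load(total_rps, **layout)
    return hot > capacity_rps


# Hypothetical fleet: 100 servers, 500 requests/sec each, 20% of logins
# misrouted to just 5 of them. None of these figures come from Google.
LAYOUT = dict(n_servers=100, hot_servers=5, hot_share_pct=20)
CAPACITY_RPS = 500

for label, rps in [("overnight", 10_000), ("morning ramp", 30_000)]:
    hot, other = per_server_load(rps, **LAYOUT)
    state = "OVERLOADED" if hot_servers_overloaded(rps, CAPACITY_RPS, **LAYOUT) else "ok"
    print(f"{label}: hot={hot:.0f} rps, other={other:.0f} rps, hot servers {state}")
```

In this made-up scenario the five hot servers carry 400 requests/sec overnight, below their 500 limit, so an alert on load could reasonably clear. On the morning ramp they carry 1,200 requests/sec each, even though the fleet as a whole could handle 50,000.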

Events like this highlight the pressure IT operations faces in managing configurations. Minor changes, both authorized and unauthorized, can slip into complex systems at any time. As this incident shows, even an infrastructure the size of Google's can be undone by what seems a minor change. Any small misconfiguration, or the omission of a single configuration parameter, can push a stable system into an incident state, resulting in an outage, a damaged reputation, angry customers, legal liabilities and financial losses.
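One simple way to catch a changed or omitted parameter is to compare the running configuration against a known-good baseline. The sketch below assumes both are available as flat dictionaries; the parameter names and the `diff_config` helper are hypothetical and do not represent any particular product's method.

```python
MISSING = "<missing>"  # marker for a parameter absent on one side


def diff_config(baseline, current):
    """Map each differing parameter to its (baseline, current) values."""
    changes = {}
    for key in sorted(baseline.keys() | current.keys()):
        old = baseline.get(key, MISSING)
        new = current.get(key, MISSING)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical parameter names and values, for illustration only.
baseline = {"auth.pool_size": 100, "auth.timeout_ms": 2000, "auth.retry_limit": 3}
current = {"auth.pool_size": 5, "auth.timeout_ms": 2000}

for name, (old, new) in diff_config(baseline, current).items():
    print(f"{name}: {old} -> {new}")
```

Here the comparison flags both the changed pool size and the omitted retry limit, while the unchanged timeout produces no noise.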

Google's Detailed Explanation

Google delivered an official explanation of the cause of the outage that affected Google Apps. PCWorld summed up the incident, explaining that "The problem, which lasted for about three hours on Wednesday morning, occurred when the main user-authentication system for Google applications was misconfigured. The improper configuration, introduced on Tuesday, caused log-in requests to be funneled to a small number of servers, which in turn ran out of capacity, and the overload caused them to malfunction."

Today's IT Operations

For the data center, change is a constant. Despite IT's efforts to manage change, doing so gracefully and efficiently remains one of the most challenging aspects of IT operations, which makes change and configuration problems a chronic pain. Given the massive volume, velocity and variety of change and configuration data, this persistent problem is really a big data problem.

IT Operations Analytics

Today, as IT operations faces new challenges, IT operations leaders are looking for new ways to deliver more value to the business. Tools for effective decision making can improve the infrastructure and operations (I&O) team's ability to allocate resources and address complex activities. Learn how Evolven's new IT Operations Analytics approach delivers the intelligence that IT operations organizations crave, allowing them to turn piles of configuration data into actionable information.

Your Turn
Are YOU staying on top of configuration changes?

About the Author
Martin Perlin