Configuration Change Knocks out Gmail
Google's web-based email service, Gmail, crashed on Monday afternoon with widespread outage reports coming from around the world.
DataCenter Knowledge reported, "On at least three occasions, Gmail downtime has been traced back to software updates in which bugs triggered unexpected consequences."
HuffPost Tech posted "PANIC: Gmail is down, according to perplexed Twitterers and HuffPostTech's own experience trying to access unresponsive accounts."
GigaOM emphasized that "people on Twitter are reporting Chrome crashes left and right, and some of those with more experience on the systems administration and development side are blaming Chrome's crashes on whatever is plaguing Gmail."
Twitter users were quick to comment on the situation, some with humor-filled updates.
What Caused Such a Widespread Outage?
As data center infrastructure becomes more complex and the pace of change increases, the challenge of keeping track of every change grows.
According to Google, the outage has been attributed to Google's Sync Server, which relies on a component to enforce quotas on per-datatype sync traffic. That component failed: the quota service "experienced traffic problems today due to a faulty load balancing configuration change."
Troubles like this clearly highlight the pressure that configuration management is under. Minor changes, both authorized and unauthorized, slip into systems all the time. As this incident shows, even an infrastructure the size of Google's can be undermined by a seemingly minor change. Any small misconfiguration, or the omission of a single configuration parameter, can spiral into an incident that causes an outage, with consequences that include a damaged reputation, angry customers, legal liabilities, and financial losses.
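To make the point about a single omitted or altered parameter concrete, here is a minimal sketch of a configuration drift check in Python. It compares a known-good baseline against the configuration currently in effect and reports what is missing, added, or changed. The function name `diff_config` and the load balancer settings are hypothetical, invented purely for illustration; they do not describe Google's systems or any particular tool.

```python
def diff_config(baseline: dict, current: dict) -> dict:
    """Compare two flat configuration mappings and report any drift."""
    # Parameters present in the baseline but absent from the current config.
    missing = sorted(k for k in baseline if k not in current)
    # Parameters that appear only in the current config.
    added = sorted(k for k in current if k not in baseline)
    # Parameters present in both, mapped to (baseline value, current value).
    changed = {
        k: (baseline[k], current[k])
        for k in baseline
        if k in current and baseline[k] != current[k]
    }
    return {"missing": missing, "added": added, "changed": changed}


# Hypothetical load balancer settings (assumed values, not real data).
baseline = {"pool_size": 8, "health_check_interval_s": 5, "max_connections": 1000}
current = {"pool_size": 8, "max_connections": 100}

report = diff_config(baseline, current)
print(report)
```

In this made-up example, one parameter has been dropped and another has lost a zero. Either would be easy to miss in a manual review, which is why a check like this is usually run automatically before and after every change.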