Google Unavailable due to Configuration Change
Some Google cloud users experienced disruptions for 45 minutes beginning Sunday, March 8, at around 10 a.m. PST, and were left scratching their heads about what was happening to their Google services. Services such as Google search and Gmail stopped working for users around the world, including in the UK, the Netherlands, Iceland, France, and India. For some, Google services went down altogether, while others suffered intermittent outages. Explaining what happened, Google confirmed that "The root cause of the packet loss was a configuration change introduced to the network stack." Users vented their frustration and publicized the issue on Twitter.
Google Docs, please wake up this morning #downtime— Gary Ng (@gary_ng) March 13, 2015
Google Outage: Internet Traffic Plunges 40% http://t.co/zDst6RnPrf— ICBINGP (@ICBINGP) March 15, 2015
For complex distributed systems like Google's IT infrastructure, change happens a lot. Applications are rolled out, infrastructure is upgraded, and configurations are sometimes changed by hand. These activities are usually performed properly, but sometimes they don't turn out as expected, and the cause may surface only when a failure occurs. Since functioning systems don't stop working for no reason, change is at the heart of most issues and problems. Many types of change (infrastructure, application, workload, data, and so on) can disrupt smooth operations and expose business systems to risk. Every time a change occurs in the infrastructure, whether it is the deployment of new hardware or applications or just a minor adjustment, the stability of the IT environment may be affected. An organization that can manage change on a continuous basis gains the confidence it needs to help ensure that its services stay available.
What Caused the Outage?
Google's outage was due to a configuration change in the network stack. Google explained: "The root cause of the packet loss was a configuration change introduced to the network stack designed to provide greater isolation between VMs and projects by capping the traffic volume allowed by an individual VM. The configuration change had been tested prior to deployment to production without incident. However as it was introduced into the production environment it affected some VMs in an unexpected manner." Valuable time was lost. The Inquirer reported that "Google engineers became aware of the problem some 18 minutes after the packet loss made itself known. Fixing it meant rolling back to an earlier configuration." Configuration changes are often cited as the cause of outages: as this incident shows, seemingly minor IT changes, whether authorized or unauthorized, can impact performance and availability.
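To make the idea of pinpointing a configuration change concrete, here is a minimal sketch in Python that compares two configuration snapshots and lists what differs between them. It is only an illustration and not Google's tooling: the function name diff_config, the setting names, and the values are all hypothetical, chosen to loosely mirror the per-VM traffic cap described above.

```python
from typing import Any, Dict, List, Tuple

# One change record: (dotted path to the setting, old value, new value).
Change = Tuple[str, Any, Any]

_MISSING = object()  # sentinel for "key not present in this snapshot"


def diff_config(old: Dict[str, Any], new: Dict[str, Any], prefix: str = "") -> List[Change]:
    """Return every setting that differs between two nested config snapshots."""
    changes: List[Change] = []
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}.{key}" if prefix else key
        before = old.get(key, _MISSING)
        after = new.get(key, _MISSING)
        if isinstance(before, dict) and isinstance(after, dict):
            # Both sides are sections: recurse and report changes inside them.
            changes.extend(diff_config(before, after, path))
        elif before != after:
            # Report added or removed keys with None on the missing side.
            changes.append((
                path,
                None if before is _MISSING else before,
                None if after is _MISSING else after,
            ))
    return changes


# Hypothetical snapshots taken before and after a rollout.
previous = {"network": {"isolation": "basic", "queue_depth": 64}}
current = {"network": {"isolation": "basic", "queue_depth": 64,
                       "per_vm_traffic_cap_mbps": 1000}}

changes = diff_config(previous, current)
for path, before, after in changes:
    print(f"{path}: {before!r} -> {after!r}")
```

A report like this, taken at the moment packet loss begins, narrows the search to the settings that actually changed, and the previous snapshot doubles as the rollback target.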
IT Operations Analytics
Google is not alone: IT Operations teams everywhere face new challenges in change and configuration management because of the complexity and dynamics of today's environments. As the Google outage shows, organizations suffer public embarrassment as well as damage to their bottom line when services are not available. With the explosion of operational data, IT Operations Analytics solutions are emerging to provide insight into operations performance. Gartner analyst Will Cappelli recognized IT Operations Analytics as an area on the rise, saying, "We have taken a position that by 2018, 25% of Global 2000 will have deployed an IT Operations Analytics platform taking data feeds from variety of performance and availability systems, that's up from about 2% today."