15 Years of Chronic Change and Configuration Challenges
Microsoft. Bank of New York. Google. Amazon Web Services. Microsoft Azure. TD Canada Trust.
Looking back over the years, many big names across industries have not only suffered painful outages but have also struggled to restore confidence in their services and repair the damage to their reputations.
More and more enterprises have grown dependent on their data centers to support business-critical applications, and data center complexity has grown with them, making IT a high-pressure business. Despite advances in infrastructure robustness, hardware, software and database downtime still occurs. Moreover, despite large investments of personnel time and budget dollars, manual configuration errors resulting in application downtime cost companies thousands, and even millions, of dollars. Infrastructure vulnerabilities, misconfigurations, authorized and unauthorized changes, and even simple mistakes add to the risk of costly downtime events. Data center reliability is critical to business operations, so it's alarming that according to the State of the Data Center Survey Global Results (Symantec, Sept. 2012): "The typical organization in the report experienced an average of 16 data center outages over the past 12 months, at a cost of $5.1 million. The most common cause of downtime was systems failures, followed by human error and natural disasters."
With data center complexity a main culprit for downtime, IT has come to live in dread of failures. Small events are bad enough, but big ones suck the life out of IT staff. One of the most immediate costs of system downtime is damage to corporate image, and while the impact varies greatly by business, for some companies the damage goes beyond monetary valuation.
IT teams face many issues they have to stay on top of in order to maintain top performance and availability, and the news headlines have remained remarkably consistent over the last 15 years. Change and configuration management challenges are chronic: while the company names have changed, the bottom line remains the same. These challenges lead to critical operational issues that can even make the news. Just look at these representative examples, compiled from every year of the last 15, of failures stemming from infrastructure or application issues that spiraled out of control.
2013. Misconfiguration Strikes Again Setting Off Google Apps Outage.
The problem, which lasted for about three hours on Wednesday morning, occurred when the main user-authentication system for Google applications was misconfigured.
2012. Critical Change Leaves Facebook Out of Reach.
Facebook went down due to a change made to its infrastructure. In complex, dynamic ecosystems such as Facebook's IT infrastructure, change happens constantly. On any given day infrastructure is being upgraded, patches are being installed, automated processes are altering files and system environments, and configurations are being changed manually. Sometimes these activities are performed correctly and ... sometimes they're not.
2012. Merging United and Continental Computer Systems Grounds Passengers.
In one of the final steps in merging the two airlines, United reported technical issues after its Apollo reservations system was switched over to Continental's Shares program. United struggled through at least three days of higher call volumes after the meshing of the systems and websites caused problems with some check-in kiosks and frequent-flier mileage balances. The glitch was another in a long string of technology problems that began in March.
2012. GMail Crashes Following Configuration Change.
According to Google, the outage was caused by the failure of Google's Sync Server, which relies on a component to enforce quotas on per-datatype sync traffic. The quota service "experienced traffic problems today due to a faulty load balancing configuration change."
2011. Amazon outage sends prominent Web sites offline, including Quora, Foursquare and Reddit.
Amazon has released a detailed postmortem and mea culpa about the partial outage of its cloud services platform last week and identified the culprit: A configuration error made during a network upgrade.
2010. Massive failure knocks Singapore's DBS Bank off the banking grid for seven hours.
A faulty component within the disk storage subsystem serving the bank's mainframe was generating periodic alert messages, so a job was scheduled to replace it at 3 a.m. that fateful day. The situation spiraled out of control as a direct result of human error during the routine operation.
2009. Widespread trouble with Google Apps service.
Google Search and Google News performance slowed to a crawl, while an outage seemed to spread from Gmail to Google Maps and Google Reader. Comments about the failure were flying on Twitter, with "googlefail" quickly becoming one of the most searched terms on the popular micro-blogging site.
2008. Gmail outage lasts about 30 hours.
The first problem reports started appearing in the official Google Apps discussion forum around mid-afternoon Wednesday. At around 5 p.m. that day, Google acknowledged that the company was aware of a problem preventing Gmail users from logging into their accounts and that it expected a solution by 9 p.m. on Thursday.
2007. Skype is down.
Skype advised that their engineering team had determined that the downtime was due to a software issue, with the problem expected to be solved "within 12 to 24 hours."
2007. Some Amazon EC2 customer instances were terminated and unrecoverable.
A software deployment caused management software to erroneously terminate a small number of users' instances. When monitoring detected the issue, the EC2 management software and APIs were disabled to prevent further terminations.
2005. Faulty database derails Salesforce.com.
A Salesforce.com outage lasting nearly a day cut off access to critical business data for many of the company's customers in what appeared to be Salesforce's most severe service disruption to date.
2004. Unscheduled software upgrade grinds UK's Department for Work and Pensions to a halt.
Some 40,000 computers in the U.K.'s Department for Work and Pensions (DWP) were unable to access their network last month when an IT technician erroneously installed a software upgrade.
2003. Glitch in upgrade of AT&T Wireless CRM system causes an outage.
AT&T Wireless Services Inc. this week faced the software nightmare every IT administrator fears: An application upgrade last weekend went awry, taking down one of the company's key account management systems.
2002. New high-bandwidth application triggers outage.
The network grew very quickly due to business changes and was never redesigned to cope with the much larger scale and new application requirements.
2001. For 12 hours, TD Canada Trust's 13 million customers couldn't touch their money.
TD Canada Trust ran full-page apologies in newspapers across the country Monday, saying it was sorry for a weekend computer crash that left millions of its customers unable to access their accounts. The bank said the outage was caused by "a rare and isolated hardware problem".
2000. Software-upgrade glitch leaves flights on the tarmac.
The Federal Aviation Administration (FAA) had called a halt on all flights scheduled to land at or depart from Los Angeles International Airport for four hours that morning. Technicians loading an upgrade to radar software at the Los Angeles air-traffic control center caused a mainframe host computer to crash.
1999. Configuration error brings down Schwab's online trading.
A brownout at Charles Schwab & Co.'s online brokerage last week delivered potent lessons to managers of transaction-intensive Web sites. A configuration error with a new mainframe, added to increase capacity, brought down Schwab's trading system for about one hour as the stock market opened on Wednesday.