Downtime, Outages and Failures - Understanding Their True Costs
When it comes to mission-critical applications and data center performance, companies spend a lot of cash to see results. However, the investment doesn't always deliver the hoped-for outcome.
Confronting system downtime
Despite all of the effort invested in infrastructure robustness, many IT organizations, including huge enterprises, still face database, hardware, and software downtime that lasts anywhere from a few minutes to several days and can completely incapacitate the business.
The world of IT failure is strange, if you ask me. Despite the advanced solutions and the mounting statistics that touch nearly every major enterprise software vendor and customer, from ERP to CRM and more, just bringing up the topic of outages still terrifies those in the industry. Yes, terrifies. But somehow IT failures have become an accepted, even expected, aspect of enterprise life. How come?
IT downtime revisited
So, while IT professionals have to confront downtime from time to time, and are focused on trying to get on top of it, the business organization as a whole suffers from the ‘financial pain’ of downtime.
In the past, we took an in-depth look at the multiple ways that IT downtime can badly impact the bottom line of enterprises (you can read more about it here - Cost and Scope of Unplanned Outages). We considered parameters from direct loss of revenues through reputation damage to decrease in productivity.
Now, I wish to revisit the issue and see how organizations of any size should address and assess threats to their IT operations, including systems, applications, and data, by looking at solid numbers that represent the potential costs behind downtime and outages.
Measuring big brand failures
When should the industry start measuring big-brand outages, such as the one that recently hit Facebook, the one that hit hundreds of thousands of Lloyds Bank customers, or the Jetstar outage that took out check-in and resulted in flight delays?
In other words, at what point is an outage ‘significant enough’ so that a cost analysis becomes valuable to the industry, in order to learn from it and predict the future impact of other outage incidents?
Downtime costs vary significantly between industries, especially due to the different implications of downtime on internal and external processes. The affected business size is obviously a critical factor, but it is not the only one. Setting a numerical value behind an IT outage means predefining its implications across multiple business and organizational aspects.
A failure of a critical application can lead to a couple of loss categories:
- Loss of the application service – the impact of downtime varies according to the application and the business.
- Loss of data – the potential loss of data due to a system outage can have significant legal and financial implications.
Now, everyone would surely agree that today's data centers should never go down; applications should be available around the clock, and internal as well as external end users worldwide must be able to rely on the data center’s availability (for critical data and applications) at any time. Yes, that’s the kind of standard you would expect these days. But still, in the back office (meaning inside the data center), reality is different. You can’t expect things to never break down.
The worst system outage nightmare ever? Probably the one that happened to you…
Some outage incidents turned into PR catastrophes, like the infamous Virgin Blue debacle from 2010, or the recent Facebook one (mentioned above).
You can imagine how mad the customers of Virgin Blue were when they couldn't board their scheduled flights, during an outage that lasted up to 11 long days. The event fired up plenty of negative press, and cost the company millions.
The result: Virgin Blue's reservations management company, Navitaire, ended up compensating Virgin Blue for more than $20 million. (Navitaire booking glitch earns Virgin $20M in Compo)
But there are tons of other, ‘lesser’ incidents that still manage to capture the attention of the media. Here’s a recent article by USA Today about the Wells Fargo outage that prevented customers from accessing their accounts for many hours.
But let’s not forget that the loss of others feels completely different from a loss that pays you a visit.
In that sense, we can safely say that anyone in the IT industry would agree that outages or downtimes are VERY bad for business. They are unwanted, very harmful financially, must be fought against using all available resources and shouldn’t be underestimated.
Misconfigurations have a major impact on performance
The IT Process Institute's Visible Ops Handbook reported in the past that "80% of unplanned outages are due to ill-planned changes made by administrators ("operations staff") or developers." (Visible Ops).
Getting to the bottom of the matter, Enterprise Management Associates reports that 60% of availability and performance errors are the result of misconfigurations. Yes, those small changes that are constantly made to environments and system configuration parameters.
Downtime can cost companies $5,600 per minute, which adds up to more than $300,000 per hour of Web application downtime (according to a 2014 Gartner analysis).
[Figure: Average hourly cost of enterprise server downtime, worldwide, 2017-2018]
While application maintenance costs are increasing at an annual rate of 20%, a previous industry survey revealed that at least one-quarter of the downtime reported by respondents was caused by configuration errors. (How much will you spend on application downtime this year?)
How common are downtimes or outages?
Source: Data Center Knowledge
Production and application downtime costs made clear
Unplanned outages are the responsibility of IT to resolve. Nevertheless, at the end of the day, these outages are essentially business issues that impact the entire organization.
An important part of a thorough evaluation process is to calculate how much money you will lose per hour (or minute, or any other time increment of your choice) of downtime.
For enterprises with revenue models that depend solely on the data center's ability to deliver IT and networking services to customers – such as telecommunications service providers or e-commerce companies – downtime can be particularly costly, with the highest cost of a single event topping $1 million (more than $11,000 per minute).
In a previous USA Today survey of 200 data center managers, over 80% reported downtime costs that exceeded $50,000 per hour. For over 25% of them, downtime cost exceeded $500,000 per hour.
According to another survey, while companies can't achieve zero downtime, one in every 10 companies said that their availability must be greater than 99.999%.
Source: Searchcio Techtarget
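To put an availability target such as 99.999% in perspective, it helps to translate it into a yearly downtime budget. The snippet below is a minimal sketch of that conversion; the function name `allowed_downtime_minutes` is hypothetical, and it assumes a 365-day year with systems expected to be available 24 hours a day.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = (100 - availability_pct) / 100
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare a few common availability targets, including "five nines".
for target in (99.0, 99.9, 99.99, 99.999):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% availability -> {budget:.2f} minutes of downtime per year")
```

Under these assumptions, "five nines" leaves only a little over five minutes of downtime per year, which is why so few organizations can commit to it.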
To get a firm understanding of the implications of production and release downtime, let's take a look at how the consequences of downtime are manifested.
Downtime cost - per year or per incident
A 2017 study revealed that out of 400 IT decision makers, 46% experienced more than four hours of IT-related downtime over the past 12 months; 23% said that they incurred costs ranging from $12,000 up to more than $1 million per hour. Over 35% admitted that they are unsure of the cost of an outage to their business.
If you ask Delta Air Lines, which had to cancel 280 flights due to an outage in 2017, the losses from a single incident can reach over $150 million.
Simply put, you can use industry averages to estimate your hourly cost of downtime. You can then lean on other benchmarks to predict the number of expected annual downtime hours, and simply multiply.
A couple of years ago, Dun & Bradstreet reported that 59% of Fortune 500 companies experience a minimum of 1.6 downtime hours per week.
This means that if you take the average Fortune 500 company (or a company that employs at least 10,000 people), and assume that it pays its employees an average of $56 per hour, including benefits ($40 per hour salary + $16 per hour in benefits), then the labor part of downtime costs for an organization of this size would reach $896,000 per week (10,000 employees x 1.6 hours x $56), which translates to more than $46 million per year. (Assessing The Financial Impact Of Downtime).
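The arithmetic behind this labor estimate can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and it assumes that every employee is idle for the full downtime and that there are 52 working weeks in the year.

```python
def weekly_labor_cost(employees: int, downtime_hours_per_week: float,
                      hourly_cost: float) -> float:
    """Labor cost of downtime per week, assuming all employees are idled."""
    return employees * downtime_hours_per_week * hourly_cost


# Figures quoted in the text above.
employees = 10_000
downtime_hours_per_week = 1.6
hourly_cost = 40 + 16  # $40 salary + $16 benefits per hour

weekly = weekly_labor_cost(employees, downtime_hours_per_week, hourly_cost)
yearly = weekly * 52  # assumption: 52 weeks per year

print(f"Weekly labor cost of downtime: ${weekly:,.0f}")
print(f"Yearly labor cost of downtime: ${yearly:,.0f}")
```

Swapping in your own headcount, loaded hourly rate, and measured downtime hours gives a first rough figure for the labor component alone; it does not cover lost revenue or reputation damage.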
How have things changed from the past?
So, we already know that downtimes and outage incidents still happen today, and the industry has yet to succeed in abolishing them. But how has their cost changed over time? Are they less harmful today?
In 2010, research by Coleman Parkes found that IT downtime incidents collectively cost businesses more than 127 million man-hours per year - an average of 545 man-hours per company - in employee productivity.
In 2009, it was reported that the average downtime costs vary considerably across industries, from approximately $90,000 per hour in the media sector to about $6.48 million per hour for large online brokerages. (How Much Does Downtime Really Cost?).
According to a survey of IT managers conducted during these years, companies are becoming more aware of the direct financial costs of computer downtime. The survey revealed that one in every five businesses loses $12,000 an hour through systems downtime. (Companies count the cost of IT failure)
As mentioned above, a 2014 Gartner analysis cited $5,600 per minute, or more than $300,000 per hour.
Even as early as 2004, a conservative estimate from Gartner pegged the hourly cost of downtime for computer networks at $42,000. Accordingly, a company that suffers from a worse-than-average downtime of 175 hours per year can lose more than $7 million annually. However, each outage affects each company differently, so it's important to know how to calculate the precise financial impact. (How to quantify downtime).
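The annual figure in the Gartner example is simply the hourly cost multiplied by the yearly downtime hours. A minimal sketch, with a hypothetical function name and the two numbers quoted above as inputs:

```python
def annual_downtime_cost(hourly_cost: float, downtime_hours_per_year: float) -> float:
    """Estimate yearly downtime cost as hourly cost times hours of downtime."""
    if hourly_cost < 0 or downtime_hours_per_year < 0:
        raise ValueError("inputs must be non-negative")
    return hourly_cost * downtime_hours_per_year


# Gartner's 2004 hourly estimate and the worse-than-average 175 hours per year.
cost = annual_downtime_cost(hourly_cost=42_000, downtime_hours_per_year=175)
print(f"Estimated annual downtime cost: ${cost:,.0f}")
```

Both inputs are industry averages here, so the result is only a starting point until you replace them with figures measured in your own organization.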
As the cost of outage only gets higher with time, you can understand why past data should be multiplied by a significant number in order to reflect today’s reality.
Every minute counts
Over ten years ago, the average cost of data center downtime across industries was valued at approximately $5,600 per minute (Unplanned IT Outages Cost More than $5,000 per Minute), a figure which, according to Gartner, remained the same until 2014. The aforementioned past study by the Ponemon Institute calculated the minimum, median, mean and maximum cost per minute of unplanned outages, based on input from 41 data centers. The greatest cost of an unplanned outage was found to exceed $11,000 per minute. On average, an unplanned outage is likely to cost more than $5,000 per minute.
Things are only getting worse
A 2013 study saw an uplift of over 41% from the past averages described above, to an average of more than $7,900 per minute.
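The 41% uplift can be checked against the two per-minute figures, $5,600 and $7,900, with a simple percentage-change calculation. The sketch below uses a hypothetical helper name and also converts the newer per-minute figure to an hourly one.

```python
def percent_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100


baseline_per_minute = 5_600  # earlier average cost per minute
later_per_minute = 7_900     # 2013 average cost per minute

uplift = percent_change(baseline_per_minute, later_per_minute)
hourly_equivalent = later_per_minute * 60

print(f"Uplift: {uplift:.1f}%")
print(f"Hourly equivalent of the 2013 figure: ${hourly_equivalent:,}")
```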
An ITIC survey from 2015 likewise showed that the hourly cost of downtime had increased by 25% to 30% compared to data from 2008!
Downtime per year
Past data: Gartner has calculated that downtime can reach 87 hours per year. Obviously that's the sum of many outages - anywhere from a few minutes to several hours. But at the end of the day, this becomes a staggering figure on an organizational level. (Average large corporation experiences 87 hours of network downtime a year).
How have things changed?
Another study from 2011 revealed that, although the industry has managed to successfully fight the downtime epidemic and decrease these averages, we are still seeing significant downtime hours and huge revenue losses.
Downtime impact on reputation and loyalty
How much is your reputation worth? This may be extremely difficult to assess, considering the long-term effect of a damaged reputation and its impact on revenue and profitability.
In this case, downtime costs include lost customers (both short and long term), and other tangible elements that reflect the costs of reputation impairment like stock downturns, marketing hours (crisis and brand recovery management) and media dollars required to reboot and polish up an organization's profile.
What parameters impact the calculation?
When trying to estimate the cost of downtime, there are the obvious direct costs (such as loss of business during downtime). There are, however, many indirect costs, such as the reputation issues discussed above, or employee overhead. Workforce overhead is derived from specific tasks that focus on getting the IT systems back up and running, the cost of being delayed with all other tasks, employee overtime expenses (if applicable), and more. Then there’s the value of data loss, emergency maintenance fees (particularly if the outage occurs during off hours), and additional repair costs that may continue long after the service has been restored.
Needless to say, you must calculate these costs when you estimate the implications of downtime, as they are usually very significant. Even a rough guesstimate can prove extremely beneficial for understanding the risks and deciding on the level of technology you need in order to fight them. And there’s also the issue of lost sales. For an accurate assessment of total lost sales, the impact percentage must be increased to reflect the real lifetime value of customers who permanently defect to a competitor. For instance, the Facebook (and WhatsApp) outage that I mentioned above led over 3 million users (apparently WhatsApp users) to migrate to Telegram. I got tons of notifications about my contacts joining Telegram during that time. I am pretty sure you did, too.
Intangible costs vary among organizations. Downtime can result in lost opportunity, shaken customer loyalty, damaged reputation, and lowered employee morale.
Though it's hard to put dollars behind so many of the parameters described, these indirect costs are substantial and significant. For instance, when Amazon.com went offline for several hours during its early days, its stock dropped by 25% in a single day (Cost-Unconscious: Denying the True Cost of Network Downtime)!
In another example, the fallout from the 2011 Amazon cloud outage added to fears about cloud security and downtime. As Amazon scrambled to get its cloud services back online, many customers questioned the reliability of the cloud, Amazon’s communication surrounding the outage, and whether they would be compensated for the downtime as part of their SLA. As for the SLA, despite the almost-four-day outage, Amazon's EC2 SLA was not breached (Seven lessons to learn from Amazon's outage).
The cost of downtime: Calculating it yourself
How much will you lose from an unexpected downtime of your servers or business applications?
According to multiple sources, the simplest way to calculate potential revenue losses during an outage is by using this equation:
LOST REVENUE = (GR/TH) x I x H

where:
- GR = gross yearly revenue
- TH = total yearly business hours
- I = percentage impact (the share of revenue affected by the outage)
- H = number of hours of outage
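The equation above is easy to turn into a small calculator. The snippet below is a sketch rather than a definitive model: the function name is hypothetical, I is taken to be the percentage impact of the outage on revenue (expressed as a fraction between 0 and 1), and the sample inputs are invented purely for illustration.

```python
def lost_revenue(gross_yearly_revenue: float, yearly_business_hours: float,
                 impact: float, outage_hours: float) -> float:
    """LOST REVENUE = (GR / TH) x I x H.

    impact is assumed to be a fraction between 0 and 1
    (for example, 0.5 if half of revenue is affected during the outage).
    """
    if yearly_business_hours <= 0:
        raise ValueError("yearly_business_hours must be positive")
    if not 0 <= impact <= 1:
        raise ValueError("impact must be between 0 and 1")
    revenue_per_hour = gross_yearly_revenue / yearly_business_hours
    return revenue_per_hour * impact * outage_hours


# Hypothetical example: a 24/7 business (8,760 hours per year) with
# $87.6 million in yearly revenue, hit by a 4-hour outage that
# affects half of its revenue.
loss = lost_revenue(
    gross_yearly_revenue=87_600_000,
    yearly_business_hours=8_760,
    impact=0.5,
    outage_hours=4,
)
print(f"Estimated lost revenue: ${loss:,.0f}")
```

This covers direct revenue only. The indirect costs discussed earlier, such as labor overhead, data loss and reputation damage, would need to be estimated separately and added on top.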
How to minimize outage and downtime risk?
Evolven Change Analytics is a unique AIOps solution that focuses on changes - the true root cause of performance incidents. Evolven helps enterprise IT and Cloud Ops teams to prevent and troubleshoot incidents before the trouble starts.
Contact us to see how we help leading enterprises slash the number of incidents and MTTR.