Is Downtime Choosing You?
Recently Paul Venezia, senior contributing editor of the InfoWorld Test Center, wrote in his blog The Deep End that 'In IT, downtime chooses you'. He observes that in the IT industry, "even amid the best-laid plans, systems can go down without good reason, so do what you can -- and remember to take a breath."
Although Venezia's downtime experience on New Year's Day actually afforded him a little rest, he still felt his IT troubleshooting instinct kick in: "mulling over the various data points of the problem, looking for correlations or clues that would fix the problem -- all of the normal trappings of an IT ninja in a crisis."
This made us think about how hard downtime and outages still hit the modern data center, despite advances in technology, and how directly the business feels the consequences.
Are you prepared for the next outage?
The problem is that business processes, applications and computing infrastructure are too intertwined and dependent on each other. If the infrastructure isn't configured just right or is unavailable, the business process stops.
High-profile failures stemming from infrastructure or application issues.
As data center infrastructure becomes more complex and the pace of change increases, the challenge of keeping track of changes grows. According to Google, the outage occurred when its Sync Server, which relies on a component to enforce quotas on per-datatype sync traffic, failed. The quota service "experienced traffic problems today due to a faulty load balancing configuration change."
Downtime impacts reputation and loyalty.
A recent Gartner study projected that "Through 2015, 80% of outages impacting mission-critical services will be caused by people and process issues, and more than 50% of those outages will be caused by change/configuration/release integration and hand-off issues." The fallout from the Amazon cloud outage added to fear surrounding cloud security and downtime. And as Amazon continued to scramble to get its cloud services back online, many customers questioned the reliability of the cloud, Amazon's communication around the outage and whether they would be compensated for the downtime as part of their SLA.
Costly data center outages still caused by human error.
Operations teams face the added challenge of ensuring accurate, error-free application releases with appropriate configurations during promotion and deployment, taking into account configurations that are inherently different between pre-prod and production. Even automated deployment solutions do not guarantee that environments are properly configured.
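To make the pre-prod versus production point concrete, here is a minimal sketch of a configuration drift check. It assumes each environment's settings are already loaded into a flat dictionary; the function name config_drift, the MISSING marker and the sample keys are all hypothetical and only illustrate the idea of flagging differences that nobody expected.

```python
from typing import Any, Dict, Set, Tuple

# Hypothetical marker for a key that exists in only one environment.
MISSING = "<missing>"


def config_drift(
    pre_prod: Dict[str, Any],
    production: Dict[str, Any],
    expected_to_differ: Set[str],
) -> Dict[str, Tuple[Any, Any]]:
    """Return settings that differ between environments unexpectedly.

    Keys listed in expected_to_differ (hostnames, credentials and the
    like) are skipped, because those are inherently different between
    pre-prod and production.
    """
    drift: Dict[str, Tuple[Any, Any]] = {}
    for key in sorted(set(pre_prod) | set(production)):
        if key in expected_to_differ:
            continue
        before = pre_prod.get(key, MISSING)
        after = production.get(key, MISSING)
        if before != after:
            drift[key] = (before, after)
    return drift


# Illustrative data only; these keys and values are made up.
pre_prod_config = {"db_host": "db.staging.internal", "pool_size": 20, "cache_ttl": 300}
production_config = {"db_host": "db.prod.internal", "pool_size": 10, "tls": True}

for name, (before, after) in config_drift(
    pre_prod_config, production_config, {"db_host"}
).items():
    print(f"{name}: pre-prod={before!r} production={after!r}")
```

Run after a deployment, a check along these lines would surface the differing pool_size and the settings present in only one environment, while staying quiet about db_host, which is meant to differ.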
Downtime and outages don't leave the news headlines.
Seemingly preventable internal glitches also wreak havoc on business consistency, as "an outage in United's Unimatic software used to control its ground operations delayed flights nationwide, causing a flurry of passenger backlash a week before the busy Thanksgiving travel weekend and denting the No. 1 U.S. carrier's image."
Unexpected traps bedevil IT Operations.
Unauthorized changes are uncontrolled business risks. A slight change may seem fairly innocuous, but if it demands dynamic content creation on a server that is accessed thousands of times per day, it could bring that server to its knees. Take a faulty configuration change to the routers on a company's DNS network. This can cause requests for access to the company's Web sites to go unanswered, requiring hours of investigation to pinpoint the issue. The misconfigured files would then need to be replaced in order to return traffic to and from the affected Web sites to normal.
Sound familiar? "Last night's deployment didn't go as planned"
There they are. IT Ops is caught in a vicious cycle. They are too busy putting out operational fires, and the constant firefighting leaves them no time to prevent fires in the first place. This situation has come about because operations teams need to handle a complex software stack on various platforms, including physical, virtual and cloud.
Recovering from the IT Operations hangover.
No, system admins aren't getting dead drunk and blacking out about what happens in their IT environments, but they can arrive at work in the morning and find that their once smoothly functioning infrastructure is now a complete mess. Naturally, all they can ask is 'what happened?'