Just Another Typical Day in the Crazy Life of IT Operations?
In my travels and business encounters I come across a lot of different IT professionals and see, first-hand, just how differently data centers function. Nevertheless, I am continually surprised when I see how the IT operations team struggles to juggle incoming requests and new projects while fighting to maintain the stability and performance of the business systems. What shakes me up is not the struggle itself, but the fact that many in I&O have simply come to accept this situation as the norm and bear the daily grind without ever asking how to improve the state of their data center management.
So I wanted to consolidate the input and impressions I had into a profile of life in IT Operations.
A Typical Day
Early morning. The office is still quiet. Our IT Operations Manager picks up his first cup of coffee of the day. Before he can even take a sip, the telephone rings, and a strained voice from one of the other operations guys says: "Something is not working! Last night's deployment didn't go as planned."
In today's complex IT landscape, IT Ops has to stay on top of many channels and systems in order to maintain stability and keep the infrastructure performing as needed, so that customers throughout the organization can reliably focus on business. This is why those working in IT operations are inclined to see themselves as "heroes," keeping enterprise networks and applications going by carrying on the battle and tackling those ongoing, day-to-day problems.
However, this environment is being heated up further by agile software development, racing to meet the ongoing and changing demands of today's business services. For IT ops folks, this means managing complex processes while keeping up a hectic pace of deploying builds, releases and patches.
I saw this at one of our customers, a provider of key financial services. As we know today, agility comes at a price: some changes implemented rapidly can impact stability. So as releases and updates became more frequent, they had to make a lot of changes to the environment, with much more flexible change and configuration management procedures in place.
So instead of enjoying that morning cup of coffee, IT operations spends the day reacting to phone calls, emails and other notifications: customers can't connect to their applications, or the system simply isn't working. IT operations needs to make sure that the business keeps running and that business transactions are executed properly and on time.
Putting Out Fires
There they are. IT Ops is caught in a vicious cycle: they are too busy putting out operational fires, and the constant firefighting leaves them no time to prevent fires in the first place.
This situation has come about because operations needs to handle a complex software stack across physical, virtual and cloud platforms.
I see how IT ops continues through the day, struggling to stabilize the environment, put out the fires, and build the goodwill and credibility required to have meaningful conversations about addressing the organization's strategic needs.
Not only does the IT Manager need to surround himself with talented, self-sufficient personnel who can provide the foundation for growing the organization, but the team also needs the processes and tools necessary to make qualified decisions in this high-pressure landscape.
Usually, everybody can see when a team is drowning in critical incidents. But what is often not seen are all the important tasks being postponed because of those urgent fixes. "What?" the IT ops person yells at lunchtime, still hungry and staring at a cold cup of coffee. "A customer needs help with an über-pressing concern, and it has to be handled right now or the system will explode? Yeah! I'll get on that right away!"
IT Ops has become accustomed to working under constant stress and time pressure, and to responding to after-hours incidents. I've heard one say, "Servers have to be up 24/7, 365 days a year. So when you're contacted that a server went down, you have to get to work and fix the problem.
Many of the issues you need to resolve after hours could be prevented if maintenance and other operational tasks were executed precisely and consistently. If they aren't, problems can snowball throughout the data center and affect thousands of servers."
Configuration Data Flying at Record Speeds
The software developers are always pushing the limits, and SLAs demand religious adherence if IT Operations is to maintain its credibility. Still, in this hectic life, it is essential that the IT systems keep performing at peak levels, with IT operations striving to maintain uninterrupted services.
The task is complicated by the immense quantity of configuration data affecting the various elements of the IT environment: servers, operating systems, databases, middleware, applications and more. IT teams have to not only scan through, but also analyze, these diverse piles of configuration data.
I've seen many IT groups put in hours identifying the causes of a breakdown in order to restore service. The problem is that the available monitoring tools only see symptoms. "You touched what?" IT ops yells into the phone. This is how some very common telephone calls begin.
Pinpointing the Problem
In the large, enterprise-class organizations that I've visited, the investigation process frequently starts when the service desk is called. Once they pick up the telephone they need to find out what happened.
This can happen when IT implements a system upgrade, making changes to the environment. IT administration goes through an entire established set of processes (even ones outlined in ITIL), and still, in the end, the application doesn't function as planned.
So IT operations needs to go back and check the processes that the upgrade went through. Nevertheless, performance still lags.
Then they need to drill into the fine, granular details and retrace every step: identifying the make-up of even minor changes, seeing how the deployment to every server occurred, reviewing consistency between servers, and trying to understand whether there has been additional interference with the servers. They take this enormous amount of configuration data and granular changes and try to pinpoint the root cause.
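To make the idea concrete, here is a minimal sketch of that server-consistency check: comparing each server's configuration snapshot against a known-good baseline and reporting every deviation. The server names, keys and values are hypothetical, and real tools would handle far richer data than flat dictionaries.

```python
# Minimal sketch: surface configuration drift across servers by
# diffing each snapshot against a baseline. All data is illustrative.

def find_drift(baseline: dict, snapshots: dict) -> dict:
    """Return, per server, the keys whose values differ from the baseline."""
    drift = {}
    for server, config in snapshots.items():
        diffs = {}
        for key, expected in baseline.items():
            actual = config.get(key)
            if actual != expected:
                diffs[key] = {"expected": expected, "actual": actual}
        # Also flag keys present on the server but absent from the baseline
        for key in config.keys() - baseline.keys():
            diffs[key] = {"expected": None, "actual": config[key]}
        if diffs:
            drift[server] = diffs
    return drift

baseline = {"max_connections": 200, "tls": "1.2"}
snapshots = {
    "app01": {"max_connections": 200, "tls": "1.2"},
    "app02": {"max_connections": 150, "tls": "1.2", "debug": True},
}

print(find_drift(baseline, snapshots))
```

Even a toy diff like this shows why the manual version is so painful: multiply those two servers by thousands, and the few keys by tens of thousands, and pinpointing the one inconsistent setting by hand becomes the needle-in-the-haystack hunt described above.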
Yet, even when IT ops makes some headway in the investigation, they get the usual answer from possible suspects: "I didn't do anything that can affect anything. You can check, everything is written down and documented."
The War Room
If the problem is severe enough, IT ops will gather in an operations room – The War Room – for a more interactive process.
When performance indicates that something is about to go terribly wrong, IT operations are expected to prevent it before it gets any worse. So they set up The War Room, call everyone in, and start carrying out their detective work to find that needle in the haystack – the root cause of the issue. This entails many discussions. Calls. All the while, they know that if they don't find it, the situation will only grow worse, impacting the business. IT environment breakdowns delay projects, irk customers and interrupt creative flows. On top of this, SLAs may have clauses that penalize delays, which can cripple profits.
Finally, when IT discovers the source, there is that 'hit yourself on the forehead' moment, where someone admits, "Oh I forgot I did that."
When IT ops is in 'putting out fires' mode, they bother everyone, trying to get to the bottom of things. Or, on the other hand, they might be the ones creating fires.
The other side of the problem that I often see is that as IT ops carries out changes to restore performance, they invariably affect another group. Then IT ops becomes the suspect, and that group will be calling them.
The Dream: One Click to the Answers
So, after many hours, stress, meetings, and countless false leads, IT operations finally get to that cup of coffee.
Really, understanding your IT environment shouldn't be rocket science. Well-planned IT operations management can reduce the harmful effects. The problem I have seen is that IT Operations has adhered to an older, static, process-driven paradigm. IT needs to apply an analytics-based approach to its own operations (much as business has done with BI); otherwise IT will continue to jeopardize system stability and possibly expose the business to devastating risks.
Technology and business are moving at a swift pace and companies need the IT organization to provide actionable data for making important business decisions. Similarly, in the IT organization, operations needs to know what is happening now. Otherwise, IT Ops can find itself stuck, trying to adjust static processes while keeping track of and handling dynamic events, and getting caught off-guard when other issues arise.
So which tools could IT operations use to find out that there is a problem, identify the root cause of it, and resolve the issue?
IT operations needs to change the way it does things by approaching this situation with dynamic analytics: making sense of all the changing data and seeing what is really happening. This goes beyond the few designated indicators usually watched in various monitoring tools. That is very difficult when you're putting out fires; to make room for the important things, you need to prevent fires in the first place. You need to identify and prioritize potentially critical events, with complete documentation that automatically records what happened.
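The "identify and prioritize, with automatic documentation" idea can be sketched in a few lines. This is only an illustration under assumed inputs: the risk weights, event fields and service counts are invented for the example, not taken from any real product.

```python
# Illustrative sketch: record change events with an automatic timestamp
# and rank them so the riskiest surface first. The weighting scheme
# (environment weight x services affected) is an assumption.
from dataclasses import dataclass, field
from datetime import datetime, timezone

RISK_WEIGHTS = {"production": 3, "staging": 1}  # assumed weights

@dataclass
class ChangeEvent:
    server: str
    environment: str
    description: str
    affected_services: int
    # Automatic documentation: when the event was recorded
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    @property
    def risk(self) -> int:
        return RISK_WEIGHTS.get(self.environment, 0) * self.affected_services

def prioritize(events: list) -> list:
    """Highest-risk events first, so firefighting starts in the right place."""
    return sorted(events, key=lambda e: e.risk, reverse=True)

events = [
    ChangeEvent("db01", "production", "index rebuild", affected_services=4),
    ChangeEvent("web03", "staging", "patch rollout", affected_services=2),
]
for event in prioritize(events):
    print(event.server, event.risk)
```

The point is not the particular scoring formula but the shift it represents: instead of reacting to whichever phone rings first, the team works from an automatically recorded, risk-ordered list of what changed.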
I introduced one of our customers to a whole new approach to managing IT: collecting configuration changes, analyzing them for impact, and making the IT ops team aware of the corollary effects of those changes. The plan was that every morning they would verify that all the changes moved to production during the night were applied consistently on all the target servers, exactly as verified in pre-production or staging. This meant IT Ops adopting an analytics-driven management approach, similar to how business has adopted BI: extracting actionable information out of mountains of configuration data to help decision makers respond efficiently.
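That morning verification can be reduced to a simple idea: fingerprint the configuration that was verified in staging, then flag any production server whose fingerprint differs. The sketch below assumes configurations arrive as plain dictionaries and the hostnames are made up; a real rollout would pull snapshots from an inventory or CMDB rather than literals.

```python
# Hedged sketch: morning check that last night's changes landed
# consistently on every production server, by comparing each server's
# config fingerprint against the one verified in staging.
import hashlib
import json

def fingerprint(config: dict) -> str:
    """Stable hash of a configuration snapshot (order-independent)."""
    canonical = json.dumps(config, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

def morning_check(staging_config: dict, prod_configs: dict) -> list:
    """Return the hosts whose config does not match the verified baseline."""
    expected = fingerprint(staging_config)
    return [host for host, cfg in prod_configs.items()
            if fingerprint(cfg) != expected]

staging = {"release": "2.4.1", "pool_size": 50}
prod = {
    "web01": {"release": "2.4.1", "pool_size": 50},
    "web02": {"release": "2.4.0", "pool_size": 50},  # missed the deploy
}
print(morning_check(staging, prod))  # → ['web02']
```

Run before the first cup of coffee, a check like this turns "something is not working" into "web02 didn't get last night's release" – the difference between a war room and a two-minute fix.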