Where Exactly Didn’t You Change Anything?
Getting a Full Picture of IT Operations
This is a guest post from H-C Boos, the founder and CEO of arago AG in Germany and evangelist for IT automation.
Let me begin by saying that I am involved in many kinds of research projects, and I often see that too much information can obscure simple truths and keep easy solutions well hidden. Having said this, I also often find the problem of too little information in IT operations environments. This is hardly because the information is not available or cannot be retrieved. Mostly it is because the space of all available information is unorganized and completely inconsistent. Don't worry, this is not a sales pitch for CMDB projects or anything of the like. I just want to give you some insight into why I went about finding proper information sources for automated IT operations in a fairly unconventional way.
IT Operations In Autopilot Mode
Those of you who have followed my blog or my timeline on Twitter know that I am promoting IT operations in autopilot mode. But what does that mean? Quite simply, it means that a machine should be taught all our administrative experience, should apply this experience by itself, and should find a way to resolve tasks in incident, problem and change management by trial and error – based on best practice, which in turn is a fancy name for well-organized experience or knowledge. This machine would plough its way through analyzing a task, gathering all the information necessary to execute it and then taking the necessary action. That is, if we are talking about an optimal environment. In reality the autopilot for IT operations works much more like a normal admin – you know, those guys you laugh about when you watch ‘The IT Crowd’ on YouTube. He is given a badly described task with too little information and has to use his gut feeling (experience) and fiddle his way through all possible actions, avoiding further damage and finding the information necessary to finally take the proper course of action to perform the task.
A Long List of Trials and Errors
An effectively run modern IT environment has low tolerance for failure, misadministration or misunderstandings. Operating in an ITIL-governed environment normally means getting your act together, finding out what the task is really about, finding out all the details, and then going out and doing what needs to be done.
In a standard administrative approach this becomes a long list of trials and errors or a long list of very small steps towards a final solution. This is why resolving an incident in a complex environment often takes hours. Not because the know-how for resolving the problem is not available, but because there are millions of places to look and find out what exactly was wrong.
You can see that a machine with the same experience as a bunch of expert administrators will be much faster in closing in on possible solutions, but such a machine may also be much harsher in trying out different solution possibilities. If you could give the machine just enough information – and that normally means more than is available from the task itself – you would dramatically shorten the trial and error time this machine would take.
Having All the Configuration Parameters
Well, of course the same is true for our fellow human administrators. If you give them all the information, you will most certainly allow them to resolve tasks much more quickly. But that assumes the guys working with the information are very well organized, structured thinkers who do not rely on their experience as much as they rely on hard facts. Ever seen these guys sitting in your organization? Sure, but not really as administrators. Administrators are the guys who magically fix things because they have a hunch and because they are willing to take a risk and give the hunch a try.
Giving a good administrator a tool that lets him look at all the configuration parameters available in the environment allows him to focus on the task. Showing him a history of changes will help him greatly when dealing with incidents and problems, since usually there is not just a single root cause but a bunch of small changes and inconsistencies that, thrown together, create an avalanche of events and in the end disrupt stability.
More than ten years ago, when we at arago set out to tweak our autopilot for operating on basic IT environments, we saw that there were two things anyone – machine or human – needs in order to successfully operate an IT environment:
- A model of the things you are supposed to keep in shape
- Information about the state of the environment you are responsible for
So knowing what you are supposed to operate and what it is like at any moment is essential. Only AFTER you have this information is any kind of experience or knowledge of any value to you.
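The two ingredients listed above can be pictured as two simple data structures and a comparison between them. What follows is only a minimal sketch in Python, not arago's autopilot: it assumes a hypothetical model that records expected attribute values per component and a hypothetical state snapshot of the kind a monitoring interface might deliver. The component names, attribute names and the `find_deviations` helper are all made up for illustration.

```python
from typing import Any

# Hypothetical model: what each component is supposed to look like.
model: dict[str, dict[str, Any]] = {
    "web01": {"status": "running", "port": 443},
    "db01": {"status": "running"},
}

# Hypothetical state: what monitoring currently reports.
state: dict[str, dict[str, Any]] = {
    "web01": {"status": "running", "port": 8443},
}


def find_deviations(model, state):
    """Return (component, attribute, expected, observed) for every mismatch."""
    deviations = []
    for component, expected_attrs in sorted(model.items()):
        observed_attrs = state.get(component)
        if observed_attrs is None:
            # The model knows the component, but no state was reported for it.
            deviations.append((component, "<component>", "present", "missing"))
            continue
        for attr, expected in sorted(expected_attrs.items()):
            observed = observed_attrs.get(attr)
            if observed != expected:
                deviations.append((component, attr, expected, observed))
    return deviations


for deviation in find_deviations(model, state):
    print(deviation)
```

Only once such a list of deviations exists does experience come into play: it is what decides which deviation matters and what to do about it.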
Finding out that there are many solutions to keep track of the desired state, we decided to step away from monitoring and just interface to all kinds of monitoring solutions. The model has been a bigger challenge and still is. Most auto discovery solutions had terrible results to begin with, and most manually maintained models are either inaccurate or just too expensive to maintain at a good quality. Even though discovery solutions like IBM TADDM have shown that big steps towards improvement can be taken, there is still a big gap to be filled. There are some very good modeling approaches we have found – practical and simple ones. Take the one propagated and implemented at German Rail's IT subsidiary DB Systel with the iTop Manager.
Overall Business Oriented Approach For IT
I believe that an overall business oriented approach for IT is the most effective because it does not simply create an overload of information but gives the information business meaning and criticality. Well, a whole science has evolved around that topic with BSM – and if you are interested in finding out the state of BSM you had better catch up on Adventures in Datacenter operations. If you have such a practical – and if possible business oriented – model in place, you can go about administrating your environment in an organized way. You can even apply autopilot for IT operations technology and cut down your manual work by 30% to 80% while making much better use of the talent you have in your company.
Yet who will answer all the detailed questions while your admins or your autopilot try to go forth and do the nitty-gritty work for you? Well, they can find out themselves, right? Yes, that is what they do, and that is why you are unhappy about mean time to resolution or other KPIs, and that is why you need ‘system admin appreciation day’: because you have them find out what the hell you were talking about when creating an incident or a task for the operations team. Especially when you have to ask the question “where exactly didn't you change anything?”
Finding Out Additional Information
While pondering how we could improve our best practice approach for providing autopilot IT operations with a model of the IT, we came up with a simple solution: keep the model or the CMDB as simple as possible, but have a clearly structured and well maintained set of tools to find out additional information. This applies especially to finding out what has changed without explanation, as opposed to all the changes you process properly in your IT operations department. In fact, discovering change is the most popular use of discovery solutions like IBM TADDM or EMC's (pardon me, VMware's) ADM right now, even though these tools were mainly made to show the whole model.
Granular Configuration Automation
While trying to perform a certain task, we are not looking for great abstraction, but for well organized detail and detailed specialist knowledge. This is where a granular, knowledge oriented approach like Granular Configuration Automation comes in. Even though it is a big vision to understand all configuration parameters for all kinds of applications and software, it is better to have some information at this level of detail for some components in your environment than to have to find out all that stuff yourself by trial and error.
So I am quite fond of the idea that, at least for certain components in today’s IT infrastructure and landscape, we can know exactly which parameters are responsible for which effect and how this effect relates to a task. This is why I gladly agreed to write this guest post for Evolven. Their approach shortens the path needed to actually resolve given issues quite a bit, because you have an overview of the details at the time you want to know, and not simply all details at your fingertips if you bother to ask the right questions.
When trying to find out why a certain task is important or where certain results come from, ignorance is a blessing and having a view of the big picture – preferably from the perspective of the business users – is a great thing. But when you are knee deep in the resolution of a problem, the analysis of an incident or the planning or execution of a change, more detailed, granular and logically connected information is essential. You can never fix a problem in a set of parameters that a new deployment changed when you have no clue that these parameters exist.
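To make the point about parameters changed by a deployment concrete, here is a minimal sketch in Python that compares two configuration snapshots. It assumes each snapshot is a flat dictionary of parameter names to values, taken before and after a deployment. The parameter names and the `diff_snapshots` helper are invented for illustration and do not describe how Evolven or any other product works.

```python
def diff_snapshots(before, after):
    """Classify parameters as added, removed or changed between two snapshots."""
    added = {key: after[key] for key in after.keys() - before.keys()}
    removed = {key: before[key] for key in before.keys() - after.keys()}
    changed = {
        key: (before[key], after[key])
        for key in before.keys() & after.keys()
        if before[key] != after[key]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots taken before and after a deployment.
before = {"jvm.max_heap": "2g", "db.pool_size": 20, "cache.enabled": True}
after = {"jvm.max_heap": "2g", "db.pool_size": 5, "cache.ttl": 60}

result = diff_snapshots(before, after)
for kind, params in result.items():
    print(kind, params)
```

Even a comparison this crude tells the administrator that `db.pool_size` exists and that it moved, which is exactly the knowledge that is missing when a task arrives with too little information.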
About Chris Boos
Board member of arago AG. One of three founding members of the Frankfurt, Germany-based arago Institut für komplexes Datenmanagement AG. Author of many academic and business publications, focusing on automation in IT operations as well as information modeling, winner of the John F. Kennedy National Leadership Award and other IT and business awards. Studied computer sciences at ETH Zürich and the Technical University Darmstadt.
Link to Chris Boos
- Twitter http://twitter.com/boosc
- LinkedIn http://www.linkedin.com/in/boosc