Monitoring Sucks (and What We Can Do About It)
So, one would think that monitoring an infrastructure is one of the most trivial things around, right? But for as long as we can remember, systems administrators have complained about the state of monitoring. Now it's official: IT professionals around the world hate systems monitoring. In fact, the phrase "monitoring sucks" has taken on a life of its own, spawning blogs, an open source code repository at GitHub – even the #monitoringsucks hashtag on Twitter. (see more at "Monitoring Sucks" movement rallies for better systems monitoring tools)
While system monitoring mostly watches performance and availability, neglecting change and configuration monitoring can hurt performance and availability just as much.
What's Up with Change and Configuration Monitoring?
So what's the matter with change and configuration monitoring?
- It's a huge undertaking! Each technology has literally thousands of configuration parameters, some critical and some less so, and manually defining what you're looking for is overwhelming. It is simply impossible to monitor and evaluate every single configuration point in a support stack using purely manual processes. You would need a workforce the size of a city like Cleveland (population 478,403) just to perform the necessary system review. That's just not practical. Existing IT support staff already have enough on their plates just chasing down known problems, and really don't have time to carefully evaluate every single change that occurs on every single supported system. (Entering The Change Twilight Zone And How To Overcome It)
- Partial information. A CMDB, or other tools layered on top of a CMDB, provides only partial information. A CMDB, for example, focuses on high-level information and doesn't go (or it's not practical for it to go) into granular, detailed configuration information. Other tools focus only on a specific technology, ignoring the full scope of the IT environment. The result: the IT organization misses a major part of the configuration information required for effectively managing complex business systems. (Is the CMDB Dead?)
- Slow and heavy tools. People still complain about performance and availability monitoring, even though relatively simple and even free monitoring tools are available. For configuration management it is even worse: the tools are an operations nightmare. IT environments are complex, and these tools are hugely complex undertakings to design, build, populate and maintain. Underestimate any one of those aspects at your own peril.
- Lack of automation possibilities. Most configuration monitoring solutions do not have automated collection capabilities. Forrester has assessed that "automating the solution is as critical for handling the scope of change management in today's organizations. Automation offers infrastructure and operations staff a way that can impact IT to be a channel for ensuring that business operations perform at the highest levels. Some very complex applications require I&O to set tens of thousands of parameters. The frequency of change is so great that it results in chaotic modifications. Training and change management solutions must then be used, at a level of granularity that provides a control/validation of configuration parameters and the detection of unauthorized changes. Look to use automation in these complex application deployments and changes." (Forrester: Assessing Complexity In IT Operations)
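As a rough illustration of the "detection of unauthorized changes" mentioned in the last point, here is a minimal sketch that compares two configuration snapshots and flags any parameter that changed without being on an approved list. Everything in it is an assumption made for illustration: the snapshot format (flat key/value dictionaries), the function names `diff_snapshots` and `unauthorized_changes`, and the parameter names are all hypothetical and do not come from any particular tool.

```python
from typing import Dict, Iterable, List

# Assumption: a snapshot is a flat mapping of parameter name to value.
Snapshot = Dict[str, str]


def diff_snapshots(baseline: Snapshot, current: Snapshot) -> Dict[str, list]:
    """Return the parameters added, removed or changed since the baseline."""
    added = sorted(k for k in current if k not in baseline)
    removed = sorted(k for k in baseline if k not in current)
    changed = sorted(
        (k, baseline[k], current[k])
        for k in baseline
        if k in current and baseline[k] != current[k]
    )
    return {"added": added, "removed": removed, "changed": changed}


def unauthorized_changes(diff: Dict[str, list], approved: Iterable[str]) -> List[str]:
    """Return every touched parameter that is not covered by an approved change."""
    touched = set(diff["added"]) | set(diff["removed"])
    touched |= {key for key, _old, _new in diff["changed"]}
    return sorted(touched - set(approved))


if __name__ == "__main__":
    # Hypothetical parameter names, for illustration only.
    baseline = {"jvm.heap.max": "2g", "db.pool.size": "20", "log.level": "INFO"}
    current = {
        "jvm.heap.max": "4g",
        "db.pool.size": "20",
        "log.level": "DEBUG",
        "cache.ttl": "300",
    }
    diff = diff_snapshots(baseline, current)
    print(diff)
    print(unauthorized_changes(diff, approved={"jvm.heap.max"}))
```

In practice the snapshots would be collected automatically and on a schedule; the point of the sketch is only that, once collection is automated, validation against approved changes is a small amount of logic.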
Why Not Improve Configuration Monitoring?
Why aren't vendors dealing with this issue of obsolete tools? The IT operations world is weighed down with responsibility and tired, preferring to accept half solutions and live with them rather than demand that vendors provide optimal ones. There's a good reason why large configuration monitoring software doesn't do everything right for everyone. These products do a lot, sometimes poorly, and almost always in a disjointed manner, the result of years of feature creep and feature parity. There are too many moving parts, layers upon layers of old code (as well as bugs), and an inflexible approach towards newer development methodologies (e.g. agile). IT operations today is basically faced with choosing between two imperfect routes for configuration monitoring. On the one hand, there are the complicated, inflexible and expensive enterprise tools that are heavily promoted and rigged with vendor lock-in. On the other hand, we have an assortment of small system administration tools, many of which are great at addressing specific pain points but are only small pieces of a larger puzzle.
Getting Somewhere Slow
Let's look at the state of IT operations with an analogy I enjoy. Today's configuration management monitoring tools run like a person who walks to work with a hurt leg, going no faster than 3 km/h. Yet rather than take a car or other available technology to cover the distance faster, this guy keeps walking and keeps calling the office to say he will be late because of his leg. Organizations are doing much the same: they take their time and keep working with older configuration management tools. That's part of the frustration. As IDG reports, "Most monitoring systems collect megatons of data - it's making sense of it that's difficult." (The Top 5 Reasons Monitoring Sucks)
A Configuration Management Dream
When change and configuration monitoring is not an integrated part of your infrastructure monitoring solution, you are not aware of everything happening that can impact performance and availability.
Data centers can now achieve cost-efficiency that was unimaginable a few years ago, with savings from server consolidation and more efficient operations. Yet there are estimates that as much as one-half of unplanned system downtime, which can add significant strain to IT budgets, can be attributed to configuration problems (Ronni J. Colville and George Spafford, Configuration Management for Virtual and Cloud Infrastructures). So as organizations consolidate their server environments and increase the number of applications and services their data centers deliver, data center managers must find ways to stay on top of configuration data without introducing prohibitive operational costs.
What makes a good Configuration Management solution?
So what do we see as the critical characteristics of configuration management that really works?
- Fast and easy setup. An overly complicated tool setup just adds to the burden on IT operations: fine-tuning the tool heaps more onto IT ops' ongoing, resource-draining management responsibilities.
- Can automatically adjust to dynamic environments. Tools should be seamlessly integrated into dynamic resource management and automated deployment.
- Gathers granular data and converts it with analysis capability. The data center contains a growing body of knowledge with hundreds of thousands of unique configuration parameters spanning various technologies, detailed to a granular level. Intelligent analytics should provide a way to process all this data, shrink it to manageable portions and show how the data correlates, making the information practical and actionable.
- Can connect to current systems, complementing them without depending on them. The tool administration burden is already exacerbated by having to use a variety of specialized infrastructure management tools, each designed to address a specific operation and each with its own setup, configuration and administrative responsibilities. IT ops therefore needs a configuration management tool that can be integrated into the current arrangement.
- Minimum admin overhead. The tool needs to allow for flexible accessibility, while avoiding admin overhead in infrastructure management.
The new requirements set forth by post-crisis regulation put a squeeze on profitability of banks, forcing them to make new investments in data, reporting, and compliance infrastructure - driving up operating and IT costs. Changes in banking were catalyzed by regulation and are being enabled by ops and IT innovation, seeking radical increases in efficiency and lower-cost operating models.
Standing Strong Against the Pressure
Today performance and availability are affected by changes made under pressure. IT is bombarded with change requests to carry out on complex, interdependent systems, under intense demands from the business to release now. Since any small change can trigger a high-impact incident, it is not surprising that organizations experience painful stabilization periods after releases, and even production outages.
"Most people agree that there are plenty of good tools around if your infrastructure is small-to-medium sized. It starts to be more problematic when your infrastructure grows, when you have more and more items to monitor and more and more items to measure. With the introduction of "infrastructure as code" people want to be able to deploy a service automatically, and also include monitoring in that deployment; that's one area where current tools are not ready." (Monitoringsucks, a #devops thing?)
Vendors need to provide change and configuration awareness that identifies problematic situations so that action can be taken. Configuration management tools need to take a new approach, monitoring activity in real time and identifying affected components. For example, when server CPU activity suddenly jumps by 60%, you need to know what type of server it is. If it is an application server, you should be able to identify which one, what its configuration is, how it changed recently and how it differs from the servers running with normal CPU utilization.
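The last question, how the affected server differs from the servers running normally, can be sketched in a few lines. This is only an illustration under stated assumptions: configurations are taken to be flat key/value dictionaries, the "normal" value of a parameter is assumed to be the most common value among healthy peers, and the names `peer_consensus` and `deviations` as well as the sample parameters are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Assumption: a server configuration is a flat mapping of parameter to value.
Config = Dict[str, str]


def peer_consensus(peers: List[Config]) -> Config:
    """For each parameter, pick the most common value among healthy peers."""
    keys = set().union(*peers)
    consensus: Config = {}
    for key in keys:
        values = Counter(peer[key] for peer in peers if key in peer)
        consensus[key] = values.most_common(1)[0][0]
    return consensus


def deviations(
    suspect: Config, peers: List[Config]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Map each differing parameter to (suspect value, peer consensus value).

    None means the parameter is missing on that side.
    """
    consensus = peer_consensus(peers)
    result = {}
    for key in sorted(set(suspect) | set(consensus)):
        suspect_value = suspect.get(key)
        normal_value = consensus.get(key)
        if suspect_value != normal_value:
            result[key] = (suspect_value, normal_value)
    return result


if __name__ == "__main__":
    # Hypothetical application-server settings, for illustration only.
    healthy = [
        {"threads": "200", "gc": "G1"},
        {"threads": "200", "gc": "G1"},
        {"threads": "200", "gc": "CMS"},
    ]
    suspect = {"threads": "50", "gc": "G1", "debug": "true"}
    for name, (actual, normal) in deviations(suspect, healthy).items():
        print(f"{name}: suspect={actual!r} peers={normal!r}")
```

Using the majority value as the reference is a deliberate simplification. It tolerates a single odd peer, but a real analysis would probably also weigh how recently each parameter changed.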
Managing IT environments with intelligent automated analytics will drive more sophisticated proactive processes, such as comparing environment states, validating releases and verifying the consistency of changes. These give today's IT operations actionable information that helps prevent or identify critical issues. So rather than continue to feed a bloated system, we should strive to simplify, rebuild and implement configuration management based on intelligent analytics, and turn the situation around from 'monitoring sucks' to what we can do about performance and availability.