It’s Time to Turn and Face the Changes…
“(Turn and face the strain)
(Oh, look out you rock 'n rollers)” – David Bowie
“It's been a long
A long time coming
But I know a change gonna come
Oh, yes it will” – Sam Cooke
Well, yes, it is time to face the strain of changes: they are indeed the cause of our biggest headaches and, no matter what, they are “gonna come”. Of course, I am referring to problems in the domain of Information Technology (IT).
It is time that Infrastructure and Operations (I&O) leaders, including those practicing DevOps principles, follow the road-worn advice of David and Sam and embrace change, including its role in automated root cause analysis. Otherwise, we may happily think we have solved a problem while the true underlying cause is lurking, biding its time, and waiting to wreak havoc when we least expect it.
Root Cause Analysis
Root cause analysis (RCA) is a methodology that works to determine the initial, underlying reason something has failed or incurred latency. Why is this important? You need to know what to fix and which team to “assign” to fix it and, better yet, what to do to prevent it from happening again. According to the IT Infrastructure Library (ITIL), a problem is “the cause” or explanation of why an incident has occurred. It is critical to make this determination as quickly and accurately as possible, because this “problem” is the actual item that needs repair. Without this knowledge, time will be wasted chasing “ghosts”, with little to no impact on the resolution of the real problem. Determining why a problem occurred is just as important as resolving the incident. Agile self-healing environments may rapidly resolve the incident via an automated rollback; however, the problem may take days to investigate, and during that time there is a risk it may rear its head again and affect customers.
Most tools providing RCA ingest log files, consume events, and trace application transactions. The RCA process attempts to trace the path back across the topology from an observed effect to the initial reason for the failure, namely its cause. These tools often do this by “walking” the topology “stitched” together from distributed traces to determine how a non-optimal condition occurring in one node might propagate across an edge to create an observable incident at other nodes, e.g., an application outage, latency, or an error condition.
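To make the “walking” idea concrete, here is a minimal sketch in Python. It is not how any particular RCA product works; the service names, the `DEPENDS_ON` map, and the `UNHEALTHY` set are all hypothetical stand-ins for a topology and health signals that a real tool would derive from traces, logs, and events.

```python
from collections import deque

# Hypothetical topology: each service maps to the services it depends on.
DEPENDS_ON = {
    "web": ["checkout", "catalog"],
    "checkout": ["payments", "db"],
    "catalog": ["db"],
    "payments": [],
    "db": [],
}

# Hypothetical set of nodes currently showing errors or latency.
UNHEALTHY = {"web", "checkout", "db"}


def candidate_root_causes(symptom, depends_on, unhealthy):
    """Walk from the symptom along unhealthy dependencies.

    Returns the unhealthy nodes that have no unhealthy dependency of
    their own, i.e. the points where the trail of symptoms ends.
    """
    candidates = []
    seen = {symptom}
    queue = deque([symptom])
    while queue:
        node = queue.popleft()
        bad_deps = [d for d in depends_on.get(node, []) if d in unhealthy]
        if node in unhealthy and not bad_deps:
            candidates.append(node)
        for dep in bad_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return candidates


print(candidate_root_causes("web", DEPENDS_ON, UNHEALTHY))
```

In this toy topology the walk starts at the visible symptom (`web`), follows the unhealthy edge through `checkout`, and stops at `db`. Note that the sketch treats the topology as fixed, which is exactly the assumption questioned in the next section.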
Where RCA Went Off the Rails
So far so good, but there is an underlying false assumption in RCA automation: that configuration items (CIs), artifacts, and topology are static and don’t change. Of course, we know they do, as we have IT Service Management (ITSM) processes to create change requests. And we know the first thing an IT specialist will say when they hear of a new problem is, “What changed?” And yet, there isn’t any standard for defining what might be called “change events” that should be consumed as part of the RCA process as a possible explanation or root cause of a problem.
While some RCA tools look at changes in workload performance, they are focused on the impact of these changes and miss the causes that lie in configuration, code, and data. Changes in workload performance do not happen without an underlying cause. Configuration, code, and data all have their own measurements, scope, importance, and risk markers. Failure to consider these differences results in an inability to determine causality.
Observability tools, currently the most in vogue for RCA, analyze telemetry sourced from the three dimensions of logs, metrics, and traces. These tools have inherent limitations. They only analyze the IT full stack vertically, essentially reviewing the “ever-present now” for cause and effect. This leaves out how things were before, and how the change from “before” to “after” can be the cause of a problem.
There are tools that fall under the AIOps category that attempt to look at change requests and automated deployments using time-based correlation. Unfortunately, the effectiveness of these tools is limited because they fall into the trap of confusing correlation with causation. Just because something happened in the same timeframe doesn’t necessarily mean that one is the cause of the other.
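A small sketch shows why a time window alone is a weak signal. The change records, the incident, and the one-hour window below are all invented for illustration; no real AIOps tool's data model is implied.

```python
from datetime import datetime, timedelta

# Hypothetical change records and incident, for illustration only.
CHANGES = [
    {"id": "chg-1", "target": "billing-db", "at": datetime(2024, 1, 10, 9, 40)},
    {"id": "chg-2", "target": "intranet-wiki", "at": datetime(2024, 1, 10, 9, 50)},
    {"id": "chg-3", "target": "billing-db", "at": datetime(2024, 1, 9, 9, 0)},
]
INCIDENT = {"service": "billing-api", "at": datetime(2024, 1, 10, 10, 0)}


def changes_in_window(changes, incident_at, window):
    """Return ids of changes made within `window` before the incident."""
    return [
        c["id"]
        for c in changes
        if timedelta(0) <= incident_at - c["at"] <= window
    ]


suspects = changes_in_window(CHANGES, INCIDENT["at"], timedelta(hours=1))
print(suspects)
```

With these made-up records, the window flags both `chg-1` and `chg-2`, even though a wiki change has no obvious connection to a billing incident. It also skips `chg-3`, a change to the billing database made a day earlier, which could still be the real culprit if its effect took time to surface.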
Use Change-Aware Root Cause Analysis to Improve Production Reliability
I&O leaders can improve the reliability of production, ensuring it delivers expected results, by determining the probability that a change is, or is likely to become, the root cause of a problem. It’s important to remember that reliability and availability are not the same thing. Availability measures the percentage of time a computer system is accessible by users. While availability is necessary, reliability goes further and measures the percentage of time a computer system delivers its intended function. A system may be available but unreliable, such as when an application is delivering the wrong information, e.g., providing incorrect product pricing to a user. Even worse, an application could be available but refuse to accept any new orders. That’s not a reliable system.
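The distinction is easy to see with numbers. The figures below are hypothetical, chosen only to illustrate the two definitions above: a service that was reachable for almost the whole month but spent part of that time returning wrong results.

```python
def percentage(part, whole):
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


# Hypothetical figures for a 30-day month.
total_minutes = 30 * 24 * 60   # 43,200 minutes in the period
reachable_minutes = 43_157     # users could access the system
correct_minutes = 42_300       # system was reachable AND did its job

availability = percentage(reachable_minutes, total_minutes)
reliability = percentage(correct_minutes, total_minutes)

print(f"availability: {availability:.2f}%")
print(f"reliability:  {reliability:.2f}%")
```

Under these assumptions the service reports roughly 99.9% availability while delivering its intended function only about 97.9% of the time. A dashboard that tracks availability alone would miss that gap.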
Evolven RCA promotes reliability by considering both the current state of the full stack and how its elements (e.g., CIs and their configurations, topology, and code) have changed over time, analyzing each within the specifics of its domain and its unique risk markers. This adds a horizontal point of view to RCA. To make this work, Evolven extends standard Observability with a fourth dimension of analysis: the horizontal dimension of “change” over time.
Evolven can prevent impact from problems by providing RCA that considers change over time as a potential root cause. It tracks all the Configuration Items (CIs) and their configurations, and analyzes the “actual” changes that are deployed to these CIs. Evolven goes beyond vertical analysis, adding a horizontal examination of change specific to the unique attributes of each type of data that has changed. Using artificial intelligence and machine learning (AI/ML) analytics, it determines whether these changes are the actual underlying cause of a problem and the degree of risk they carry.
The solution can also integrate with existing monitoring toolchains and correlate their analysis with the discovery of changes to determine risk and deliver actionable insights into the root cause of problems occurring now as well as problems that may happen in the future.
Adding in the dimension of change provides a clear path towards what must be fixed to improve reliability. With clear insight into the changes that caused an application to stop producing expected results, I&O leaders can immediately act to improve production reliability.
An example of a reliability problem occurred recently at one of Evolven’s clients. Performance was severely degraded on several production servers running a critical business service. The unfortunate impact of this included SLA penalties, user dissatisfaction, lost revenue, and an enormously expensive effort involving over 40 subject matter experts (SMEs) attempting to troubleshoot the problem.
Evolven analyzed the situation, comparing the well-performing servers with the poorly performing ones. Its automated analysis determined that the difference in performance was caused by the recent installation of a new security software agent on the poorly performing servers. As a result of Evolven’s investigation, the client turned off the new security agents, and performance was restored. An RCA tool unable to consider the dimension of change as a potential root cause would not have provided the solution to this reliability problem.
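The general idea of comparing a healthy server against a degraded one can be sketched in a few lines of Python. This is an illustration of configuration diffing in general, not Evolven's implementation; the configuration keys and values are invented for the example.

```python
# Hypothetical configuration snapshots; keys and values are illustrative.
HEALTHY = {
    "os_patch": "2024-01",
    "jvm_heap_mb": 4096,
    "agents": ("monitoring",),
}
DEGRADED = {
    "os_patch": "2024-01",
    "jvm_heap_mb": 4096,
    "agents": ("monitoring", "security-agent"),
}


def config_diff(baseline, suspect):
    """Return {key: (baseline_value, suspect_value)} for every difference."""
    diff = {}
    for key in sorted(baseline.keys() | suspect.keys()):
        before, after = baseline.get(key), suspect.get(key)
        if before != after:
            diff[key] = (before, after)
    return diff


for key, (before, after) in config_diff(HEALTHY, DEGRADED).items():
    print(f"{key}: {before} -> {after}")
```

Here the only difference reported is the extra entry under `agents`, which narrows the investigation to a single change. Real environments have far more settings and far more noise, so a plain diff like this is a starting point for ranking differences by risk, not a verdict.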
Evolven provides businesses with the insight to determine the root cause of problems that were due to risky changes. It also delivers the capability to foresee the risk of changes leading to potential problems, and the prescriptive advice to prevent impact before there is damage.
Evolven’s technology will help your business maintain highly reliable services for your customers, move from being reactive to making risk-informed decisions, and prevent risky changes from impacting the customer experience. And when it’s time (as David B. said) to “turn and face the strain” of changes, Evolven will help you stay ahead of problems and manage reliability, compliance, and security risks. And who knows, maybe you’ll even get home a bit earlier. Not a bad thing at all.
Contact Evolven here to see the Evolven Change Control technology in action.