Embrace Change as the Secret Ingredient in Improved Observability, Customer Experience, and Business Innovation
“It is not the strongest of the species that survive, nor the most intelligent, but the one most responsive to change.” — attributed to Charles Darwin
The Illusion of Invariance
Change is the only known constant in Information Technology (IT). We face the certainty of uncertainty daily, whether we are deploying new code, monitoring existing code, assessing customer experience, or measuring application performance. The foundation beneath all of these activities is constantly in flux.
As businesses embrace digital transformation, moving manual processes to software-driven ones, they create ever more complex, componentized technologies to deliver on these initiatives. These efforts require an agile infrastructure that constantly changes and adapts to meet the needs of digital business. However, this agility comes at the cost of intricacy, resulting in an IT infrastructure that fails frequently, is harder to diagnose, and degrades customer experience. The underlying cause of these failures is often a change buried in an endless “sea” of containers, pods, microservices, VMs, API functions, and more, seemingly invisible to the problem determination process.
But what if we overturn this paradigm and instead embrace change in infrastructure, assume fluidity, and begin to anticipate the risks in change, adopting a culture of change awareness? This can be an opportunity for improved observability, increased product quality, and greater agility in adapting to the ever-changing needs of the marketplace.
Understanding Change as the “Cause” of the “Root Cause” of a Problem
Enhance root-cause-analysis (RCA) processes and tools to automatically determine that a specific “change” is the cause of a failure and that symptoms such as error messages, degradation, poor user experience, and more are a result of that change. Answer the question “what changed” before it is asked. After a significant failure occurs, the first question asked is “did it work before?” If the answer is “yes”, then the next question is “what changed?” All too often the response is a resounding silence.
Failures are most often the result of either insufficient infrastructure resources, such as memory, storage, or compute, or faulty software code or configuration. With today’s emphasis on infrastructure as code and configuration as code, the cause most often ends up being in the software. And yet, many of these errors do not show up in the test environment. Why? Because the test environment differs from production in a critical way, or the load in production exceeds what was anticipated. Typical root-cause analysis tools construct a “causality chain” that tracks a “symptom” across a topology back to the initial or root cause of the issue.
This process might determine that the problem is, for example, a null-pointer exception, a bad certificate, a misconfigured JVM, a patch, a concurrency issue in code, or any of many other possible causes. At best, RCA will tell you that one of these is the cause; it does not provide the granular detail behind that cause. It does not tell you that there is a temporal aspect: these “causes” are new and are directly the result of a change in code or configuration.
Actionable root cause analysis must include tracking the changes that are deployed and, using machine learning analytics, determining whether those changes are the true underlying cause of a problem. Adding this temporal aspect of change provides a clear path toward what must be fixed to restore service. Without knowing why something stopped working, repair efforts merely chase symptoms rather than the true root cause.
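The temporal aspect can be illustrated with a minimal Python sketch. It is not the machine learning analysis described above; it is a plain time-window filter that answers “what changed?” for the components involved in a failure. The `Change` record, its field names, and the 24-hour default window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One deployed change. All field names are hypothetical."""
    target: str            # component that was changed
    description: str       # what was changed
    deployed_at: datetime  # when the change reached the environment


def candidate_causes(changes, failed_components, failure_time,
                     window=timedelta(hours=24)):
    """Return changes made to failed components within `window` before the
    failure, most recent first."""
    earliest = failure_time - window
    suspects = [
        c for c in changes
        if c.target in failed_components
        and earliest <= c.deployed_at <= failure_time
    ]
    return sorted(suspects, key=lambda c: c.deployed_at, reverse=True)


# Illustrative data only.
changes = [
    Change("payments-api", "JVM heap flag edited", datetime(2024, 3, 1, 9, 0)),
    Change("payments-api", "TLS certificate replaced", datetime(2024, 3, 4, 13, 30)),
    Change("search", "index rebuilt", datetime(2024, 3, 4, 13, 45)),
]
suspects = candidate_causes(changes, {"payments-api"}, datetime(2024, 3, 4, 14, 0))
for c in suspects:
    print(c.deployed_at.isoformat(), c.target, c.description)
```

A filter like this only narrows the list of suspects. Deciding which recent change actually caused the failure still requires the deeper analysis the paragraph above calls for.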
Recently, Microsoft’s Exchange Admin Portal had a global outage that lasted several hours. With the average cost of IT downtime said to be $5,600 per minute per organization (according to Gartner), this had a serious financial impact on many businesses. After many hours of troubleshooting, it was learned that a certificate had expired. Had change awareness been embraced and supported by a toolset with deep change discovery and risk analysis, the risk of an expiring certificate could have been detected and the outage avoided.
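As a rough illustration of detecting that kind of risk ahead of time, the Python sketch below flags certificates that are close to expiry. It assumes an inventory mapping certificate names to `notAfter` strings in the text format returned by `ssl.SSLSocket.getpeercert()`. The inventory contents, the function names, and the 30-day threshold are hypothetical, and nothing here reflects how the outage above was actually diagnosed.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now):
    """Whole days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the text format found in the dictionary returned by
    ssl.SSLSocket.getpeercert(), for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    return (expires - now).days


def expiring_soon(inventory, now, warn_days=30):
    """Return (name, days_left) pairs for certificates that expire within
    `warn_days`, soonest first. Already expired ones have negative days."""
    flagged = []
    for name, not_after in inventory.items():
        days_left = days_until_expiry(not_after, now)
        if days_left <= warn_days:
            flagged.append((name, days_left))
    return sorted(flagged, key=lambda item: item[1])


# Hypothetical inventory: certificate name -> notAfter string.
inventory = {
    "admin-portal": "Mar 10 00:00:00 2024 GMT",
    "public-site": "Dec 31 00:00:00 2024 GMT",
}
now = datetime(2024, 3, 1, tzinfo=timezone.utc)
print(expiring_soon(inventory, now))
```

In practice the inventory would come from discovery rather than a hand-written dictionary, and `now` would be the current time.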
Reconcile what is believed to have changed with what actually changed. Change and Configuration Management (CCM) tools administer the process of planned change creation, approval, and execution. However, tracking whether these changes were completed accurately is dependent on the frequency of discovery, its depth, and the technology stack instrumentation. Unfortunately, these dependencies often preclude a full understanding of the actual changes deployed, and an assumption is made that the CCM is accurate.
Not all changes that are made are effectively executed. They may fail or be deployed partially. Some changes may skip the approval process. Sometimes, this is an outcome of expediency in repairing a critical priority one (P1) issue. Restoring services is paramount and the official process is circumvented. Other times this occurs due to the lack of coordination between multiple points of change, those coming from an ITSM process and those deployed via rapid deployment out of DevOps.
To acquire change awareness, institute a process of change reconciliation using a granular change discovery that spans across the data center to the cloud, compares the changes discovered to those that are in the CCM, determines their completeness, detects unauthorized changes, and calculates the risk for each change.
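The reconciliation step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: approved changes and discovered changes are both available as dictionaries keyed by a hypothetical `(target, setting)` pair, and risk calculation is left out entirely. Real CCM records and discovery output will be richer than this.

```python
def reconcile(approved, discovered):
    """Compare approved change records with changes found by discovery.

    Both arguments map a (target, setting) key to the new value. The key
    shape is an assumption made for this sketch.
    """
    report = {"completed": [], "not_deployed": [], "mismatched": [],
              "unauthorized": []}
    for key, planned_value in approved.items():
        if key not in discovered:
            report["not_deployed"].append(key)   # approved but never seen
        elif discovered[key] == planned_value:
            report["completed"].append(key)      # deployed as planned
        else:
            report["mismatched"].append(key)     # deployed, but differently
    for key in discovered:
        if key not in approved:
            report["unauthorized"].append(key)   # seen but never approved
    return report


# Illustrative data only.
approved = {
    ("web-01", "nginx.version"): "1.25",
    ("db-01", "max_connections"): "500",
    ("web-02", "nginx.version"): "1.25",
}
discovered = {
    ("web-01", "nginx.version"): "1.25",
    ("db-01", "max_connections"): "200",
    ("web-01", "tls.cert"): "renewed",
}
report = reconcile(approved, discovered)
for status, keys in report.items():
    print(status, keys)
```

The four buckets correspond to the outcomes described above: changes completed as approved, changes that failed or were only partially deployed, and changes that skipped the approval process.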
Build a DevOps Closed Loop Sensitive to Change
Build a closed loop between DevOps deployments and their effectiveness, and be certain that the configuration of the environment being deployed into is what is expected.
Many DevOps organizations have begun to sacrifice code quality to innovate faster. While this can achieve a rapid cadence of new releases and bring newly designed functionality to customers, it can also fail due to a lack of change awareness.
To address this, use tooling that scans the CI/CD pipeline and pre-production environments to determine the changes actually expected from a change request or planned deployment. After deployment of the new code, generate a detailed change inventory of every actual code or configuration change, who made each change, and a comparison to what was expected. Incorporate this information into the review process for determining release success. Have the Site Reliability Engineering (SRE) team review the reconciliation of expected versus actual changes when planning for greater reliability. This process can greatly enhance the success rate and customer satisfaction of new releases.
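A post-deployment change inventory of this kind can be approximated by diffing two configuration snapshots, as in the Python sketch below. It assumes flat key/value snapshots taken before and after the deployment and a set of keys the release was expected to touch. The snapshot format and all names are hypothetical, and recording who made each change is omitted.

```python
_MISSING = object()  # marks a key that is absent from a snapshot


def change_inventory(before, after):
    """Diff two flat configuration snapshots into (key, kind) entries,
    where kind is "added", "removed", or "modified"."""
    inventory = []
    for key in sorted(before.keys() | after.keys()):
        old = before.get(key, _MISSING)
        new = after.get(key, _MISSING)
        if old == new:
            continue  # unchanged
        if old is _MISSING:
            kind = "added"
        elif new is _MISSING:
            kind = "removed"
        else:
            kind = "modified"
        inventory.append((key, kind))
    return inventory


def compare_to_expected(inventory, expected_keys):
    """Split the inventory into unexpected changes and expected ones
    that never showed up."""
    changed = {key for key, _ in inventory}
    unexpected = sorted(changed - set(expected_keys))
    missing = sorted(set(expected_keys) - changed)
    return unexpected, missing


# Illustrative snapshots only.
before = {"app.version": "1.4.0", "jvm.heap": "2g", "feature.flag": "off"}
after = {"app.version": "1.5.0", "jvm.heap": "4g", "tls.min": "1.2"}
inventory = change_inventory(before, after)
unexpected, missing = compare_to_expected(inventory, {"app.version", "tls.min"})
print(inventory)
print("unexpected:", unexpected, "missing:", missing)
```

The `unexpected` and `missing` lists are the kind of input a release review or an SRE team could use when judging whether a deployment went as planned.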
Move from a “build-break-fix” IT operations process to one that anticipates the risk of failure and takes action to avoid impact. Become proactive and measure risk before you make a change. If you don’t measure risk, you can’t manage it, and you are left fixing what might have been avoided.
Be the Enabler of Business Agility
As businesses transform digitally, senior management relies on IT to put in place the processes that enable business growth. While the high velocity of change makes this challenging, it does offer the opportunity for IT to be seen as the enabler of business agility. Incorporating change awareness into every aspect of IT will lead to a more effective, secure, and competitive digital enterprise.
Change is the Root of All Evil…or is it?
Is change the root of all evil? Only if your organization is unaware of changes in configuration, data, code, workload, infrastructure, and topology. But, with the change awareness that Evolven provides, you can now embrace change, capitalizing on it to drive innovation and showcase IT as an enabler of business growth. Evolven delivers change awareness with full discovery and analysis across the technology stack, change reconciliation determining the difference between authorized and unauthorized changes, root cause analysis, and determination of risk. For more information, visit Evolven at: www.evolven.com