Storm clouds: Why IT incidents still occur in cloud environments, and how to successfully troubleshoot them
The cloud revolution, which started a number of years ago, has reached massive volume and impact. Just last year, a survey conducted by IDG discovered that over 70% of companies have at least one cloud-based application or infrastructure.
A previous survey by Forrester found that nearly 70% of decision makers rank the formation of a solid cloud strategy as nothing short of critical.
In 2019 and beyond, we can expect to see the power and influence of cloud computing grow even stronger. Companies that have been using cloud solutions for years are ready to take things to the next level, and those last few who have yet to embrace the cloud are likely to do so.
The IT perspective
From an IT perspective, cloud computing offers many clear advantages. When done right, cloud-based solutions provide enhanced stability, agility, and flexibility around the provisioning, deployment, and patching of applications and infrastructures.
Enterprises enjoy a much faster path towards “zero-touch” IT systems while simultaneously reducing the development, delivery, and administration efforts that otherwise would be required to reach maximally efficient operations.
However, moving to the cloud does not come without possible issues, and, in the following paragraphs, we examine some of the weaker points associated with adopting this transformative technology. In addition, we’ll take a closer look at the IT operational perspective, offering new ways to manage cloud and cloud-based systems.
A cloud hanging over
When major cloud systems shut down, they take a long list of companies with them.
We witnessed a major episode in late June, when a Cloudflare outage affected Google, Amazon, Reddit, and many others. Another unfortunate example was Salesforce, whose platforms suffered an outage that same month.
In an ideal world, the cloud does lead to enhanced stability through auto scaling and self-healing. However, ‘cloud unique’ issues can still occur in spite of these capabilities. Application defects, scalability-related bottlenecks, connectivity issues, data issues, database performance problems, complications with access rights, permissions, and certificates, security vulnerabilities, and more can all lead to performance and availability problems.
Here are a few of the more common issues to which cloud environments can be vulnerable:
- IaaS (Infrastructure as a Service) issues: Possible problems to which this type of structure can be subject include (1) the misconfiguration of an infrastructure component, in which a single problematic patch, for instance, affects the entire system and (2) the failure of a customized infrastructure component due to the difficulty in standardizing the entire infrastructure stack.
- PaaS (Platform as a Service) issues: While this structure eliminates the danger of infrastructure issues such as those mentioned under IaaS above, the level of complexity in application setup management and deployment, which is designed to satisfy a variety of application requirements, makes this structure extremely sensitive to configuration changes.
- SaaS (Software as a Service) issues: The enhanced transparency offered to both IT admins and users increases the flexibility of these configurations but also their complexity, primarily due to the need for customizations.
- Application-based issues: Microservices architecture can boost scalability and resilience on the one hand but might also expose the system to risk as the number of services grows due to the increasing complexity of the services map and the growing number of interdependencies.
This is not a complete list of the possible risks. It is important to note, however, that many of the listed risks, as well as others not included here, are attributable to changes in cloud environments, whether made by humans or machines, automatically or manually. Each change in the code, configuration, workload, application data, or system capacity in the cloud can trigger stability issues and so affect the entire system.
Every cloud has a silver lining
Now that we’ve briefly listed the issues that may occur in cloud environments, we can look at ways to effectively mitigate risks.
Changes are key
The first essential is full visibility of all changes that have been made across the entire cloud environment. In other words, because issues occur as the result of actions that were taken, CloudOps teams must be able to view the entire change history, from development to production, in order to identify which changes precipitated the issues that have occurred.
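To make the idea of a change history concrete, here is a minimal sketch of one way to surface changes: comparing two snapshots of a component's settings and recording every difference. The snapshot contents, the setting names, and the `diff_snapshots` function are hypothetical and chosen only for illustration; this is not a description of how any particular product collects changes.

```python
def diff_snapshots(before: dict, after: dict) -> list:
    """Return one record per setting that was added, removed, or modified."""
    changes = []
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            changes.append({"key": key, "kind": "added", "old": None, "new": after[key]})
        elif key not in after:
            changes.append({"key": key, "kind": "removed", "old": before[key], "new": None})
        elif before[key] != after[key]:
            changes.append({"key": key, "kind": "modified", "old": before[key], "new": after[key]})
    return changes


# Hypothetical snapshots of one service's settings, taken before and after a deployment.
before_deploy = {"image": "web:1.4.2", "replicas": 3, "timeout_s": 30}
after_deploy = {"image": "web:1.4.3", "replicas": 3, "tls_cert": "cert-2019"}

for change in diff_snapshots(before_deploy, after_deploy):
    print(change["kind"], change["key"], change["old"], "->", change["new"])
```

Storing such records with a timestamp and the environment in which they were observed would, under these assumptions, give a team a basic history to search when an incident occurs.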
Once an issue has occurred, there must be a way to correlate it with the specific changes that caused it, so that the changes to be remediated can be recognized quickly and the problem resolved.
My belief is that we can’t fully prevent harmful changes from occurring, no matter how much we try. Even though Evolven helps cloud teams minimize unauthorized or harmful changes, history has taught us that they will never be completely eliminated.
Even though incident remediation is much easier in cloud environments (for example, due to the ease of using automated rollbacks), we still need to troubleshoot the problems that trigger these incidents. Unlike a rollback, an investigation in a cloud environment can take a long time, and that can become a real challenge. Both the dynamic nature of the cloud and its automated recovery mechanisms make incidents difficult to reproduce.
Tracing a problem back to its root cause (which is usually a change in the IT environment) is not a band-aid solution; it demands in-depth work and dedicated technologies.
The good news is, however, that this process can be completed before any repeated harm is done, thus preventing additional related incidents from taking place.
Step #1: Tracking, monitoring, and smartly analyzing granular changes
“85% of all performance incidents can be traced back to changes.” (Will Cappelli, Research VP, Gartner)
IT teams know how to manage processes, as they have policies, tools, and advanced technologies to help them to do so. However, they don’t usually know what changes have actually taken place. In that sense, the loop isn't closed. Moreover, cloud-based IT processes are expected to be automated, thereby worsening IT teams' blindness as opaque machines automatically make changes in cloud environments.
To close this gap, a technology is needed that not only identifies (i.e., tracks, monitors, and registers) all actual changes but, because the number of such changes is virtually endless, also uses machine learning to analyze them and assign every change a “risk factor” based on multiple parameters. This is the only way to enable a cloud team to cut through the noise and identify the changes that are the potential root cause of an issue.
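As a rough illustration of what a “risk factor” based on multiple parameters might look like, the sketch below scores a change with a simple hand-weighted heuristic. It is deliberately not machine learning: the parameters (environment, authorization, number of dependent services), the weights, and the names `Change` and `risk_factor` are all assumptions made for this example, and a learned model would replace the fixed weights.

```python
from dataclasses import dataclass


@dataclass
class Change:
    description: str
    environment: str         # e.g. "dev", "staging", "production"
    authorized: bool         # was the change part of an approved process?
    dependent_services: int  # how many services rely on the changed component


# Hypothetical hand-picked weights; a trained model would replace these.
ENVIRONMENT_WEIGHT = {"dev": 0.1, "staging": 0.3, "production": 0.6}


def risk_factor(change: Change) -> float:
    """Score a change between 0.0 (low risk) and 1.0 (high risk)."""
    # Unknown environments fall back to a middle-of-the-road weight.
    score = ENVIRONMENT_WEIGHT.get(change.environment, 0.3)
    if not change.authorized:
        score += 0.2
    # Each dependent service adds a little risk, capped at 0.2 in total.
    score += min(change.dependent_services * 0.02, 0.2)
    return round(min(score, 1.0), 2)


changes = [
    Change("tune log level", "dev", True, 0),
    Change("edit load balancer rule", "production", False, 15),
    Change("bump library version", "staging", True, 5),
]

# Review the riskiest changes first.
for change in sorted(changes, key=risk_factor, reverse=True):
    print(risk_factor(change), change.description)
```

Sorting by score is the point of the exercise: with thousands of changes, the team looks first at the handful most likely to matter.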
Step #2: Machine learning based ‘blended analytics’
Besides being able to track all changes in your cloud environment, you will still need to use multiple technologies in order to monitor for performance issues. Once an issue is identified, it needs to be linked back to the specific changes that are, with high probability, its root cause.
Is this comparable to searching for a needle in a haystack? Without the right technology by your side, it is.
In a process I call ‘blended analytics,’ Evolven draws on the other technological tools to which it has access, analyzing the changes made in a specific environment and linking them to the issues reported by these tools.
Machine learning is used for noise reduction, pattern recognition, smart correlations, and more. Then, links to potential root-causes are created.
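The linking step can be pictured with a much simpler stand-in than machine learning. The sketch below takes an alert reported by a monitoring tool and ranks the changes that preceded it within a lookback window, placing changes to the alerting component first and breaking ties by recency. The record fields, the component names, the 24-hour window, and the `rank_candidates` function are assumptions for illustration only, not Evolven's actual method.

```python
from datetime import datetime, timedelta


def rank_candidates(alert: dict, changes: list, lookback: timedelta = timedelta(hours=24)) -> list:
    """Return changes made shortly before the alert, most likely candidates first."""
    window_start = alert["time"] - lookback
    in_window = [c for c in changes if window_start <= c["time"] <= alert["time"]]
    # Changes to the alerting component come first; ties are broken by recency.
    return sorted(
        in_window,
        key=lambda c: (c["component"] == alert["component"], c["time"]),
        reverse=True,
    )


# Hypothetical change history and a hypothetical alert from a monitoring tool.
changes = [
    {"id": "chg-1", "component": "checkout-db", "time": datetime(2019, 7, 1, 9, 0)},
    {"id": "chg-2", "component": "checkout-api", "time": datetime(2019, 7, 1, 13, 30)},
    {"id": "chg-3", "component": "search", "time": datetime(2019, 7, 1, 13, 50)},
    {"id": "chg-4", "component": "checkout-api", "time": datetime(2019, 6, 28, 10, 0)},
]
alert = {"source": "apm", "component": "checkout-api", "time": datetime(2019, 7, 1, 14, 0)}

for candidate in rank_candidates(alert, changes):
    print(candidate["id"], candidate["component"], candidate["time"])
```

Changes to other components stay in the list on purpose: in a system with many interdependent services, the root cause is not always a change to the component that raised the alert.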
Every new technology brings major potential for success alongside serious potential damage. Cloud computing is no different. As we witness the adoption of this technology grow, we must also recognize its weaknesses and learn how to strengthen them from within.
The increased flexibility that cloud-based solutions make available also comes with issues that might bring down major organizations and harm many companies and users in the process. Addressing these dangers requires further automation of cloud operations, including intelligent controls and analytics that enable the self-management, self-regulation, and self-healing of the cloud.