Driving Enterprise-Scale DevOps: Super Sensors for an Autonomous Autobahn
There's a lot of uncertainty about driverless cars and trucks on the roads today, even though we can predict that autonomous vehicles will soon be able to negotiate trips with greater coordination, speed and safety than their human-operated counterparts.
Perhaps fear remains for good reason, because in complex traffic scenarios, any vehicle can encounter sudden changes at a velocity where physical things like bumpers, crumple zones and airbags, and even steering and brakes may not save the passenger, or a pedestrian.
At the highest speeds, sensors, predictive adjustments and telemetry—and the ability for a skilled driver to intervene—become far more important for autonomous car safety than all the physical countermeasures you can think of.
What does this have to do with DevOps? Just like our future self-driving cars, we’re seeing a plethora of autonomous tools associated with DevOps. Each has its own style of automation, data review, machine learning, error checking and sensory capabilities targeted at accelerating the continuous delivery of software and its associated operational infrastructure.
What will it take to make autonomous systems safe and reliable enough for enterprise-scale DevOps adoption?
An Intelligent Real-Time Ops Lane
Many experts in the autonomous vehicle space predict that in the near future, driverless cars will have their own dedicated lanes on any high-traffic roads. Since AI-based drivers wouldn’t need to deal with the unpredictable reaction times of human drivers, a dedicated autonomous lane could allow them to merge, ride and exit the flow of traffic without hesitation.
The parallels between this autonomous traffic lane concept and the automated software deployment and infrastructure pipelines offered to today’s DevOps teams are too strong to ignore.
DevOps is already in the fast lane for change, with the adoption of Agile methodologies, CI/CD pipelines, and deployment into hybrid IT environments, with ephemeral containers, microservices and auto-scaling infrastructure. Startups continually deliver highly functional and performant cloud-native applications faster than at any time in history.
But this speed also has profound implications for critical enterprise services.
Even tech titans who ‘move fast and break things’ like Facebook can still go down for the better part of a day, while trying to isolate the root cause of a configuration change on one server.
If the primary goal of autonomous IT automation is pushing changes at ever higher rates, the fine balance between speed and control is violated, leaving stability, compliance, and security at risk.
Self-Regulate, or Be Regulated
After a pedestrian fatality in Arizona, later investigations revealed that Uber’s self-driving test cars were not programmed to recognize unexpected changes, such as jaywalkers who cross the street outside of a crosswalk.
Even if autonomous driving fatalities are proven far less frequent per mile than the 30 thousand or more annual traffic deaths in the US due to everyday human operator errors, it only takes a few such accidents before regulators and local governments take notice.
We’re seeing similar issues for the adoption of DevOps-style automation in the enterprise, especially in regulated industries that have a strong mandate for compliance and control over processes and data: financial, healthcare, telco, pharma, government agencies, and so on.
Putting a thoroughly automated CI/CD pipeline in place can rapidly accelerate the pace of change, but what happens if the changes have unexpected outcomes? Is there a blind spot for change data that all the tests and monitoring solutions miss?
One major insurance company meticulously defined IaC (infrastructure as code) environment builds into the delivery process of their heavily used commercial application. While this increased the release frequency and performance of the app, the build process would intermittently add firewall policies in the wrong order, with the effect of disconnecting transactions in an entire cloud region.
Because the problem wasn’t easy to repeat at the application layer, the systems in the affected region seemed to operate normally, and the order of these changes wasn’t readily observable, it took several weeks for the firm to uncover the problem.
In the meantime, they fell away from agile, fully automated DevOps practices, going back to more rigid ITIL-style formal change requests and documented audit trails.
They no longer trusted the change management process underneath their large ongoing business the way a startup could, so they self-regulated to a slower, but safer method to avoid compliance issues.
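An ordering problem like the firewall example above is hard to see because the set of rules is correct and only their sequence differs. As a minimal sketch, assuming each rule has a unique name and that both the intended and the applied rule lists can be exported as ordered lists (the rule names and the `reordered_rules` helper here are hypothetical, not taken from the insurer's tooling), a check might compare the relative order of the rules the two lists share:

```python
from typing import List, Tuple


def reordered_rules(baseline: List[str], current: List[str]) -> List[Tuple[str, str]]:
    """Return (expected, actual) pairs where the shared rules appear in a different order.

    Rules that exist in only one list are ignored, so additions and removals
    are not reported here; only changes in relative order are.
    Assumes rule names are unique within each list.
    """
    shared = set(baseline) & set(current)
    baseline_seq = [rule for rule in baseline if rule in shared]
    current_seq = [rule for rule in current if rule in shared]
    return [(b, c) for b, c in zip(baseline_seq, current_seq) if b != c]


# Hypothetical rule names, for illustration only.
intended = ["allow-load-balancer", "allow-app-tier", "deny-all"]
applied = ["allow-load-balancer", "deny-all", "allow-app-tier"]

mismatches = reordered_rules(intended, applied)
for expected_rule, actual_rule in mismatches:
    print(f"expected {expected_rule!r} but found {actual_rule!r}")
```

Run after every environment build, a check of this kind would flag the misordered policies at the moment they were applied, rather than weeks later.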
Kind of sad to go back to the slow lane, though. It seems we need to automate the change management and control process with as much vigor as we automated delivery, if we want to mitigate the risk of speeding up again.
What’s missing here is change telemetry: the ability to measure the actual IT changes happening now, as well as see changes coming up within the line-of-sight of the ongoing software and IT Ops change management processes.
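At its simplest, measuring "the actual IT changes happening now" means comparing what an environment looked like before with what it looks like after. The sketch below assumes configuration can be captured as flat key-value snapshots. The keys, values and the `diff_snapshots` function are illustrative assumptions and do not describe any particular product:

```python
from typing import Any, Dict


def diff_snapshots(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, dict]:
    """Classify the differences between two configuration snapshots."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical settings captured before and after a deployment.
before = {"max_connections": "200", "tls_version": "1.2", "log_level": "info"}
after = {"max_connections": "500", "tls_version": "1.2", "cache_size": "64m"}

report = diff_snapshots(before, after)
for kind, entries in report.items():
    print(kind, entries)
```

A real change telemetry system would also need to record when each change happened and what triggered it. This sketch covers only the detection step.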
Check the Sensors: Change Telemetry for Vision at Speed
Autonomous cars ‘see in the dark’ using LIDAR and other similar forms of systematic vision, where light waves or lasers are bounced off objects in their path to provide telemetry for driving.
Much of the AI training of autonomous cars, therefore, centers around machine learning based on image recognition and location-aware feedback of changes in the car’s environment. The more it can learn to recognize the significance of all its sensory inputs, the more effectively it can conduct telemetry, to look ahead and react as an ideal human driver would with the safest possible driving course.
For DevOps to work within large enterprises, teams similarly need strong change sensors and telemetry within their automated systems to navigate at speed, and AI analytics, trained to recognize the significance of these inputs and react.
One large financial company has several thousand application development and operations resources, as well as integration partner resources with a hand in the continuous delivery processes.
The institution’s fintech engine was a paragon of automated DevOps, with comprehensive automated test suites, container orchestration, observability and APM tooling down to logs and traces, and AI-based network security. But with so many application workers independently driving their updates through the CI/CD pipeline, the extended enterprise environment started to break down, with several small server instances crashing.
The firm employed a change management solution from Evolven to ingest all of the incoming alerts and key data from their suite of observability tools, while monitoring for actual changes occurring within each environment. As the Evolven solution builds up a comprehensive machine learning dataset for change, it provides telemetry for autonomously navigating to the root cause of the most significant errors in the software release and IT upgrade paths, and for proactively identifying or remediating it before it causes issues at the point of change.
In this case, they were able to look ahead to uncover certain incompatible server settings and namespace conflicts that would only occur when automated deployments were concurrently executing changes at a massive scale.
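Conflicts that only appear under concurrency can, in principle, be looked for before deployments run by comparing what each one plans to change. The sketch below is not how the Evolven solution works. It is a simplified illustration that assumes each pending deployment can declare its planned writes as (deployment, namespace, key, value) tuples with string values, and all names in it are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

PlannedWrite = Tuple[str, str, str, str]  # (deployment, namespace, key, value)


def find_conflicts(planned: Iterable[PlannedWrite]) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Return targets that concurrent deployments would set to different values."""
    writers: Dict[Tuple[str, str], Dict[str, str]] = defaultdict(dict)
    for deployment, namespace, key, value in planned:
        writers[(namespace, key)][deployment] = value
    # A target is in conflict when its writers disagree on the value.
    return {
        target: by_deployment
        for target, by_deployment in writers.items()
        if len(set(by_deployment.values())) > 1
    }


# Hypothetical planned writes from three deployments scheduled at the same time.
planned = [
    ("deploy-a", "payments", "timeout", "30"),
    ("deploy-b", "payments", "timeout", "60"),
    ("deploy-a", "orders", "timeout", "30"),
    ("deploy-c", "orders", "timeout", "30"),
]

conflicts = find_conflicts(planned)
for (namespace, key), by_deployment in conflicts.items():
    print(f"{namespace}/{key} is set differently by: {by_deployment}")
```

Two deployments writing the same value to the same target are treated as compatible here, which is a design choice. A stricter check could flag any shared target at all.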
The Intellyx Take
Enterprise IT leaders always want their organizations to become more agile but face significant risk and frustration in embracing autonomous DevOps solutions.
Companies can lean on the safety bumpers of sophisticated automated testing, performance management, monitoring and security solutions. But at enterprise speed and scale, the change telemetry needed to navigate real-world change data and make real-time course corrections becomes far more important.
Despite our best aspirations to reach that asymptotic goal, there may never be a ‘100-percent self-driving’ IT function. Autonomous solutions need to augment the intelligence and responsiveness of humans to change, resolving many issues on their own, but also providing change telemetry for others, so DevOps teams can quickly identify the most important crises and take the wheel.
If enterprises can augment complete automation suites with trustable, forward-looking senses for IT change telemetry, they can start to lift the approval gateways that inhibited agility, and safely merge back into the autonomous DevOps fast lane.
©2019 Intellyx LLC. Intellyx retains final editorial control of this article. At the time of writing, Evolven is an Intellyx customer. None of the other companies mentioned in this article are Intellyx customers. Image source: Evolven (licensed)