Change Awareness: The SRE’s Best Friend
We’re witnessing the rise of the SRE (Site Reliability Engineer, in original Google parlance, but now service reliability engineer to others) job title alongside the growth of the DevOps movement.
It seems SRE and DevOps roles are inextricably mixed now. Visit any company practicing DevOps and operating at scale in production, and they will tell you they are looking to recruit a few unicorn SREs.
The job reqs are pretty heavy though -- ‘Seeking software engineers skilled in all current development and delivery platforms, who also know operations inside and out. Sysadmin types who can respond to tickets and keep everything running flawlessly, while working with the DevOps team to break down silos, define infrastructure as code, and release faster and faster.’
Sounds easy enough, right?
Not so easy. While becoming an SRE is likely a financially rewarding career move, it’s not an easy one to make, and it has serious hazards -- chief among them the difficulty of keeping up with the ever-increasing pace of change.
Keeping up with change
If DevOps is considered a team sport, the SRE on each team might be the goalie, the stadium facility manager, and the play review referee up in the booth all rolled into one. SREs are generally associated with ensuring highly available ‘day 2+’ operations (or maybe ‘minute two’ at these release rates) of any software the DevOps teams put into production.
Let’s hear from Stephen Thorne, an actual SRE lead at Google, in his “Tenets of SRE” blog:
“In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).”
Quite a lot of responsibility to chew on there, especially when you consider the level of ephemeral containerization and continuous automated deployment going on at a cloud-native leader like Google. Change management is just one of the eight job functions listed for this role.
Then there is Pavan Belagatti, in his “Devops vs. SRE - The Dilemma” blog:
“SRE deals with monitoring applications or services after deployment to practice where automation is crucial to improving a system’s health and availability. She or he considers the role after the design work of a software developer.”
Clearly the rise of the SRE is a necessary response to post-deployment challenges of maintaining critical production services at scale in the modern IT world. So many traces, metrics and alerts to monitor. So many tickets and pages to answer.
But there’s one common oversight -- all too often, the SRE talks about change management as a post-mortem to the actual change itself. They are tracking down “what went wrong” rather than seeing the changes exactly as they happened.
SREs caught in the middle
SREs are drawn from the ranks of development or operations, and they are often some of the most cross-functionally talented people in IT.
As participants in DevOps, SREs are expected to collaborate effectively and break down organizational silos with empathy -- striking a balance between tapping the brakes for operational safety, and punching the gas pedal by equipping their colleagues with better delivery tools and faster issue resolution.
For example, SREs are subject to several competing forces:
- Velocity: Speed is essential to the DevOps transformation. Smaller, autonomous teams push change into the release pipeline ever faster -- sometimes with multiple releases a day, concurrent code branches and feature flags, or gradual blue-green staged deployments.
- Automation: SREs have the constant pressure to ‘automate all the things’ for efficiency: all environment configurations, builds, test and staging runs, deployment, and reporting. There’s no gain in velocity without tons of automation, but often the script or tool conducting the automation itself creates another potential point of change or failure to watch.
- Change control: Reliability requires the resiliency to perform under any conditions the world throws at an application, but it’s not the outside world that causes most problems. Issues usually happen because something changed within the enterprise’s computing estate. SREs must sometimes make the unpopular call to prevent releases or changes, and often need to roll back changes to a stable state as quickly as possible.
- Governance: Ultimately, everything the SRE does will need to be rolled up into an SLA (service level agreement) target for customer success, as well as possibly reporting into a compliance framework for industry or regulatory bodies. Even the executive boardroom wants system health reports, as IT is often the lifeblood of the business. All this documentation represents a ton of work.
All of these forces of change may make some IT old-timers in SRE roles long for the deliberate old days of ITIL, where you at least had systematically documented requirements and rigid approval gates as safeguards for each change.
Three ways to achieve automation with change awareness
Since we’re not slowing down, nor going back in time, how can SREs achieve better situational awareness of all changes, and avoid high-pressure issues?
1. Make automation change intelligent.
Whether infrastructure, operations and delivery automation are defined as code, or handled by a platform, SREs are expected to continuously improve operations efficiency and reliability.
Your automation must be intelligent enough not only to observe and report performance metrics and failures, but also to provide insights you can act upon.
DevOps teams should instrument builds and deployment pipelines with APIs that allow all change events to inform issues in ITOM and tickets in ITSM workflows. Issues can then be traced back rapidly to the changes that caused them, so the SRE can collaborate with the right people to mitigate those changes.
For instance, any function of the change control solution from Evolven can be called with an API, which gives issue resolution teams a common view into changes. The SRE can query Evolven from a ServiceNow or JIRA ticket, then use the change manifest to roll back changes until the failure disappears.
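To make the idea of tracing an issue back to recent changes concrete, here is a minimal sketch in Python. It is not Evolven's, ServiceNow's, or JIRA's API: the ChangeEvent schema, the suspect_changes helper, the two-hour lookback window, and the sample data are all assumptions made for illustration. It simply filters recorded change events down to those that touched the affected service shortly before the incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """One change reported by a pipeline step (hypothetical schema)."""
    change_id: str
    service: str
    source: str           # e.g. "ci-pipeline" or "manual"
    applied_at: datetime


def suspect_changes(events, service, incident_at, lookback=timedelta(hours=2)):
    """Return changes to `service` applied within `lookback` before the
    incident, newest first, as candidates for investigation or rollback."""
    window_start = incident_at - lookback
    hits = [
        e for e in events
        if e.service == service and window_start <= e.applied_at <= incident_at
    ]
    return sorted(hits, key=lambda e: e.applied_at, reverse=True)


# Illustrative data only: four changes and one incident at 12:00.
events = [
    ChangeEvent("chg-1", "checkout", "ci-pipeline", datetime(2019, 5, 1, 9, 0)),
    ChangeEvent("chg-2", "checkout", "manual", datetime(2019, 5, 1, 11, 30)),
    ChangeEvent("chg-3", "search", "ci-pipeline", datetime(2019, 5, 1, 11, 45)),
    ChangeEvent("chg-4", "checkout", "ci-pipeline", datetime(2019, 5, 1, 11, 50)),
]
suspects = suspect_changes(events, "checkout", datetime(2019, 5, 1, 12, 0))
print([c.change_id for c in suspects])
```

In practice the lookback window would be tuned per service, and the list of suspects would be attached to the ticket rather than printed.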
2. Take a proactive approach to managing environment change.
Any serious enterprise environment has a lot of moving parts, so SREs will still need to budget for error resolution efforts. But why wait until something fails?
If you want to minimize the risk of changes, you need to set a standard not just for how changes are authorized and executed, but also for thresholds that distinguish a ‘likely safe’ change from a ‘likely risky’ one, with the latter raising an alert on the SRE’s dashboard.
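A threshold of this kind can be sketched as a simple additive risk score. Everything in the following Python example is an assumption for illustration: the Change fields, the weights, and the threshold of 5 are invented, and a real policy would be derived from an organization's own incident history.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Attributes of a proposed or detected change (hypothetical fields)."""
    change_id: str
    authorized: bool
    touches_production: bool
    files_changed: int
    automated: bool


RISK_THRESHOLD = 5  # illustrative cut-off, not a recommended value


def risk_score(change):
    """Add up illustrative weights for each risk factor present."""
    score = 0
    if not change.authorized:
        score += 5   # unapproved changes are risky on their own
    if change.touches_production:
        score += 3
    if not change.automated:
        score += 2   # manual steps are harder to repeat and audit
    if change.files_changed > 20:
        score += 2
    return score


def classify(change, threshold=RISK_THRESHOLD):
    """Label a change so that only risky ones alert the SRE dashboard."""
    return "likely risky" if risk_score(change) >= threshold else "likely safe"


routine = Change("chg-10", authorized=True, touches_production=False,
                 files_changed=3, automated=True)
big_manual = Change("chg-11", authorized=True, touches_production=True,
                    files_changed=50, automated=False)
print(classify(routine), "/", classify(big_manual))
```

Keeping the weights in one place makes the policy easy to review and adjust as the team learns which kinds of change actually cause incidents.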
3. Drive out inefficiency, and drive out inconsistency.
Reliable services require consistent environments. A relentless focus on automation can decrease cycle times by reducing rote or repeatable work, and specifying a consistent, golden state environment, every time.
But there’s also a human factor at play in complex cloud-native or distributed service environments -- which have dozens or hundreds of possible contributors or stakeholders.
Inevitably, these constituents, even if they mean well, make changes that were neither planned nor requested.
A major financial services firm uses Evolven to trap for any such inconsistencies, so if an Ops team member quietly reboots a virtual server, an OS gets a ‘routine’ update, or a developer inadvertently leaves a testing agent turned on, SREs are alerted.
Often such unplanned changes can be remediated with automated rollbacks without paging anyone in the middle of the night, so SREs can investigate and remediate any concerns in the light of day.
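Detecting unplanned changes of this sort amounts to comparing the observed environment against its golden state. The Python sketch below shows that comparison in its simplest form. It does not represent how Evolven works internally; the detect_drift helper, the setting names, and the values are hypothetical.

```python
MISSING = "<absent>"  # sentinel for a setting present on only one side


def detect_drift(golden, observed):
    """Return {setting: (expected, actual)} for every setting whose
    observed value differs from the golden state."""
    drift = {}
    for key in golden.keys() | observed.keys():
        expected = golden.get(key, MISSING)
        actual = observed.get(key, MISSING)
        if expected != actual:
            drift[key] = (expected, actual)
    return drift


# Illustrative golden state and a snapshot taken from a running server.
golden = {"os_patch_level": "2019-04", "test_agent": "off", "replicas": 3}
observed = {"os_patch_level": "2019-05", "test_agent": "on", "replicas": 3,
            "debug_port": "open"}

for setting, (expected, actual) in sorted(detect_drift(golden, observed).items()):
    print(f"{setting}: expected {expected}, found {actual}")
```

Each drifted setting could then be routed either to an automated rollback or to a daytime review queue, depending on how risky it is judged to be.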
The Intellyx Take
It sure would be nice to have fully automated, lights-out change management, but given the rate of innovation in enterprise IT environments, that’s a pipe dream.
Intelligent automation and intelligent change control today are much more about empowering DevOps teams, and the SREs who keep everything humming, to focus on what matters: application reliability and customer experience.
With observability and insight into change, the SRE can get ahead of change, automating the prevention of many change-related issues, and aligning IT resources in times of crisis, to take action on any changes that affect critical applications.
©2019 Intellyx LLC. Intellyx retains final editorial control of this article. At the time of writing, Evolven and ServiceNow are Intellyx customers. None of the other companies mentioned in this article are Intellyx customers.