5 IT Operations Lessons Learned from Jurassic Park
System outages seem inevitable in IT. Just about every IT professional has lived through one: the moment you realize that something your company relies on -- and that you support -- is down, out, or dead. Whether it's email, Internet access, a server, or a storage platform your organization depends on, a vital link is severed when systems go down.
A prime example of system failure worth exploring is what went down at Jurassic Park.
Yes, Jurassic Park, Spielberg's classic movie about a wildlife preserve full of cloned dinosaurs, where the park's risk management fell flat. At the center of the movie, complexity is one of the main sources of risk. In this post I examine the change and incident management blunders that contributed to the park's failures, and what they mean for IT operations.
What Happened at Jurassic Park?
So, as a quick refresher: John Hammond, founder and CEO of the bioengineering company InGen, created a dinosaur theme park called Jurassic Park on Isla Nublar, a tropical island in an isolated Central American bay. During a tropical storm, the park's computer programmer, Dennis Nedry, who had been bribed by a corporate rival to steal dinosaur embryos, carried out an unauthorized change by deactivating the park's security systems. Jurassic Park showcased some great and amazing applications of modern technology, but ultimately suffered some pretty serious failures.
Change Management Lessons at Jurassic Park
It was really a series of change management issues that played out there:

#1 Overwhelming Complexity

Item one fifty-one on today's glitch list. We've got all the problems of a major theme park and a major zoo, and the computer's not even on its feet yet.
Even after 'the incident' at the beginning of the movie, proper responses to glitches and issues were delayed because of all the complexity involved in operating Jurassic Park. As the character Ian Malcolm (an outside expert brought in to assess Jurassic Park's vulnerabilities) noted, complexity is the source of the risks at the center of Jurassic Park. People had limited visibility into what was happening at any level, and some of the indicators were wrong, limiting their ability to really understand what was going on.
Sound familiar? The IT world today is rich in complexity that was relatively scarce in decades past. Applications and architectures designed for a poorly connected world are ill equipped to deal with today's and tomorrow's abundance of data and rising demand for changes and updates.
There is so much complexity in IT environments that IT operations no longer has the necessary visibility, and it isn't going to get that visibility by continuing to use the same old tools and practices of manual discovery and tribal knowledge. Just as Donald Rumsfeld (then US Secretary of Defense) described, 'unknown unknowns' -- what we don't know that we don't know -- are the things that can undermine and surprise IT operations.
In response, IT operations has built defenses against these 'unknown unknowns.' Yet how do you defend yourself against something you don't know about? This is really hard to do, and it fosters a culture of fear that leads to overreaction. (The Real Problem is Visibility)
#2 Over-Reliance on Automation

The system's compiling for eighteen minutes, or twenty. So, some minor systems may go on and off for a while. There's nothing to worry about.
Jurassic Park's considerable dependence on automated information systems, which removed human judgment and the ability to intervene, put too much trust in the "elegance of design." The driverless cars, for example, were cutting edge, but they were totally dependent on having ALL systems working. Once power was restored, the park was running on backup generators; the main power was supposed to come back on, but it never did, bringing on yet another failure.
In release deployment, one would assume that by automating all deployments, everything runs as planned -- no surprises, right? Not exactly. Errors from automated processes, often introduced during configuration updates, have been implicated in unexpected performance or availability degradation, where a misconfigured server, network device, or application jeopardized business-critical services. With deployment automation, where only repeatable activities are automated, IT operations is left wondering what an automated platform actually does, what the impact of changing deployment assets would be, and ultimately what the actual configuration of their managed environments is. (Managing Application Changes Requires More than Just Automation)
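One way to catch this kind of silent misconfiguration is to verify, after every automated deployment, that the configuration actually in place matches what the release assumed. The sketch below is a minimal illustration of that idea; the function name, configuration keys, and values are invented for the example, not taken from any specific tool.

```python
# Hypothetical post-deployment check: diff the configuration an automated
# deployment actually produced against the baseline the release assumed.

def find_config_drift(expected: dict, actual: dict) -> dict:
    """Return settings whose deployed value differs from, or is missing
    versus, the expected baseline, plus settings that appeared unexpectedly."""
    drift = {}
    for key, want in expected.items():
        got = actual.get(key)
        if got != want:
            drift[key] = {"expected": want, "actual": got}
    # Flag settings that showed up without being part of the release.
    for key in actual.keys() - expected.keys():
        drift[key] = {"expected": None, "actual": actual[key]}
    return drift

expected = {"max_connections": 500, "tls": "on", "heap_mb": 2048}
actual   = {"max_connections": 150, "tls": "on", "heap_mb": 2048, "debug": "on"}

for key, diff in sorted(find_config_drift(expected, actual).items()):
    print(f"{key}: expected {diff['expected']!r}, found {diff['actual']!r}")
```

Run after each deployment, a check like this turns "the automation ran without errors" into "the environment is actually in the intended state" -- which is not the same thing.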
#3 Poor Communications
There is a problem with that island. It is an accident waiting to happen.
Jurassic Park was plagued by a failure of communication and of organizational checks and balances. A single programmer, Dennis Nedry, implemented the majority of the park's critical technology systems, and he deliberately sabotaged them for his own gain. Control of the information systems was not shared, no backups were made, and there was no succession planning. When asked whether the system was working well and free of problems, the technology manager responded, "We've got endless problems here." Minor problems created larger ones.
In a recent real world outage, Thorsten Von Eicken noted in regards to Amazon Web Services, "In my opinion the biggest failure in this event was Amazon's communication, or rather lack thereof. The status updates were far too vague to be of much use and there was no background information whatsoever. Neither the official AWS blog nor Werner Vogels' blog had any post whatsoever 4 days after the outage!"
#4 No Disaster Recovery
If I understand correctly, all the systems will come back on their original start-up modes, correct?
At Jurassic Park, the possibility of a system outage had never been fully tested, and there was no apparent disaster recovery plan. The park lacked proper maintenance procedures and ran on the fly, taking "calculated risks" based on "theoretic" outcomes. Once Nedry had run his diversionary program, systems began to fail: electric fences were no longer electrified, security doors no longer worked, phones were dead, along with countless other things. With no business continuity and disaster recovery planning, the only option was to have everyone take the shuttle to the dock.
In addition to all the daily pressures that IT operations faces, disaster recovery is an ongoing effort that must be maintained, and oftentimes it is only really tested when it is too late. With a high volume of configuration changes arriving at a dynamic pace, changes on the production side that don't immediately make their way to disaster recovery systems leave the two environments in disparity, complicating a timely recovery and leaving operations vulnerable. To ensure effective and timely recovery when disaster strikes, IT teams must validate the consistency of both production and disaster recovery environments, ensuring that all changes in production are mirrored in the disaster recovery environment. They must also constantly monitor and analyze for drift that occurs over time through configuration tuning, maintenance, patches, releases, etc. (Lessons Learned for Disaster Recovery)
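The production/DR consistency check described above can be sketched as a simple diff between component inventories of the two environments. Everything below is illustrative: the function name is hypothetical, and real inventories would come from a CMDB or discovery agents rather than hard-coded dicts.

```python
# Illustrative sketch: validate that a disaster-recovery environment mirrors
# production by diffing component-version inventories from each side.

def dr_consistency_report(prod: dict, dr: dict) -> list:
    """List every component whose DR state does not mirror production."""
    issues = []
    for component, prod_ver in sorted(prod.items()):
        dr_ver = dr.get(component)
        if dr_ver is None:
            issues.append(f"{component}: missing from DR (prod has {prod_ver})")
        elif dr_ver != prod_ver:
            issues.append(f"{component}: prod={prod_ver} dr={dr_ver}")
    for component in sorted(dr.keys() - prod.keys()):
        issues.append(f"{component}: present only in DR")
    return issues

prod = {"app-server": "4.2.1", "db-schema": "2024.06", "nginx": "1.24"}
dr   = {"app-server": "4.2.0", "db-schema": "2024.06"}

for line in dr_consistency_report(prod, dr):
    print(line)
```

Running a report like this on a schedule, rather than during an actual disaster, is what keeps the drift described above from being discovered only when recovery is already underway.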
#5 Complacency

Genetic power is the most awesome force ever seen on this planet. But you wield it like a kid who's found his dad's gun.
The operations people at Jurassic Park got lazy and complacent, growing sloppy in how they carried out tasks and forgetting to do certain things. An IT organizational culture that lacks code and design review processes, where no other employee understands how the code works or gives input on critical design decisions, is a recipe for disaster. With Nedry's death, no one was able to immediately remedy the system glitches and failures.
Proactive change management takes the complacency out of the process. We saw this with a customer: when the Evolven IT Operations Analytics application began e-mailing change management reports every morning across the organization, the operations team was surprised to discover that the number of daily changes rapidly decreased. A quick investigation revealed that developers, infrastructure administrators, and other application stack stakeholders simply avoided unnecessary changes, since with Evolven's analytical capabilities those changes would become visible and raise questions. (Manufacturing Happiness? How a Customer Improved Stability and Radically Cut Unplanned Changes)
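The daily report in that story is, at its core, just an aggregation of detected changes by owner, with outliers flagged for review. The toy sketch below shows the idea; the record format, the owner names, and the review threshold are all invented for illustration and are not Evolven's actual implementation.

```python
# Toy illustration of a daily change-management summary: count detected
# changes per owner and flag anyone above a (made-up) review threshold.
from collections import Counter

def daily_change_summary(changes: list, threshold: int = 5) -> list:
    """Summarize change records as 'owner: N change(s)' lines, busiest first."""
    counts = Counter(c["owner"] for c in changes)
    lines = []
    for owner, n in counts.most_common():
        flag = "  <-- review" if n > threshold else ""
        lines.append(f"{owner}: {n} change(s){flag}")
    return lines

# Fabricated example records standing in for a day's detected changes.
changes = (
    [{"owner": "dev-team", "item": f"app.cfg change {i}"} for i in range(7)]
    + [{"owner": "infra", "item": "fw-rule-12"}]
)

print("\n".join(daily_change_summary(changes)))
```

Even a summary this simple changes behavior: once every change is counted and attributed each morning, unnecessary ones tend to stop happening.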
Are you ready when the next performance incident strikes?