Lessons Learned for Disaster Recovery
The power and destruction of Hurricane Sandy are still shocking to everyone, as is the scale of the recovery effort.
As reported by Charles Babcock, Hurricane Sandy challenged NYC data centers, where IT managers found unexpected problems as the storm surge roiled backup plans. Paul Venezia has provided another perspective, looking at lessons for the data center after Hurricane Sandy, saying "You never want to say 'I told you so,' but now is a good time to bring up the need for better monitoring, backup power, and other improvements."
On top of all the daily pressures that IT operations faces, disaster recovery is an ongoing effort that must be maintained, and it is often only truly tested when it is too late. What lessons from recent events can be applied to disaster recovery?
Automate consistency checks
In the article, Venezia says, "A disaster like Hurricane Sandy is exactly why we have DR planning. This is exactly why we hedge our bets and pay attention to the details of our infrastructure -- thus, when a Sandy happens, we don't lose everything. Now that we've seen an event occur that's nearly unprecedented in the area, perhaps those who have been busy discounting DR planning and expense may finally see the importance of those efforts. Sometimes it takes a lot to change a person's mind."
We have explored how, given the high volume of configuration changes in an organization, changes on the production side that do not immediately make their way to disaster recovery systems create a disparity between the two. That disparity further complicates a timely recovery and leaves systems vulnerable.
One step we suggest for being prepared is that "IT teams must validate the configuration consistency of both production and disaster recovery environments, ensuring that all configuration changes in production are mirrored by their disaster recovery environment and must constantly monitor for configuration drifts that occur over time."
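As a rough illustration of what such a consistency check involves, the sketch below compares two flat sets of settings and reports every key whose value differs or is missing on one side. The function name find_drift, the setting names, and the values are all hypothetical; a real environment would collect these settings from its own systems rather than from hard-coded dictionaries.

```python
# Minimal sketch of a production vs. disaster recovery consistency check.
# All names and values here are illustrative assumptions.

MISSING = "<missing>"  # placeholder for a setting absent on one side


def find_drift(production, recovery):
    """Return {setting: (production_value, recovery_value)} for every mismatch."""
    drift = {}
    for key in sorted(set(production) | set(recovery)):
        prod_value = production.get(key, MISSING)
        dr_value = recovery.get(key, MISSING)
        if prod_value != dr_value:
            drift[key] = (prod_value, dr_value)
    return drift


# Hypothetical settings gathered from each environment.
production = {"db.pool_size": "50", "app.version": "2.4.1", "tls.enabled": "true"}
recovery = {"db.pool_size": "20", "app.version": "2.4.1"}

drift = find_drift(production, recovery)
for setting, (prod_value, dr_value) in drift.items():
    print(f"{setting}: production={prod_value} recovery={dr_value}")
```

Running a check like this on a schedule, rather than only before a planned failover test, is what turns it from a one-time audit into monitoring.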
Make Change Management a continuous process
Venezia explains that "This is not just an opportunity for those affected by the storm. It's an opportunity for anyone, anywhere. If it can happen there, it can happen here. The next emergency may not come in the form of a hurricane, but to ignore or downplay the possibility of a significant geological or meteorological event impacting critical business operations is never a good move."
We have explored how IT Operations should be prepared, suggesting that "The dynamic nature of the modern data center makes new configuration management tools even more critical for disaster recovery. To ensure effective and timely recovery when disaster occurs, IT teams must validate the consistency of both production and disaster recovery environments, ensuring that all changes in production are mirrored by their disaster recovery environment. They must also constantly monitor for drifts that occur over time. There are numerous areas where drift happens – e.g., configuration tuning, maintenance, patches, releases, etc. As noted, due to the complexity and high volume of changes, especially events at granular levels, only an automated solution that discovers issues to an environment-wide scope can keep you ahead of the next event."
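One way to picture continuous monitoring across those areas is to fingerprint each area's settings and compare the fingerprints between environments on every run. The sketch below is only an assumption about how that could look: the area names are borrowed from the list above, while the functions fingerprint and drifted_areas and all sample values are hypothetical.

```python
import hashlib
import json


def fingerprint(settings):
    """Hash a settings mapping so that key order does not affect the result."""
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drifted_areas(production, recovery):
    """Return the names of areas whose settings differ between environments."""
    areas = sorted(set(production) | set(recovery))
    return [
        area
        for area in areas
        if fingerprint(production.get(area, {})) != fingerprint(recovery.get(area, {}))
    ]


# Hypothetical snapshots, grouped by the areas where drift tends to occur.
production = {
    "tuning": {"heap_mb": 4096},
    "patches": {"openssl": "3.0.13"},
    "releases": {"app": "2.4.1"},
}
recovery = {
    "tuning": {"heap_mb": 2048},
    "patches": {"openssl": "3.0.13"},
    "releases": {"app": "2.4.1"},
}

for area in drifted_areas(production, recovery):
    print(f"drift detected in area: {area}")
```

Comparing fingerprints keeps each run cheap, and a detailed setting-by-setting comparison is then needed only for the areas that are flagged.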
Have a reliable disaster recovery plan
Venezia added that this was also a learning opportunity: "This kind of wake-up call can help revolutionize an infrastructure. Decisions get made in the aftermath of significant catastrophic events that normally wouldn't even approach the table. While it may seem opportunistic to take advantage of a situation like this, all of that will disappear when the next event comes through, and those expenses and preparations reduce the business impact."
We looked at this area in our recent article, IT Survival Tips for Reducing Uncertainty and Preparing for the Worst, where we described how today's IT landscape makes it harder to ensure that disaster recovery measures are effective. As we noted, "A major challenge facing IT teams today is ensuring that they have a reliable disaster recovery plan that will allow them to emerge clean or with minimal impact when situations arise. With the complexity of IT environments and successful operations hinging on a high number of (often changing) environment components, the content and configuration of systems are vulnerable to incidents."