
IT Survival Tips for Reducing Uncertainty and Preparing for the Worst




When uncertainty enters your infrastructure operations, specifically your application infrastructure, there is real reason for concern. A few years back, a single application's availability and functionality relied only on the platform and the network it resided on.

The last thing you want to do is comb through configuration files and event logs in search of a misconfiguration that slowed performance. Such a misconfiguration can escalate into an IT incident that takes down mission-critical applications. Whereas a 24-hour outage was once considered a problem but not a disaster, today's 'real-time' standards of availability make even one hour of downtime intolerable.

Today, with IT initiatives such as agile development, virtualization and cloud technologies, as well as ongoing release management, the possibility of failure has expanded. IT teams are under increased pressure to make application and software deployment faster and more efficient while maintaining a more complex mix of systems and environments.

Despite the best efforts of IT organizations to guard against failure with numerous safeguards, a major source of today's failures can be attributed to this complexity and to simple human error.

Greater Complexity

When it comes to environment configuration, the devil is indeed in the details.

IT environments are complex. A typical environment includes thousands of configuration parameters (IBM WebSphere Application Server alone holds over 16,000), and misconfiguring or overlooking a single setting can cause an incident with major impact on the IT environment and the business service. Now, with the growth of virtual servers, IT organizations can maintain tens or hundreds of multi-environment stand-alone servers, each running its own set of mission-critical applications, usually from a single console. This complexity has made the datacenter more productive, but it comes at a cost.

An application needs to be both tested in various environments and implemented properly. Today's complexity adds more risk to implementation. For example, configuration settings appropriate only to the test environment can be carried over into production during deployment. These errors can be overlooked and discovered only when performance slows.
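As a rough illustration of catching test-only settings before they reach production, the following Python sketch scans a configuration for values that look like they belong to a test environment. The marker list, the setting names and the sample values are all hypothetical, chosen only for the example; a real check would depend on your own naming conventions.

```python
# Hypothetical markers that suggest a value was meant for a test environment.
# These are assumptions for illustration, not a standard list.
TEST_MARKERS = ("test", "staging", "localhost", "127.0.0.1")


def find_test_leftovers(config, markers=TEST_MARKERS):
    """Return the settings whose value contains one of the test markers."""
    leftovers = {}
    for key, value in config.items():
        text = str(value).lower()
        if any(marker in text for marker in markers):
            leftovers[key] = value
    return leftovers


# Made-up production settings; two of them still point at test resources.
production_config = {
    "db.url": "jdbc:db2://test-db01:50000/ORDERS",
    "cache.size.mb": 2048,
    "log.level": "INFO",
    "smtp.host": "localhost",
}

if __name__ == "__main__":
    for key, value in find_test_leftovers(production_config).items():
        print(f"Suspicious production setting: {key} = {value}")
```

A simple substring match like this will produce false positives (a legitimate value that happens to contain "test", for instance), so it is best treated as a prompt for human review rather than a gate.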

For large IT enterprises, maintaining high system availability means integrating a varied mix of technology, which results in greater complexity and introduces more risk.

Beyond Just Disaster Recovery Plans

Unfortunately, today's IT landscape makes it harder to ensure that disaster recovery measures will work. Some organizations already know, and others are still learning, that merely having a disaster recovery plan in place is not enough. While these plans are often carefully crafted, they can fail without adequate configuration management capabilities in place. A major challenge facing IT teams today is ensuring that they have a reliable disaster recovery plan that lets them emerge unscathed, or with minimal impact, when incidents arise. Because of the complexity of IT environments, and because successful operations hinge on a large number of (often changing) environment components, the content and configuration of systems are vulnerable to incidents.

Staying Ahead of Disaster

Given the high volume of changes, when changes on the production side do not make their way immediately to disaster recovery systems, the result is a disparity between systems that makes timely recovery more difficult and leaves you vulnerable when failure strikes. To ensure effective and timely recovery when disaster occurs, IT teams must validate the consistency of production and disaster recovery environments, ensuring that all changes in production are mirrored in the disaster recovery environment. They must also constantly monitor for drift that occurs over time. Drift can arise in numerous areas: for example, configuration tuning, maintenance, patches and releases. As noted, because of the complexity and high volume of changes, especially changes at a granular level, only an automated solution that discovers issues across the entire environment can keep you ahead of the next outage.
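To make the idea of validating consistency between production and disaster recovery concrete, here is a minimal Python sketch that compares two configuration snapshots and reports every setting that differs or exists on only one side. The snapshot contents and setting names are invented for the example, and it assumes each environment's configuration has already been collected into a flat dictionary; collecting it is the hard part in practice and is not shown.

```python
# Sentinel for a setting that is absent from one of the environments.
MISSING = object()


def find_drift(production, recovery):
    """Map each drifted setting to its (production, recovery) values."""
    drift = {}
    # Walk the union of keys so one-sided settings are reported as well.
    for key in sorted(production.keys() | recovery.keys()):
        prod_value = production.get(key, MISSING)
        dr_value = recovery.get(key, MISSING)
        if prod_value != dr_value:
            drift[key] = (prod_value, dr_value)
    return drift


def describe(value):
    """Render a value for a report, naming absent settings explicitly."""
    return "<not set>" if value is MISSING else repr(value)


# Hypothetical snapshots: production was tuned and patched, but the
# disaster recovery site was not updated to match.
production = {"jvm.heap.max": "4096m", "patch.level": "7", "pool.size": 50}
recovery = {"jvm.heap.max": "2048m", "pool.size": 50}

if __name__ == "__main__":
    for key, (prod_value, dr_value) in find_drift(production, recovery).items():
        print(f"{key}: production={describe(prod_value)} recovery={describe(dr_value)}")
```

Run on a schedule, a comparison like this surfaces drift soon after it appears instead of during a recovery attempt, though a production-grade tool would also need to handle nested settings and values that are expected to differ between sites, such as hostnames.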

Is your disaster recovery ready?

Learn more about how you can keep your complex environments performing to expectations, and see how automated configuration management and change management enhance disaster recovery.
About the Author
Alex Gutman and Martin Perlin