Most Common Change and Configuration Issues in 2011
Despite advances in IT infrastructure, we still suffer from painful downtime.
A recent Gartner study projected that "Through 2015, 80% of outages impacting mission-critical services will be caused by people and process issues, and more than 50% of those outages will be caused by change/configuration/release integration and hand-off issues." (Ronni J. Colville and George Spafford)
The fact is that change and configuration errors are very common. Here are some of the most common change and configuration issues that we have seen in our customers' environments.
Add your change and configuration stories in the comments section after the article.
Connection string not updated
The production environment suddenly points to a component (e.g. database) in pre-production.
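One way to catch this before a release goes live is to check every connection string against the hosts an environment is allowed to reach. The following is a minimal sketch, assuming connection strings are written as URLs and that hosts follow a naming convention per environment. The host suffixes and the function name are hypothetical and not taken from any particular product.

```python
from urllib.parse import urlparse

# Hypothetical naming convention: host suffixes each environment may connect to.
ALLOWED_HOST_SUFFIXES = {
    "production": (".prod.example.com",),
    "pre-production": (".preprod.example.com",),
}


def connection_host_allowed(environment: str, connection_url: str) -> bool:
    """Return True if the host in connection_url belongs to the given environment.

    Raises KeyError for an environment that is not listed above.
    """
    host = urlparse(connection_url).hostname or ""
    return host.endswith(ALLOWED_HOST_SUFFIXES[environment])


if __name__ == "__main__":
    # A production deployment that still points at a pre-production database.
    url = "postgresql://app@db1.preprod.example.com:5432/orders"
    if not connection_host_allowed("production", url):
        print("connection string points outside production:", url)
```

A check like this could run as a deployment gate, so that a forgotten connection string fails the release instead of surfacing in production.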
File in use during deployment
A file being deployed to a server is held open by a running process, so the replacement does not succeed.
Manually changed configuration parameter
A configuration parameter was changed manually in production, but the change was not reported back to development. A later release overwrites the change, reintroducing the issue that the change had fixed.
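Comparing the configuration a release is about to ship with what is actually live would surface such manual changes before they are overwritten. Below is a minimal sketch, assuming both configurations have already been loaded into flat key/value dictionaries. The parameter names are made up for illustration.

```python
def config_drift(expected: dict, actual: dict) -> dict:
    """Return {key: (expected_value, actual_value)} for every key that differs.

    A key missing on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(expected.keys() | actual.keys()):
        want, have = expected.get(key), actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    # Hypothetical values: what the release ships vs. what runs in production.
    released = {"pool_size": "10", "timeout": "30"}
    live = {"pool_size": "50", "timeout": "30", "cache": "on"}
    for key, (want, have) in config_drift(released, live).items():
        print(f"{key}: release has {want!r}, production has {have!r}")
```

The same comparison applies to any pair of environments that are supposed to match, such as production and DR.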
Missing deployment pre-requisites in production environment
Sometimes infrastructure procedures differ significantly between pre-production and production environments, particularly when they are managed by different teams. As a result, some components are missing in production or differ from those in pre-production, and the tested application fails to start after being deployed in production.
Production change not applied to DR
Configuration or code is changed on the fly in PROD but is not updated in DR.
Changed file ownership
File ownership or the umask on UNIX is changed, making the program or its libraries inaccessible.
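A small check of owner and permission bits against the expected values can flag this kind of change. The sketch below uses Python's standard os and stat modules on a UNIX system. The path, the expected owner and the expected mode are assumptions that would come from your own baseline.

```python
import os
import stat


def permission_problems(path: str, expected_uid: int, expected_mode: int) -> list:
    """Compare the owner and permission bits of path with the expected values."""
    info = os.stat(path)
    problems = []
    if info.st_uid != expected_uid:
        problems.append(f"{path}: owner uid is {info.st_uid}, expected {expected_uid}")
    actual_mode = stat.S_IMODE(info.st_mode)  # keep only the permission bits
    if actual_mode != expected_mode:
        problems.append(f"{path}: mode is {oct(actual_mode)}, expected {oct(expected_mode)}")
    return problems


if __name__ == "__main__":
    # Hypothetical baseline: this script itself, owned by the current user, mode 0644.
    for problem in permission_problems(__file__, os.getuid(), 0o644):
        print(problem)
```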
Patched JRE in one environment
The JRE is patched in production (security update), but the patch is not applied to the QA or DEV machines.
Logging or Debugging turned on
Logging or debugging is turned on for an investigation and then never turned off, resulting in performance degradation.
Virtual image dissonance
A new virtual host is created from a certain virtual image to add power to an existing server cluster. However, the rest of the servers were created from an older version of the image. As a result, either users have different experiences depending on which server the load balancer routes them to, or compatibility problems arise.
Automated Windows update turned on
The automated Windows update is turned on and pulls updates that create stability issues, degrade performance or introduce security threats.
Upgrade to new version
Certain configuration parameters get different default values when infrastructure software is upgraded to a newer version (e.g. default password expiration in Oracle 11g).
Different network definitions
Network definitions that differ between servers cause some servers to fail while others still work. For example, DNS definitions evolve over time and are frequently updated manually; as a result, discrepancies creep in and lead to inconsistent behavior. Or a server gets a fixed IP address when the standard is DHCP.
Changed Windows service from auto start to manual
The service keeps running properly right after the change, but it does not start after its host is rebooted.
New VM on same ESX
A system admin sets up a new VM on the same ESX host and reduces the amount of CPU/memory allocated to the application's server. This can have a major impact on performance.
Update not consistently distributed
A load-balanced application does not receive an update consistently on all web and app servers.
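Comparing the version reported by each server in the pool makes a partial rollout easy to spot. Here is a minimal sketch, assuming the deployed version of every server has already been collected into a dictionary. The server names and version strings are hypothetical, and treating the most common version as the intended one is a simplification.

```python
from collections import Counter


def out_of_sync_servers(versions: dict) -> dict:
    """Return the servers whose version differs from the most common one."""
    if not versions:
        return {}
    majority, _count = Counter(versions.values()).most_common(1)[0]
    return {server: v for server, v in versions.items() if v != majority}


if __name__ == "__main__":
    # Hypothetical inventory gathered from the web and app servers.
    deployed = {"web1": "2.4.1", "web2": "2.4.1", "app1": "2.4.0"}
    for server, version in out_of_sync_servers(deployed).items():
        print(f"{server} is still on {version}")
```

If a release has reached fewer than half of the servers, the majority rule points at the wrong side, so comparing against the intended release version is the safer choice when that version is known.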
Replacement server misconfigured
A server goes down, requiring a new one to be brought up. The new server is not able to run the software because of a misconfiguration.