
AWS Outage that Broke the Internet Caused by Unintended Change


Last week, many people faced error screens and slow-moving websites, struggling to use some of their favorite apps and online products, when Amazon Web Services (AWS) failed. The long list of companies affected by the AWS outage included Adobe, Atlassian, Business Insider, Docker, Expedia, GitLab, Coursera, Medium, Quora, Slack, Twilio, and the US Securities and Exchange Commission, among many others.

According to a Business Insider report, 54 of the top 100 online commerce sites were taken offline.

E-commerce site response times slowed to a crawl, with sites such as the Disney Store taking 1165% longer to load than usual. The Target site, according to BI, took 991% longer.

The problems stemmed from an outage of Amazon's S3 cloud storage service, which Amazon said experienced "high error rates," particularly on the East Coast, due to a partial failure at one of Amazon Web Services' data centers.

Ultimately, Amazon revealed that an AWS team member had entered a command with a typo in it. Instead of taking just a few cloud servers offline for maintenance, the command took down an entire AWS data center in Virginia, causing the outage.

This wasn’t just inconvenient, but costly: "During AWS' four-hour disruption, S&P 500 companies lost $150 million, according to analysis by Cyence, a startup that models the economic impact of cyber risk. US financial services companies lost an estimated $160 million, the company estimates." (Business Insider)

Amazon's reputation took a further hit, with many sharing bitter commentary about the situation on Twitter.

Risk of Unintended Changes

The cloud is not supposed to be a place where unauthorized changes occur. Yet despite planning and preparation, unauthorized changes still occur, and they represent serious threats to environment stability, security and compliance. While it’s well known that unauthorized or unknown changes are the true root cause of most stability issues, IT still struggles to know what actually changed when trouble starts.

Organizations may implement new policies and processes around their environment operations, but they still lack the ability to enforce them because they lack detailed data. In many cases, it is not even apparent whether those processes and practices are working. Tightening processes also limits agility and flexibility, so over time those same processes will likely be relaxed so that changes can actually be put in place and work can get done.

Detecting unauthorized or unintended changes early on is critical for preventing incidents, accelerating their resolution and mitigating security and compliance risk.

What Caused the Outage with AWS?

According to Amazon, at 9:37 AM PST an authorized S3 team member, using an established playbook, executed a command that was intended to remove a small number of servers for one of the S3 subsystems used by the S3 billing process.

Because a parameter was typed incorrectly, the command ended up taking a much larger set of servers down. While the error originally affected only the S3 billing process, the problem cascaded to several other subsystems that depend on that process.
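To make the failure mode concrete, here is a minimal, purely hypothetical sketch (not Amazon's actual playbook or tooling) of a server-removal step in which one mistyped digit requests far more capacity than intended, and of the kind of minimum-capacity guardrail that could reject such a command before it runs:

    # Hypothetical sketch of a server-removal playbook step (not Amazon's actual tool).
    # A mistyped count ("100" instead of "10") asks for far more capacity than intended;
    # a minimum-capacity guardrail rejects the command instead of executing it.

    MIN_ACTIVE_SERVERS = 500  # assumed capacity floor for the subsystem

    def remove_servers(active_servers, count):
        """Take `count` servers out of rotation, refusing to drop below the floor."""
        if len(active_servers) - count < MIN_ACTIVE_SERVERS:
            raise ValueError(
                f"refusing to remove {count} servers: {len(active_servers)} active, "
                f"floor is {MIN_ACTIVE_SERVERS}"
            )
        return active_servers[:count], active_servers[count:]  # (removed, remaining)

    servers = [f"billing-host-{i}" for i in range(520)]

    remove_servers(servers, 10)        # intended command: fine, 510 servers remain

    try:
        remove_servers(servers, 100)   # mistyped command: rejected by the guardrail
    except ValueError as err:
        print(err)

In its own post-incident summary, Amazon said it would change the tool to remove capacity more slowly and add safeguards that prevent removals from taking a subsystem below its minimum required capacity.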

While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.

A Change-Centric Analytics Solution Can Help

In the Amazon scenario, had Evolven been in place, it would have been monitoring several things that were related to this issue.

Evolven uses patented analytics and machine learning to detect changes carried out in IT environments and then prioritize the changes that pose the greatest risk to performance and stability. So with Evolven, you would finally know all actual changes carried out across the entire environment (automated and manual, planned and unplanned, authorized and unauthorized).
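Evolven's analytics are proprietary, but the basic idea of tracking actual changes can be illustrated with a simple sketch: take periodic snapshots of configuration state and diff them, so that anything that changed shows up whether or not it was planned. The snapshot keys and values below are invented for illustration:

    # Minimal sketch of change detection by diffing configuration snapshots.
    # This shows the general idea, not Evolven's implementation; keys and values are invented.

    def diff_snapshots(before, after):
        """Return added, removed, and modified configuration keys."""
        added    = {k: after[k] for k in after.keys() - before.keys()}
        removed  = {k: before[k] for k in before.keys() - after.keys()}
        modified = {k: (before[k], after[k])
                    for k in before.keys() & after.keys()
                    if before[k] != after[k]}
        return added, removed, modified

    snapshot_before = {"subsystem.billing.active_hosts": 520, "subsystem.index.active_hosts": 480}
    snapshot_after  = {"subsystem.billing.active_hosts": 420, "subsystem.index.active_hosts": 480}

    added, removed, modified = diff_snapshots(snapshot_before, snapshot_after)
    print("Changed since last snapshot:", modified)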

Applying this to Amazon, Evolven would have been able to detect a workload anomaly (storage capacity decreasing at an atypical pace) and likely would have picked up on the configurations that changed as a result of the erroneously entered command. By correlating this information, Evolven could have quickly pointed the IT teams in the right direction, especially when no one immediately knew what was going wrong.
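A workload-anomaly check of the kind described here can be approximated by comparing a metric's latest rate of change against its recent history; the metric, sample values, and threshold below are illustrative assumptions, not Evolven's algorithm:

    # Illustrative workload-anomaly check: flag when a metric moves much faster
    # than its recent rate of change. Metric, data, and threshold are assumptions.

    from statistics import mean, stdev

    def is_anomalous(history, latest, sigma=4.0):
        """Compare the newest delta against the historical deltas."""
        deltas = [b - a for a, b in zip(history, history[1:])]
        newest_delta = latest - history[-1]
        return abs(newest_delta - mean(deltas)) > sigma * max(stdev(deltas), 1e-9)

    # Available storage capacity sampled every minute (arbitrary units).
    capacity = [1000, 998, 1001, 999, 1000, 997, 999]

    print(is_anomalous(capacity, 550))   # True: capacity dropping at an atypical pace
    print(is_anomalous(capacity, 998))   # False: within normal variation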

While most organizations aren't nearly as automated as Amazon, when you consider that a single engineer was basically able to take down a substantial portion of the internet (yes, THE INTERNET), it is impossible not to conclude that IT staff at other organizations could make similar mistakes.

Evolven can provide capabilities that lessen the need for engineers to access systems directly. Any time you remove the human element, you reduce the possibility of situations like the one Amazon experienced.

This is really a case of "if it happened to Amazon it can happen to anyone".

See Evolven in action!
Unlock the power of actual changes. Register now for a live demo.

 

About the Author
Martin Perlin