Amazon Web Services Outage Reveals Critical Lack of Redundancy Across the Internet

Nat Levy | GeekWire | February 28, 2017

The digital snow day is over: Amazon Web Services has fixed the issues with its Simple Storage Service, or S3 for short, that crippled significant chunks of the internet Tuesday. Starting a little after 9:30 a.m. Pacific time and lasting close to five hours, the S3 cloud storage service experienced “high error rates.” The outage knocked out access to a litany of websites and apps that run on AWS, including but not limited to Expedia, Slack, Medium and the U.S. Securities and Exchange Commission. It even temporarily affected the AWS service health dashboard, which displays outages and events.

Amazon has not fully detailed what caused the high error rates. Nick Kephart, senior director of product marketing for San Francisco-based network intelligence company ThousandEyes, monitored the outage throughout the day. He said information could get into Amazon’s overall network, but attempting to establish a network connection with the S3 servers was like hitting a wall that stopped all traffic dead in its tracks. As a result, any site or app that hosted data, images or other information on S3 was affected.

Without access to Amazon’s servers, Kephart couldn’t say why it became impossible to connect with the S3 servers. He said it isn’t clear whether human error, an infrastructure failure, a configuration problem or an automation issue caused the outage. But he theorized that it was a fairly complicated malfunction, given how widely the outage spread. “It wasn’t just the system completely misbehaving but something deeper in the infrastructure that caused these problems,” Kephart said. ThousandEyes also produced this visualization to show the extent of the outage and all the interactions within the AWS network...