Amazon’s Simple Storage Service (S3) outage on Feb. 28 took down many well-known websites and web services. For the complete post-mortem from Amazon Web Services (AWS), read this lengthy explanation of what went wrong and what AWS is doing to address the issue.
If the full explanation is too long or too complicated, here is the short version:
- An administrator was going to perform maintenance on a set of S3 servers.
- He mistyped the command to take those servers offline, and more servers than intended were taken offline.
- That left the entire S3 environment in the US-EAST-1 (Northern Virginia) region with less capacity than the system was designed to run on, which caused widespread availability problems for the web services that relied on it.
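To make the failure mode concrete, here is a minimal sketch of the kind of check that would reject an oversized removal before it runs. It is not AWS's tooling: the function name `check_removal`, the `CapacityError` exception, and the server counts are all hypothetical, chosen only to illustrate the idea of refusing a command that would drop a pool below a minimum.

```python
class CapacityError(Exception):
    """Raised when a removal would leave too few servers in service."""


def check_removal(active: int, to_remove: int, min_required: int) -> int:
    """Return how many servers stay in service, or raise if the removal is unsafe.

    All three arguments are hypothetical inputs: the servers currently in
    service, the number the operator asked to take offline, and the minimum
    the pool needs to keep serving traffic.
    """
    if to_remove < 0 or to_remove > active:
        raise ValueError(f"cannot remove {to_remove} of {active} servers")

    remaining = active - to_remove
    if remaining < min_required:
        # Refuse the whole command rather than removing part of the request.
        raise CapacityError(
            f"removing {to_remove} servers leaves {remaining}, "
            f"below the minimum of {min_required}"
        )
    return remaining


if __name__ == "__main__":
    # The intended maintenance: a small batch, well above the minimum.
    print(check_removal(active=100, to_remove=5, min_required=80))

    # The mistyped command: far more servers than intended.
    try:
        check_removal(active=100, to_remove=50, min_required=80)
    except CapacityError as err:
        print(f"rejected: {err}")
```

With a check like this in front of the command, a typo that inflates the number of servers is rejected instead of executed.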
More instructive and more worrisome are the steps Amazon took to prevent this issue from happening again: