Objecting to storage
Thank you NoPorts for sponsoring
If the Internet is a big computer, Amazon S3 is the hard drive. So what happens when a single typo breaks the Internet's hard drive? On this episode of Fork Around and Find Out we review the S3 outage from 2017. It wasn't that long ago, and yet it seems everyone has forgotten.
Please leave a review
Send us an email to let us know what you thought of this episode
The music was provided by Dave Eddy
Sponsor FAFOFM at https://fafo.fm/sponsor
Individual donations are appreciated at https://ko-fi.com/fafofm
Hey everyone. Justin Garrison here. If this is your first Fork Around and Find Out that you've been listening to,
you're in for a treat. This is a different show. This is not our typical episode.
I've been wanting to do this for a little while, and I've been inspired on this show from other folks that do podcasts,
from people we've had on the show just in general from how I like to learn about things.
And I can't do this show all the time. We're still going to keep the general format for the show.
We love talking to guests. We love bringing them on, understanding the problems they're working on, and the solutions they come up with.
That's going to be the typical show going forward. Autumn's going to be back with me in January.
But for this episode, the closing of our first year of Fork Around and Find Out, I wanted to try something a little bit special.
And if you like true crime podcasts, if you like a little more narrative in some of your podcast episodes,
this one's for you because one of my favorite things is postmortems, and I love reading a good postmortem.
So if you have any, please send them my way.
But one of the things I really love about postmortems is understanding all of the things that aren't in the actual technical review of what happened.
There's always some series of events, a technical failure that happened or something in the process that didn't go as planned.
But that extends much more than just the individual systems that we work on.
And this is my attempt at bringing some of that to you.
When I read a postmortem in my head, I try to put together the full picture of what this may have looked like for the engineer who was dealing with it,
or maybe even the group of people that caused the problem.
So I would love to hear your feedback about this episode.
There's an email me link in the show notes or my blue sky.
Just send us a message, right?
Just let us know what you thought because I want to hear more about not only what you thought of the show,
but also other ideas you might have along this vein.
Again, we're not doing this every time.
This episode took me probably three times as long as our normal episodes do to plan and I just don't have that much time.
But I really wanted to put this out just to put it out in the world because I think something like this should exist.
I want it to welcome new people into technology, to understand the impact of systems and decisions made 10 years ago in a code base
and how that plays out today when something fails.
The music was all created by Dave Eddy, a friend of the show.
You may know him as the You Suck at Programming guy, but he also does some great music; links are in the show notes.
For Christmas and the holidays, there's just one gift that I ask from everyone listening to the show.
There's a link to review this in your show notes.
If you haven't already left a review on the podcast, please click one of those links and leave a review.
We'll put a couple of them in there for various platforms that have review systems and all of them help because it just helps someone else find the show.
And it helps someone else understand what we're trying to do here with welcoming more people into technology to understand that these are humans behind the scene.
That these systems are built by real people and that our decisions in a code base have real outcomes to other people's lives.
Any of the timestamps in this episode are in UTC, the one true time zone, and I tried to stick to the facts as much as possible.
But there are things that we just don't know.
Even though I worked at Amazon, I have no special insights into this outage.
I wasn't working there at the time.
We hope you enjoy it. Have a happy holidays and a wonderful new year.
We'll see you in 2026.
It's an amazing thing, the power and influence and importance of these problems and scary.
Some more players.
And if you had trouble getting on some internet sites yesterday, here's why.
Amazon's powerful cloud service went down for about four hours, from roughly 12:30 to 5:00 p.m.
Thousands of internet services and media outlets were affected.
Amazon has not said what caused the outage, Maria.
It's cold, but not unusually cold.
The temperature was close to freezing, but was starting to warm up by the time they arrived at work.
February 28th, 2017 was an extremely unremarkable Tuesday.
It was the last day of the short month of February, but the majority of the week was still ahead.
The AWS billing system was having a problem.
But not a SEV-1 that required immediate attention or an off-hours page.
So when the engineer arrived at work, they knew they'd have to look into it.
But these types of bugs were notoriously difficult to troubleshoot.
It wasn't that billing was broken. That would have required a SEV-1 page.
But things were slow.
The worst kind of bug.
They had a hunch where the problem might be, not because they were completely familiar with how the billing system worked,
but because they had seen an increase in alerts and slowness on systems billing relied on.
And they had tooling and runbooks to cycle those systems and hope the slowness would go away,
or maybe the team asking for a fix would be satisfied with an attempt.
So after a morning round of office chatter, a stand-up letting people know they'd be looking into the issue,
and a cup of office coffee, they sat down to get started.
They put the free bananas on their desk, wiggled their mouse, and touched their YubiKey to authenticate into their system.
There were a handful of tools that helped them execute common runbooks.
These tools were ancient bash scripts that might as well have been written by the aliens who built the Great Pyramids.
The scripts had evolved from single-line tool wrappers into monstrosities of semi-portable bash.
These executable text files make human evolution from single-celled organisms over millions of years look quaint.
And like humans, they're pretty reliable.
Or at least they are when you give them proper instructions.
Unfortunately, today the instructions were less than proper.
To reduce tool spread and make it easier to stay up to date, the scripts work on a variety of similar systems.
The billing system was the target of today's maintenance.
But like a stormtrooper's aim, it was off the mark.
The key press was so insignificant it didn't need a review.
It was done by an authorized employee from an assigned office desk on company property.
It not only took down AWS production services, it took down more than a dozen high-profile Internet darlings who placed their bets on the cloud.
S3 has a terrible, horrible, no good, very bad day on this episode of Fork Around and Find Out.
Amazon breaks the Internet.
How a problem in the cloud triggered the error message sweeping across the East Coast.
The S3 server farm started showing a high rate of errors.
It's not clear yet what caused all these problems.
Cloud computing service, which went down for more than four hours Tuesday.
Welcome to Fork Around and Find Out, the podcast about building, running, and maintaining software and systems.
Lots of people working in tech remember this day.
Maybe not exactly what happened, but we remember how we felt.
We might have been cloud champions at our companies, and a major four-hour outage was not going to help our case.
Especially because Amazon wouldn't even admit it.
The AWS cloud account on Twitter said,
We are continuing to experience high error rates with S3 in US East One, which is impacting some other AWS services.
They couldn't even say they were having an outage.
Cloud naysayers were warning about dependencies that were already creeping into applications.
This outage was the I-told-you-so moment they needed to convince senior leadership that the budget for server refreshes was a good thing to approve.
The simple fact that Amazon's status page went down with the outage, and updates could only be found via the AWS Twitter account,
didn't reduce the cloud skepticism from late adopters.
How could a single service outage in a single region have such a global impact?
We need to start with what it is before we can talk about what happened.
Sara Jones on Twitter, @onesarajones, said,
Five hours ago, I'd never heard of AWS S3, and yet it has ruined my entire day.
Amazon's S3 server is responsible for providing cloud services to about 150,000 companies around the world.
If you're a longtime listener of this podcast, I'm sure you know what S3 is.
But stay with me for a minute because we're going to go a bit deeper.
S3 stands for Simple Storage Service, and it was one of the groundbreaking innovations of the cloud.
Before S3, there were only two types of storage commonly deployed.
There were blocks and there were files.
Blocks were a necessary evil.
They're the storage layer that low-level systems need in order to access bits.
Your operating system needs blocks, but your application usually doesn't.
Applications generally need files.
Files are great until they're not.
Files have actions applications can perform like read, write, and execute,
but they also have pesky things like ownership, locks, hierarchy,
and requirements to have some form of system to manage the files.
Call it a file system.
This was almost always something locally accessible to the application,
or at least something that appeared local.
NFS, or the network file system, was pivotal for scaling file storage
by trading network latency for large storage systems.
Pooling a bunch of computers to appear like one large resource wasn't new,
but usually they were limited to POSIX constraints
because the systems accessing them had to pretend they were local resources.
Databases were another option for removing the need for files.
Store all the stateful stuff somewhere else.
Remote connectivity and advanced query syntax sounds like a great option,
but those queries aren't always the easiest thing to work with.
Connection pooling becomes a problem,
and you need to understand what data you have and how it's going to be used
before you really start storing it.
Object storage didn't have those trade-offs.
Object storage doesn't have a hierarchy.
It doesn't need a file system, and it's definitely not POSIX constrained.
There's no SQL queries or connection pooling, just verbs like get and put.
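To make that concrete, here's a tiny sketch of what talking to object storage looks like in Python with boto3. The bucket and key names are made up, and it assumes boto3 is installed and AWS credentials are already configured; it's just to show how small the API surface is.

    import boto3

    s3 = boto3.client("s3")

    # PUT: hand the service some bytes under a key. No directories, no
    # file handles, no locks.
    s3.put_object(Bucket="example-bucket", Key="episodes/s3-outage.mp3", Body=b"...bytes...")

    # GET: ask for those bytes back by the same key.
    obj = s3.get_object(Bucket="example-bucket", Key="episodes/s3-outage.mp3")
    data = obj["Body"].read()

That's basically the whole mental model: keys in, bytes out.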
It's one of the few trillion-dollar ideas that the cloud has created.
Well, it didn't create the idea of object storage,
but it definitely made it a viable option for millions of users around the world,
and S3 is the standard for object storage.
And on February 28, 2017, it broke.
@netgarun on Twitter said,
Happy AWS Appreciation Day, Internet.
S3 isn't just another AWS service.
It's a foundational piece of the entire AWS empire.
You can usually tell how important services are to AWS
by how many nines they assign to their availability SLAs.
Route 53 is the only AWS service that guarantees a 100% SLA.
As a matter of fact, I think it's the only SaaS that exists with 100% SLA.
It's pretty important.
There's only a handful of Amazon's 200 services that have four nines of availability,
and S3 is one of them.
This level allows only about four minutes of downtime per month,
or roughly 52 minutes for the entire year.
You can't finish a single episode of Stranger Things
in the amount of time S3 would get for downtime in a year.
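If you want to check my math, here's a quick back-of-the-envelope you can run yourself. The function name is just something I made up for this sketch.

    # Downtime budget implied by an availability target of N nines.
    def downtime_budget_minutes(nines: int) -> tuple[float, float]:
        unavailability = 10 ** -nines          # 4 nines -> 0.0001
        per_year = unavailability * 365 * 24 * 60
        return per_year, per_year / 12

    per_year, per_month = downtime_budget_minutes(4)
    print(f"{per_year:.1f} min/year, {per_month:.1f} min/month")  # ~52.6 and ~4.4

Eleven nines of durability, by comparison, is a different budget entirely: it's about how unlikely you are to lose an object, not whether you can reach it.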
S3's availability SLA is often confused with its durability SLA of 11 nines,
a ridiculous number to ponder,
and an odd form of SRE peacocking that somehow sounds more impressive than 100%.
But even four nines of availability takes more than a few servers and a load balancer to achieve.
The backend of a storage system that stores this much data this reliably has a lot of components.
The service obviously has a web front-end.
It has authentication, and there's millions of disks to store the data.
But we're not going to focus on those parts.
Instead, we're going to look at how S3 gets and puts objects into the system.
The core idea of making S3 scalable is sharding data across many different hard drives.
Amazon does this with a lot of different services,
and they call their method of sharding data shuffle sharding.
That makes it very difficult for us to scale,
because we have to think about what happens if customers become unbalanced over time.
So we do something a little bit different on S3.
And it's actually an example of a pattern that we talk about often in AWS called shuffle sharding.
The idea of shuffle sharding is actually pretty simple.
It's that rather than statically assigning a workload to a drive
or to any other kind of resource, a CPU, a GPU, what have you,
we randomly spread workloads across the fleet, across our drives.
So when you do a put, we'll pick a random set of drives to put those bytes on.
Maybe we pick these two.
The next time you do a put, even to the same bucket, even to the same key,
it doesn't even matter, right?
We'll pick a different set of drives.
They might overlap, but we'll make a new random choice
of which drives to use for that put.
Shuffle sharding is basically a predictable way for you to assign resources
to a customer and spread those resources
so two customers don't keep getting grouped together.
If Disney, Netflix, HBO, and Peacock are all customers
and they each get two servers, shuffle sharding tries to make sure no two customers end up with the exact same pair of servers.
Disney and Netflix might be co-located on one server,
but Netflix's second server would be shared with HBO instead of Disney again.
This allows for things like the Disney and Netflix server to go down,
but Netflix still has some availability.
Or if HBO is a noisy neighbor and consumes all the resources on the second server,
Netflix still has a server that's not overloaded.
This spreads the load between customers.
Assuming those customers don't buy each other in multi-billion dollar acquisitions,
but that's not really Amazon's problem.
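Here's roughly what that customer-to-server assignment looks like as a sketch in Python. The server names, shard size, and the trick of seeding the random generator with the customer name are all my own illustration, not how S3 actually does it.

    import random

    SERVERS = [f"server-{i}" for i in range(8)]

    def shard_for(customer: str, shard_size: int = 2) -> list[str]:
        # Deterministically pick a pseudo-random set of servers per customer,
        # so any two customers are unlikely to land on the exact same set.
        rng = random.Random(customer)
        return rng.sample(SERVERS, shard_size)

    for customer in ["Disney", "Netflix", "HBO", "Peacock"]:
        print(customer, shard_for(customer))

With enough servers, the odds that two customers overlap on their entire shard get vanishingly small, which is the whole point.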
In order to do this for large storage systems,
you need to have services that store metadata about where blobs are stored.
This service is in the critical path for putting data into the system.
Because if you can't keep track of where the data went,
you might as well just delete it.
Of course, S3 writes data in multiple places throughout multiple data centers.
The details don't matter for this outage,
but what does matter is that there are two critical services inside S3:
the placement service, which decides where new objects get stored,
and the index service, which keeps track of the metadata and location of everything in the system.
New puts go through the placement service,
and every request, get, list, put, and delete, goes through the index service.
People usually put data before they get it,
but not all customers know what they're doing.
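If that's hard to picture, here's a toy model of those two metadata paths in Python. It's a deliberately silly simplification, the names and the three-way replication are just placeholders, but it shows why the index sits in the critical path for basically everything.

    import random

    DRIVES = [f"drive-{i}" for i in range(12)]
    index: dict[str, list[str]] = {}   # key -> which drives hold the object

    def put(key: str, data: bytes) -> None:
        drives = random.sample(DRIVES, 3)   # "placement": choose drives for this put
        # ...actually write the bytes to each drive here...
        index[key] = drives                 # "index": remember where the data went

    def get(key: str) -> list[str]:
        return index[key]                   # reads, lists, and deletes all need the index

    put("my-bucket/cat.jpg", b"...")
    print(get("my-bucket/cat.jpg"))

Lose the index and you can't find anything; lose placement and you can't accept anything new. On February 28th, both lost a huge chunk of their capacity at once.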
But the important thing here is that we chunk up the data that you send us
and we store it redundantly across a set of drives.
Those drives are running a piece of software called Shardstore.
So Shardstore is the file system that we run on our storage node hosts.
Shardstore is something that we've written ourselves.
There's public papers on Shardstore, which James will talk about in his section.
But at its core, it's a log-structured file system.
The internal names for S3 services are not as boring as index and placement.
They have, or at least at one time they had,
cool quirky names like R2D2, Death Star, Mr. Biggs, PMS, and Cramps.
But the postmortem would have had a lot more to explain
if they used the internal names to describe these systems.
After the break, we'll talk about what happened on February 28th, 2017.
On the morning of February 28th, an engineer went to scale down
a small set of servers used by the billing system.
They were using a runbook, which included a pre-approved script,
to manage the scale-down process.
Unfortunately, they mistyped the command.
We've all done it, and some of us have even taken down production with our mistype.
This mistype took down a lot of productions.
I envision it like a bash script with a couple of flags
and some arguments that need to go in a certain order.
And if you change that order, it might do something different.
I don't know the details, but we can see how this would happen.
Instead of removing a few servers for the billing service,
they removed more than 50% of the servers for the index and placement services.
Amazon's service disruption report was posted two days after the outage.
It says, removing a significant portion of the capacity
caused each of these systems to require a full restart.
It's unknown why this amount of capacity required a full restart.
You would think the service would just be slow,
or only specific customers would be affected.
After all, Amazon talks about how great shuffle sharding is.
But for some reason, these services were not configured this way,
or at some level of capacity, you had to shut everything down.
They go on to say,
S3 subsystems are designed to support the removal or failure of significant capacity
with little or no customer impact.
We build our systems with the assumptions that things will occasionally fail,
and we rely on the ability to remove and replace capacity
as one of our core operational processes.
While this is an operation that we have relied on to maintain our systems,
since the launch of S3,
we have not completely restarted the index subsystem
or the placement subsystem in our larger regions for many years.
And again, this is 2017.
Many years, in my opinion, means more than three,
and S3 launched back in 2006.
So these services had gone a long time, possibly most of S3's existence, without a complete restart.
But why does any of this matter?
S3 is just a storage mechanism.
Who cares?
Who cares if some files don't get put into storage and retrieved from that storage?
Well, when a service is this critical, other Amazon services get built on it.
So while S3 was down, other Amazon services also went down.
Things like EC2 instances.
You could not launch a new VM in AWS without S3.
New Lambda invocations couldn't scale up because those Lambda functions were stored in S3.
EBS volumes that needed snapshot restores, those also come from S3.
Different load balancers within AWS, S3.
There were so many services that cascaded down into failure because S3 objects were not available.
NPR called this Amazon's $150 million typo.
Because while S3 was down, it's estimated that companies in the S&P 500
lost $150 million in market value.
But that's still not the complete picture here.
So many other companies lost productivity.
This isn't just stock trading.
A bunch of companies couldn't do work.
The people they hired just sat around for hours.
An internet monitoring company, Apica, found that over half of the top 100 online retailers saw their site performance slow by 20% or more.
Half of the biggest sellers on the internet had slow websites for the day.
And the CEO of Catchpoint estimated that the total ramifications of this one typo would be in the hundreds of billions of dollars in lost productivity.
People are sitting around at work and they can't do their work.
So many websites were down for four hours in an Amazon outage that the whole world felt the impact.
S3 may just be storing objects at an HTTP endpoint.
But so many things we rely on rely on things being available.
This is why Spotify and various other companies were down: the music files that you stream,
when you trace them all the way back to where they come from, come from storage somewhere.
Like it's still a file that exists somewhere on a hard drive in a data center.
It just so happened to go through the most convoluted, complex system to fetch a file.
Yes, there are plenty of other ways to spread files across the internet.
CDNs are widely used, but CDNs are expensive.
You have to balance how much money you're paying for fast, reliable, globally replicated data versus slow things that occasionally get served up.
S3 has a pretty good pricing model for that middle ground: it's available, it has 11 nines of durability,
so it's a reasonable place to store things and trust that they're going to be there for a very long time.
And of course S3 has tiering systems and all this other stuff that make it even easier to just throw data at it and leave it there for who knows how long.
But when all of a sudden those files aren't available, you start realizing that all of your applications need files.
Not just your local applications. The websites you're using have files behind the scenes.
Half of Apple's iCloud services were down.
This includes issues with the App Store and iCloud backup and Apple Music and Apple TV.
All of these services store files at the end of the day.
And that's why something like S3 as an object storage is so critically important to the functioning of the internet.
When it really comes down to it, the internet is basically just a bunch of remote file servers in different ways to access those files.
Web pages are files, music is files, streaming services are files. All files.
And when the files go away for four hours, turns out there's not a lot of stuff that people are able to do.
And as Amazon slowly recovered these services, they got to the point where they realized they had to turn things off.
There was too much traffic going to the placement service and the index service,
and there just weren't enough servers there. Even trying to scale them up wasn't going to work, because you can add more servers to the pool,
but it's going to be difficult for the load balancer to add them in, health check them, and do everything else. At some point you just kind of have to reset the whole system.
Remember back in the day with Windows XP?
This was like back in time when we used to restart our computers regularly.
At the end of the day, you would go home from work and you'd actually shut down your computer.
There was a big shift in how you used your computer when Windows Vista and Windows 7 came out and sleep kind of worked reliably.
And we didn't have to shut down our computer every day.
But in the morning when you got to work and you turned on your computer, even though the computer was on and maybe even at the login screen,
it still needed like 10 more minutes to be ready to use.
Imagine that but for a quarter of the internet.
The blast radius of S3 was something unseen before this time.
Casey Newton from The Verge wrote this article which might as well be cloud poetry.
He says,
Is it down right now?
A website that tells you when websites are down.
Is down right now?
With Is It Down Right Now Down, you will be unable to learn what other websites are down.
At least until it's back up.
At this time, it's not clear when Is It Down Right Now will be back up.
Like many websites, Is It Down Right Now has been affected by the partial failure of Amazon's S3 hosting platform.
Which is down right now.
While we can't tell you everything that is down right now,
some things that are down right now include Trello, Quora, If This Then That, and ChurchWeb,
which built your church's website.
For other outages, you would be able to tell that these websites were down by visiting Is It Down Right Now.
But as we mentioned earlier, Is It Down Right Now is down right now.
This post will be updated when Is It Down Right Now is up again.
Third party websites were not the only thing that went down.
From the service disruption report Amazon said,
From the beginning of this event until 7:37 p.m. UTC, we were unable to update the individual services' status on the AWS Service Health Dashboard, or SHD,
because of a dependency the SHD administration console has on Amazon S3.
Specifically, the Service Health Dashboard was running from a single S3 bucket hosted in US East One.
Amazon was using the at AWS Cloud Twitter account for updates.
The account said,
The dashboard not changing color is related to the S3 issue.
See the banner at the top of the dashboard for updates.
Granted, this was before most of the bots and Nazis took over Twitter.
But it was still an embarrassment for AWS to have to resort to this for its announcements.
Christopher Hansen said,
Apparently, AWS is pretty important to and for the proper functioning of the web.
John Battelle said,
You never realize how important AWS is until it's down.
Unfortunately, Azure can't say the same.
Cassidy, who we will have on the podcast, said at the time,
Oh man, AWS S3 buckets are down.
Hashtag Amazon.
Let's not forget, AWS had 15 other regions in 2017.
But then, like now, US East One was more used than any other region.
Maybe even more than all the other regions combined.
This was still pretty early cloud days for a lot of companies.
Amazon leadership would still go to large potential customers and conferences.
Remember those?
And share the amazing benefits of the cloud.
It just so happened that at that very moment,
AWS was having one of its largest outages in history.
Adrian Cockcroft, a recent VP hire from Netflix,
was on stage talking about the many benefits of AWS's scale and reliability.
So what happened after the outage?
Websites all over the country were affected, about 148,000 websites.
They are putting out updates from a verified account, and the stock's down half a percent.
They believe they understand the root cause and are working hard at repairing it.
Future updates will all be on the dashboard.
So as we watch that, we will try and keep you updated on any new developments.
But it apparently is affecting millions.
It's important to note that Amazon never calls these things outages.
The official AWS cloud Twitter account called it a high rate of errors.
For the rest of us, that just means it's an outage.
It was mocked relentlessly.
But Amazon put their fingers in their ears and said la la la until things blew over.
Loup Ventures called the disruption a temporary black eye for Amazon.
Customers would not go through the hassle of switching to a competing cloud service
because of a one-time event, he said.
Amazon chose to ignore this outage.
Like many outages before it, and many that have come since.
Beyond the couple days it had in the spotlight,
there were lots of private apologies to customers,
a bunch of bill reimbursements for SLAs that were broken, and that's about it.
Two months later, during Amazon's quarterly earnings report,
the outage didn't even get a mention.
As a matter of fact, Amazon stock barely noticed.
A small dip the day it happened, and then a continued march to record growth.
What a big player Amazon is on the internet with their cloud services.
I think like a third of the world's cloud services is operated by Amazon.
They didn't break the internet, but they certainly brought it to a slow,
not a standstill yesterday, but a large number of people were affected.
So in some perverse way, we see the stock moving higher today.
Maybe this is a recognition of, wow, their web services is really big.
Yes, and how big a company they are, how important they are on the internet.
I think it's a tremendous amount of revenue.
It's not just the online buying site.
This web services division is huge.
Amazon said in the disruption report,
we are making several changes as a result of this operational event.
While removal of capacity is a key operational practice,
in this instance, the tool used allowed too much capacity to be removed too quickly.
We have modified this tool to remove capacity more slowly
and added safeguards to prevent capacity from being removed
when it will take any subsystem below its minimum required capacity.
Meaning, in the past when someone ran this bash script,
it would let you scale to zero if you wanted.
The script trusted that the human had the context to know what they were doing
and not make any mistakes.
But we all know that's not really how this works.
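Here's a hedged sketch of the kind of guard Amazon describes adding. The subsystem names, minimums, and batch limit are all invented for illustration; the real tooling is internal and we've never seen it.

    # Refuse any removal that would drop a subsystem below its minimum,
    # and only ever remove a small batch at a time.
    MIN_CAPACITY = {"index": 40, "placement": 20, "billing": 4}

    def remove_capacity(subsystem: str, current: int, to_remove: int, max_batch: int = 2) -> int:
        if to_remove > max_batch:
            raise ValueError(f"refusing: remove at most {max_batch} servers per run")
        remaining = current - to_remove
        if remaining < MIN_CAPACITY[subsystem]:
            raise ValueError(f"refusing: {subsystem} would drop to {remaining}, below its minimum")
        return remaining

    print(remove_capacity("billing", current=10, to_remove=2))  # fine, leaves 8

The 2017 version of the tool effectively had neither check; it took whatever number it was handed and went to work.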
This is an important lesson for a lot of people to learn.
Just because a mistake was made doesn't mean there's blame.
This script allowed the command to be run.
Was the mistake caused by the person who entered the command
or the person who committed the code?
The answer is neither.
There isn't a person to blame.
There's a system and there's consequences.
Everyone is responsible for the safety of people and systems,
and failures are only root-caused to the point where some measurement hit a threshold.
But the events that led up to that threshold being agreed upon,
or that measurement being tracked at all, play a role too.
The entire system of people and processes are to blame when things go bad
and are to be praised when they function as intended.
Amazon further says we are also auditing our other operational tools
to ensure we have similar safety checks.
We will also make changes to improve the recovery time of the key S3 subsystems.
We employ multiple techniques to allow our services to recover from any failure quickly.
One of the most important involves breaking services into small partitions
which we call cells.
We could further reduce the blast radius by creating smaller boundaries
or cells and having a copy of our application in each of those boundaries.
Now what we have done here is, if there's an issue that happens,
it will only be isolated within this boundary or within this cell,
reducing the blast radius and containing the failure to be only within a defined boundary.
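For what it's worth, the mechanics of a cell are not exotic. Here's a minimal sketch of just the routing piece; the cell names and the hashing choice are mine, purely to illustrate the idea of pinning a customer to one blast-radius boundary.

    import hashlib

    CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]

    def cell_for(customer_id: str) -> str:
        # Same customer always lands in the same cell, so a failure in one
        # cell only touches the customers pinned to it.
        digest = hashlib.sha256(customer_id.encode()).hexdigest()
        return CELLS[int(digest, 16) % len(CELLS)]

    print(cell_for("acme-corp"))

The hard part isn't the hash, it's making sure every dependency inside a cell actually stays inside it.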
Cells are a term to make you feel bad about your architecture.
To make the cloud seem cooler than your data center
and to pretend an architectural improvement will prevent the systemic failure of leadership.
If you enjoyed this episode and want to hear more please let us know.
We'll have our regular interviews starting up again in 2026.
Until then have a happy holiday and may your pagers stay silent.
Thank you for listening to this episode of Fork Around and Find Out.
If you like this show please consider sharing it with a friend, a co-worker, a family member or even an enemy.
However we get the word out about this show, it helps it become sustainable for the long term.
If you want to sponsor this show please go to fafo.fm slash sponsor
and reach out to us there about what you're interested in sponsoring and how we can help.
We hope your systems stay available and your pagers stay quiet.
We'll see you again next time.
Thank you.