
A massive AWS outage Monday that brought down some of the world’s most popular apps and services all started with a glitch.
The bug – which occurred when two automated systems were trying to update the same data simultaneously – snowballed into something significantly more serious that Amazon’s engineers scrambled to fix, the company said Thursday in a postmortem assessment.
The massive cloud service’s outage meant people couldn’t order food, communicate with hospital networks, access mobile banking, or connect with their security systems and smart home devices. Major global companies, including Netflix, Starbucks and United Airlines, were temporarily unable to give customers access to their online services.
“We apologize for the impact this event caused our customers,” Amazon said in a statement on the AWS website. “We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.”
At a high level, the issue stemmed from two programs competing to write the same DNS entry – essentially a record in the internet’s phonebook – at the same time, which resulted in an empty entry. That threw multiple AWS services into disarray.
“The analogy of a telephone book is pretty apt in that the folks on the other line are there, but if you don’t know how to reach them, then you have a problem,” Angelique Medina, head of Cisco’s ThousandEyes Internet Intelligence network monitoring service, told CNN. “And that telephone book effectively went poof.”
Indranil Gupta, a professor of electrical and computer engineering at the University of Illinois, used a classroom analogy to explain Amazon’s technical analysis in an email to CNN. Say two students, one a fast worker and the other a slower one, are asked to collaborate on a shared notebook.
The slower student “pays attention in brief bursts, but their work may conflict or contradict the work of the faster student,” he wrote. At the same time, the quicker student may be “trying to constantly ‘fix’ things quickly” and delete the slower student’s work because it’s outdated.
“The result… an empty page (or crossed out page) in the lab notebook, when the teacher comes and inspects it,” he wrote.
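For readers who want to see the mechanics, the notebook analogy can be sketched as a toy program. This is a simplified illustration, not Amazon’s actual code: the names `DnsRecord`, `apply_plan`, `clean_up` and `run`, the plan numbers and the addresses are all hypothetical, and the version check shown is a generic guard against stale writes, not necessarily the fix Amazon is making.

```python
# Toy model of the race described above. All names, plan numbers and
# addresses are hypothetical; this is not Amazon's actual code.

class DnsRecord:
    def __init__(self):
        self.plan_id = None   # which plan the live entry came from
        self.addresses = []   # the "phonebook" entry itself


def apply_plan(record, plan_id, addresses, check_version=False):
    """Write a plan into the record. With check_version, refuse stale plans."""
    if check_version and record.plan_id is not None and plan_id < record.plan_id:
        return False  # an older plan must not overwrite a newer one
    record.plan_id = plan_id
    record.addresses = list(addresses)
    return True


def clean_up(record, newest_plan_id):
    """Delete anything older than the newest plan, even if it is live."""
    if record.plan_id is not None and record.plan_id < newest_plan_id:
        record.plan_id = None
        record.addresses = []


def run(check_version):
    record = DnsRecord()
    # The fast writer applies the newer plan 2.
    apply_plan(record, 2, ["10.0.0.2"], check_version)
    # The slow writer wakes up and applies the older plan 1 on top of it.
    apply_plan(record, 1, ["10.0.0.1"], check_version)
    # The fast writer then cleans up every plan older than 2.
    clean_up(record, newest_plan_id=2)
    return record.addresses


print(run(check_version=False))  # [] -- the entry ends up empty
print(run(check_version=True))   # ['10.0.0.2'] -- the stale write is rejected
```

Without the check, the slow writer’s outdated plan becomes the live entry and the cleanup then erases it, leaving the “empty page.” With the check, the outdated write is refused and the cleanup has nothing to delete.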
That “empty page” brought down AWS’ DynamoDB database, creating a cascading effect that impacted other AWS services like EC2, which offers virtual servers for developing and deploying apps, and Network Load Balancer, which distributes incoming traffic across servers. When DynamoDB came back online, EC2 tried to bring all of its servers back online at once and couldn’t keep up.
Amazon is making a number of changes to its systems following the outage, including fixing the “race condition scenario” that caused the two systems to overwrite each other’s work in the first place and adding another test suite for its EC2 service.
Outages like Monday’s, while rare, are just a reality, Gupta said. But what matters is how such issues are addressed.
“Large scale outages like this, they just happen. There’s nothing you can do to avoid it, just like (how) people get ill,” Gupta told CNN over the phone. “But I think how the company reacts to the outages and keeps customers informed is really, really key.”