A bug in DynamoDB automation caused the latest major AWS outage

Amazon has published a fairly detailed report on the massive outage of Amazon Web Services (AWS) that took place on October 20. The outage, which left numerous websites, applications and games inaccessible, originated in a bug in the automation software for DynamoDB, the database platform where AWS customers store data. Recall that Amazon's cloud services power much of the Internet, running on hardware that includes NVIDIA H100 GPUs and custom AI chips designed by Amazon itself.

The report explains that this initial failure triggered cascading problems in other Amazon Web Services systems that depend on DynamoDB. The DynamoDB automation, which manages hundreds of thousands of DNS records, is designed to resolve any incident automatically. This time, however, it did not.

A chain failure that required manual intervention

According to the company, an error in DynamoDB's DNS management system caused an empty DNS record to be generated for Amazon's data centers in Northern Virginia. The problem was exacerbated when the automatic repair mechanism, which should have corrected this error, failed. This forced Amazon staff to fix the problem manually. While the issue was active, every system that needed to connect to DynamoDB was unable to do so and experienced DNS failures.
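
To see why an empty DNS record is so damaging, consider the following minimal Python sketch. It is not Amazon's code and makes no claim about how DynamoDB's DNS management works internally. The in-memory `zone` dictionary, the `apply_dns_update` function and the hostname `db.example.internal` are all hypothetical, introduced only to illustrate one common safeguard: refusing any update that would leave a name with no addresses at all.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for a DNS zone: hostname -> list of IP addresses.
Zone = Dict[str, List[str]]


def apply_dns_update(zone: Zone, name: str, new_ips: List[str]) -> bool:
    """Apply an update only if it leaves the name with at least one address.

    Returns True if the update was applied, False if it was rejected.
    """
    if not new_ips:
        # An empty record set would make the name unresolvable,
        # so keep the previous addresses instead.
        return False
    zone[name] = list(new_ips)
    return True


# Illustrative data only; these addresses are made up.
zone: Zone = {"db.example.internal": ["10.0.0.1", "10.0.0.2"]}

# An empty update is rejected and the old addresses stay in place.
applied = apply_dns_update(zone, "db.example.internal", [])
```

In this sketch a rejected update leaves clients pointing at the old addresses, which may be stale but can still be resolved. With an empty record there is nothing to resolve, which matches the DNS failures described in the report.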

The list of affected services was extensive and included Amazon itself, its Alexa devices, Bank of America, Reddit, Canva, Snapchat, Disney+, Apple Music, Fortnite, PlayStation and Lyft, among many others. As Engadget reported, the outage even affected connected products such as smart beds, which connect to the Internet to adjust their temperature. Some of these services were slow to respond, while others were completely inaccessible. It was a painful reminder of how much these giants depend on the cloud.