AWS Keeps Crashing and Everyone Blames AI—But That's Not the Real Story
Athena Character @ openart.ai
WARNING: Hot take incoming: The problem isn’t AI. It’s you.
Last week, AWS us-east-1 went down for 16 hours and took half the internet with it. Fortnite died. Banks went offline. Social media imploded. And the hot takes? Oh, the hot takes were chef’s kiss. Everyone’s pointing fingers at Amazon for replacing engineers with AI, and yeah, that’s probably happening. But can we talk about the elephant in the data center?
You’re still running production workloads in us-east-1.
After years of this region face-planting more often than a drunk giraffe, you’re shocked—SHOCKED—that it happened again. AWS literally uses us-east-1 as their testing ground for new features. It’s the biggest region, has the most availability zones, and gets the most outages. This isn’t a secret. This is documented, well-known, “everyone in the industry nods knowingly” common knowledge.
And yet here we are. Again.[1]
Let me be crystal clear about us-east-1’s track record: November 2020 (Kinesis failure), December 2021 (major service event), June 2023 (Lambda issues), July 2024 (Kinesis Data Streams), and now October 2025 (DynamoDB DNS catastrophe). This isn’t bad luck. This is a pattern. And you’re still surprised?
The AI Blame Game Is Missing the Point
(Yes, Amazon probably fired too many engineers. No, that’s not why you’re offline.)
Will Lockett’s piece on Amazon’s AI obsession is spot-on about one thing: replacing critical infrastructure engineers with hallucinating chatbots is a spectacularly bad idea. The “Just Walk Out” grocery store disaster should’ve been a warning. But blaming this outage entirely on AI is like blaming your hangover on the last shot when you drank the whole bottle.
The October 20, 2025 outage? That was a DNS race condition in DynamoDB that cascaded for 6 hours, followed by another disruption on October 29.[2] The root cause was a “latent defect” where two automated systems tried to update the same DNS data simultaneously, creating an empty DNS record. This prevented applications from resolving API names to IP addresses, taking down 140 AWS services including DynamoDB, EC2, and S3. The estimated damage? Between $38 million and $581 million in insured losses alone, with some analysts suggesting the global economic impact could hit hundreds of billions.[3]
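To make that failure mode concrete, here’s a toy sketch in Python of the classic check-then-act race described above. This is emphatically not AWS’s automation: the fake record store, the “enactor”/“cleanup” names, and the sleeps that force the bad interleaving are all invented for illustration.

```python
import threading
import time

# Fake in-memory "DNS record store" shared by two automation workers.
records = {"dynamodb.us-east-1.example": ["10.0.0.1"]}

def enactor(plan: str, think_time: float) -> None:
    # Check: read the current record...
    snapshot = records["dynamodb.us-east-1.example"]
    time.sleep(think_time)  # ...and the other worker acts inside this window.
    # Act: write based on a now-stale view of the world.
    if plan == "rotate":
        records["dynamodb.us-east-1.example"] = ["10.0.0.2"]
    elif plan == "cleanup" and snapshot == ["10.0.0.1"]:
        # Cleanup decided the record it READ was obsolete and empties it,
        # clobbering the fresh value the other worker just wrote.
        records["dynamodb.us-east-1.example"] = []

rotate = threading.Thread(target=enactor, args=("rotate", 0.0))
cleanup = threading.Thread(target=enactor, args=("cleanup", 0.2))  # sleep forces the bad interleaving
rotate.start(); cleanup.start()
rotate.join(); cleanup.join()

print(records)  # {'dynamodb.us-east-1.example': []} -> an empty record nothing can resolve
```

Scale that empty record up to a regional DynamoDB endpoint and you get roughly the blast radius described above.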
Could understaffing have made it worse? Absolutely. But the root cause isn’t AI—it’s architectural fragility and your refusal to architect for failure.
Here’s what nobody wants to admit: the AWS Well-Architected Framework tells you to go multi-region for critical apps.[4] You know why most companies don’t? Because multi-region deployments are expensive as hell, especially after the latest cloud price hikes. So instead, you gamble on us-east-1 staying up and act surprised when the house wins.
Let’s Talk About GCP (The Uncomfortable Comparison)
(Spoiler: The grass isn’t greener, it’s just differently brown)
“But Athena,” you’re thinking, “should we just switch to Google Cloud?” Oh honey. Let me show you something that’ll make you reconsider:
Cloud Provider Incidents (2024-2025)
Here’s the plot twist nobody’s talking about: GCP had 78 incidents in 2024-2025 compared to AWS’s 38. But wait, it gets better (or worse, depending on your perspective):
| Cloud Provider | Total Incidents | Avg Duration | Total Downtime | Verdict |
|---|---|---|---|---|
| AWS | 38 | 1.5 hours | 57 hours | Fast recovery, but recent big ones hurt |
| GCP | 78 | 5.8 hours | 452 hours | More frequent AND longer outages |
| Azure | 9 | 14.6 hours | 131 hours | Rare but brutal (one 50-hour nightmare) |
So AWS recovers faster when things break, but they’ve been breaking more spectacularly lately. GCP breaks more often AND takes longer to fix. Azure barely breaks but when it does, grab some popcorn because you’re in for a show.[5]
According to Parametrix’s 2024 Cloud Outage Risk Report, Google Cloud experienced a 57% increase in downtime hours compared to 2023, while AWS remained statistically the most reliable despite those headline-grabbing October outages.[6]
And yes, GCP had its own spectacular meltdown in June 2025 that crashed Gmail, Spotify, and a bunch of other services globally. The culprit? Automation failure. Sound familiar?[7]
So no, GCP isn’t your savior. They’re dealing with the same scaling pressures, the same AI infrastructure boom, and the same human error problems (which caused 68% of all cloud outages in 2024, by the way).
The Real Scandal: You Know Better and Do It Anyway
(The part where we all look at our own architecture diagrams)
Here’s the part that makes me want to flip tables: you know us-east-1 is unreliable. Every architect knows. Every DevOps engineer knows. Every CTO who’s been through one of these outages knows. And yet, when I look at where production workloads live, it’s still overwhelmingly us-east-1!!
And before you say “but GCP is better,” let me stop you right there. GCP had 78 incidents in 2024-2025 versus AWS’s 38. Their average outage lasts 5.8 hours compared to AWS’s 1.5 hours. The math isn’t mathing in GCP’s favor, folks.
Why? Let me guess:
- “It’s cheaper.” Until it’s not. Calculate the cost of 16 hours offline and get back to me.
- “We need low latency for East Coast users.” Cool. Ever heard of us-east-2? It’s right there. In Ohio. Still on the East Coast.
- “Multi-region is too complex.” Translation: “We don’t want to do the work.”
- “Our customers are in that region.” Some are. Not all. And guess what? They’d rather have slightly higher latency than no service at all.
The AWS Well-Architected Framework literally has an entire pillar on reliability.[4] It tells you to design for failure, use multiple availability zones, and go multi-region for critical workloads. But that costs money and requires actual engineering effort, so instead we get this performative outrage every time us-east-1 goes down.
The Cost of “Good Enough” Architecture
(Math time, but I promise it’s not painful math)
Let’s do some napkin math on the October 2025 outages:
- 2,000+ businesses impacted (conservative estimate)
- Insured losses alone: $38-581 million
- Global economic impact: Potentially hundreds of billions when factoring in lost sales, idle workers, and reputational damage[3]
- Services affected: 140 AWS services went down, including major platforms like Fortnite, Roblox, Snapchat, Reddit, Slack, Coinbase, Robinhood, and even Amazon’s own services
Now compare that to the cost of proper multi-region architecture:
- Data transfer costs: 2-5% increase in cloud spend
- Engineering time: One-time investment, ongoing maintenance
- Complexity: Higher, but manageable with modern tools
The ROI on resilience is staring you in the face, and you’re still choosing to roll the dice on us-east-1 because the upfront cost feels scary. Want to get better at calculating these trade-offs? Our guide on business analytics can help you build the case for infrastructure investment.
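If you want that argument in numbers rather than vibes, here’s the same napkin math as runnable arithmetic. The $100k/month cloud bill is an assumption I picked to make the example concrete; the 2-5% premium and the $300,000/hour downtime figure come from elsewhere in this piece, so swap in your own numbers.

```python
# Back-of-envelope resilience ROI. Every input here is an estimate;
# replace them with your own bill and your own downtime cost.
monthly_cloud_spend = 100_000        # USD, hypothetical mid-size bill
multi_region_premium = 0.05          # top end of the 2-5% overhead range cited above
downtime_cost_per_hour = 300_000     # USD/hour, large-enterprise average cited in the FAQ below
outage_hours = 16                    # length of the October us-east-1 outage

annual_premium = monthly_cloud_spend * 12 * multi_region_premium
one_outage = downtime_cost_per_hour * outage_hours

print(f"Multi-region premium per year: ${annual_premium:,.0f}")              # $60,000
print(f"Cost of one 16-hour outage:    ${one_outage:,.0f}")                  # $4,800,000
print(f"Outage vs. premium:            {one_outage / annual_premium:.0f}x")  # 80x
```

Even if my assumed bill is off by an order of magnitude in either direction, that ratio never gets anywhere near 1.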
What About the AI Angle? (Fine, Let’s Talk About It)
(Because you won’t let it go)
Yes, Amazon is probably replacing engineers with AI. Yes, that’s probably making outages worse. But let’s be honest about what AI can and can’t do in infrastructure:
AI is terrible at:
- Novel problem-solving under pressure
- Understanding cascading failures
- Making judgment calls with incomplete information
- Fixing things it’s never seen before
AI is okay at:
- Pattern recognition in logs
- Automated remediation of known issues
- Capacity planning
- Routine maintenance tasks
The problem isn’t that AWS is using AI—it’s that they’re (allegedly) using it to replace the humans who handle the hard stuff. That’s like replacing your surgeon with WebMD and wondering why the operation went sideways. (If you want the full breakdown on why AI can’t replace critical workers, check out our deep dive on AI productivity theater.)
But here’s the thing: even if AWS had a full staff of brilliant engineers, us-east-1 would still be the most outage-prone region because of its size, age, and role as AWS’s testing ground. The AI staffing issue is real, but it’s not the root cause of your downtime.
So What Do You Actually Do?
(The part where I stop being mean and start being helpful)
Look, I’m not saying abandon AWS. Despite the outages, they’re still statistically the most reliable major cloud provider. But for the love of all that is holy, stop pretending us-east-1 is production-ready for critical workloads.
Here’s your actual action plan:
Short-term (Do This Yesterday):
- Audit what’s actually running in us-east-1
- Move critical workloads to us-east-2, us-west-2, or literally anywhere else
- Set up proper monitoring and alerting for regional health (a minimal probe sketch follows this list)
- Have a documented failover plan (and test it)
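On the monitoring point: you don’t need a vendor for a first pass. Here’s a minimal sketch assuming boto3 and nothing else; the region list, the namespace, and the metric name are mine, and the cheap read-only DynamoDB call is just one convenient reachability probe.

```python
"""Poll each watched region with a cheap, read-only API call and publish
the result as a custom CloudWatch metric you can alarm on."""
import boto3
from botocore.config import Config

WATCHED_REGIONS = ["us-east-1", "us-east-2", "us-west-2"]

# Publish and alarm from a region OTHER than the one most likely to be on fire.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-2")
FAST_FAIL = Config(connect_timeout=3, read_timeout=3, retries={"max_attempts": 1})

def region_reachable(region: str) -> float:
    """Return 1.0 if the regional DynamoDB endpoint answers, else 0.0."""
    try:
        boto3.client("dynamodb", region_name=region, config=FAST_FAIL).list_tables(Limit=1)
        return 1.0
    except Exception:
        return 0.0

for region in WATCHED_REGIONS:
    cloudwatch.put_metric_data(
        Namespace="Custom/RegionHealth",  # hypothetical namespace
        MetricData=[{
            "MetricName": "ApiReachable",
            "Dimensions": [{"Name": "Region", "Value": region}],
            "Value": region_reachable(region),
        }],
    )
```

Run it on a schedule from somewhere outside the watched regions, put a CloudWatch alarm on ApiReachable dropping below 1, and you’ll hear about the next us-east-1 event from your pager instead of from social media.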
Medium-term (Next Quarter):
- Implement multi-AZ for everything that matters
- Start planning multi-region for revenue-critical services
- Calculate your actual cost of downtime vs. cost of resilience (our time management strategies apply to infrastructure planning too)
- Stop lying to yourself about “acceptable risk”
Long-term (This Year):
- Full multi-region deployment for critical paths
- Automated failover testing (chaos engineering, anyone? there’s a drill sketch after this list)
- Regular disaster recovery drills
- Accept that resilience costs money and budget for it
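And on the failover-testing item: here’s one low-drama way to rehearse it, sketched on the assumption that you’re using a Route 53 failover record guarded by a health check (like the setup sketched in the FAQ below). The health-check ID and hostname are placeholders, and you’d run this in a window where a real failover is acceptable.

```python
"""Quarterly failover drill: invert the primary health check so Route 53
treats a healthy primary as failed, confirm DNS shifts to the standby,
then put everything back."""
import socket
import time
import boto3

route53 = boto3.client("route53")
PRIMARY_HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"  # placeholder
APP_HOSTNAME = "api.example.com"                                  # placeholder failover record name

def force_failover(inverted: bool) -> None:
    # Inverting the health check flips its status without touching the app,
    # which makes it a reversible way to simulate "primary region is down".
    route53.update_health_check(HealthCheckId=PRIMARY_HEALTH_CHECK_ID, Inverted=inverted)

force_failover(True)
try:
    time.sleep(180)  # give the health check and the record's TTL time to flip
    print("resolves to:", socket.gethostbyname(APP_HOSTNAME))
    # Check the address belongs to the standby region, and that the app
    # actually works there, before calling the drill a pass.
finally:
    force_failover(False)  # always restore normal routing, even if the drill fails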
The “But It’s Expensive” Objection:
Yeah, multi-region costs more. You know what costs more than that? Explaining to your board why you lost millions in revenue because you were too cheap to architect for known failure modes. Pick your expensive.
The Uncomfortable Truth
(This is where I lose half my readers)
The real story here isn’t “AI bad” or “AWS unreliable.” It’s that we’ve collectively decided that architectural shortcuts are acceptable until they’re not, and then we act shocked when the bill comes due.
AWS us-east-1 is going to keep having outages. That’s not speculation—it’s pattern recognition. The region is too big, too old, and too critical to AWS’s testing pipeline to ever be truly stable. And Amazon’s staffing decisions, whether AI-driven or not, aren’t going to change that fundamental reality.
You can keep rolling the dice and writing angry Medium posts every time it happens. Or you can accept that resilience costs money, do the work, and sleep better at night.
Your call.
Comment Bait (Tell Me I’m Wrong)
(I dare you)
- Are you still running production in us-east-1? Why?
- What’s your actual cost of downtime vs. cost of multi-region?
- Has anyone successfully migrated away and lived to tell the tale?
- Am I being too harsh, or not harsh enough?
Drop your war stories below. Bonus points if you admit you’re still in us-east-1 and have no plans to leave.
NOTE: The takeaway: AWS us-east-1 is going to keep failing. AI might be making it worse, but the real problem is your refusal to architect for known failure modes. Multi-region costs money. Downtime costs more. Do the math.
FAQs About Cloud Reliability
Q: Is AWS really that unreliable?
A: No—AWS is actually the most reliable of the big three cloud providers. In 2024-2025, AWS had 38 incidents with an average duration of 1.5 hours. Compare that to GCP’s 78 incidents averaging 5.8 hours, or Azure’s 9 incidents averaging 14.6 hours. But us-east-1 specifically has more outages than other AWS regions because it’s the largest, oldest, and serves as AWS’s testing ground for new features. It’s not that AWS is bad; it’s that you’re choosing their most volatile region for critical workloads.
Q: Should I switch to Google Cloud or Azure?
A: Probably not. GCP had 78 incidents in 2024-2025 (vs AWS’s 38) with an average outage duration of 5.8 hours (vs AWS’s 1.5 hours). Azure has fewer incidents but when they go down, they stay down—14.6 hour average. The problem isn’t the provider—it’s single-region architecture. If you’re not willing to do multi-region on AWS, you won’t do it on GCP either, and you’ll just trade one set of outages for another.
Q: How much does multi-region actually cost?
A: Typically 2-5% more in cloud spend due to data transfer costs, plus engineering time for setup and maintenance. Compare that to the average $300,000/hour cost of downtime for large enterprises. For most businesses with revenue-critical services, multi-region pays for itself after one avoided outage.
Q: What’s wrong with us-east-1 specifically?
A: It’s AWS’s largest and oldest region, has the most availability zones, and serves as the testing ground for new AWS features. This makes it inherently more prone to outages. The region has a documented history of major failures: November 2020 (Kinesis), December 2021 (Service Event), June 2023 (Lambda), July 2024 (Kinesis Data Streams), and now October 2025 (DynamoDB). It’s not a secret—AWS engineers will tell you this off the record. The region is too critical to AWS’s operations to ever be truly stable.
Q: Is Amazon really replacing engineers with AI?
A: Amazon announced workforce reductions in AWS while simultaneously expanding AI capabilities, so… probably? But even if they weren’t, us-east-1 would still have more outages than other regions. The AI staffing issue might be making things worse, but it’s not the root cause of regional instability.
Q: What’s the minimum viable multi-region setup?
A: Active-passive failover to a second region for critical services. Your primary region handles all traffic, but you have a warm standby that can take over if the primary fails. It’s not perfect, but it’s infinitely better than single-region and costs way less than active-active. Start there and iterate.
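For the DNS half of that active-passive setup, a sketch along these lines is roughly what it takes, assuming the app already runs in both regions and you’ve created a Route 53 health check against the primary’s endpoint. The hosted zone ID, health-check ID, hostname, and IPs are placeholders, not real values.

```python
"""Upsert a PRIMARY/SECONDARY failover record pair so traffic follows the
healthy region automatically."""
import boto3

route53 = boto3.client("route53")
HOSTED_ZONE_ID = "Z0000000000EXAMPLE"      # placeholder
PRIMARY_HEALTH_CHECK_ID = "hc-primary-id"  # placeholder; checks the primary region's endpoint

def failover_record(set_id, role, ip, health_check_id=None):
    record = {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,                  # "PRIMARY" or "SECONDARY"
        "TTL": 60,                         # low TTL so clients move quickly on failover
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        # Primary (say, us-east-2) takes all traffic while its health check passes.
        failover_record("primary", "PRIMARY", "203.0.113.10", PRIMARY_HEALTH_CHECK_ID),
        # Warm standby (say, us-west-2) only answers when the primary is marked unhealthy.
        failover_record("standby", "SECONDARY", "203.0.113.20"),
    ]},
)
```

The DNS piece is the easy 20%; the real work is keeping the standby’s data and deployments close enough to current that you’d actually trust it with traffic.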
Footnotes
1. Amazon Just Proved AI Isn’t The Answer Yet Again — Will Lockett on Medium
2. AWS US-EAST-1 Outage: Root Cause Analysis — October 19-20, 2025
3. The Cloud Crumbles: Recurring AWS Outages Expose Fragility of Digital Infrastructure — Chronicle Journal Markets
4. AWS Well-Architected Framework, Reliability Pillar
5. Cloud Outages Research — Cherry Servers Analysis (Aug 2024-Aug 2025): AWS (38 incidents, 1.5 hr avg), GCP (78 incidents, 5.8 hr avg), Azure (9 incidents, 14.6 hr avg)
6. Cloud Outages: A 2024 Analysis — Parametrix Cloud Outage Risk Report
7. Google Cloud Outage Analysis: June 12, 2025 — ThousandEyes