When people opened their devices on October 20, 2025, the internet felt strangely still. The cloud, usually buzzing quietly in the background, had gone quiet. For hours, millions of people couldn't log into work apps, stream a movie, or even check their smart thermostats.
Behind the scenes, Amazon Web Services (AWS), one of the most advanced digital infrastructures ever built, was buckling under the weight of a small, almost invisible flaw.
Few people understand what it means to build for, and survive, moments like this better than James Kretchmar, SVP and CTO for Akamai's Cloud Technology Group, the team responsible for one of the world's most distributed computing platforms.
Drawing on 21 years of helping the company support more than 4,400 points of presence globally, James shared his thoughts with UCToday on the lessons companies need to learn and what the future of IT resilience will look like.
What Happened: Dissecting the AWS Outage
The Domain Name System (DNS) is the digital address book that enables every website, API, and service to locate one another in milliseconds. When that system falters, everything built on top of it starts to wobble. That's exactly what happened inside AWS on that October morning.
"According to Amazon's public outage report, the root cause was related to DNS, one of the fundamental layers of the internet," Kretchmar explained. "It's critical not only for customers accessing services, but also for internal systems."
At the center of the disruption was something deceptively simple: a race condition. That's a software defect that occurs when two processes that are expected to run in a specific order instead overlap unpredictably.
"In this case, that timing defect led to blank DNS responses, which caused parts of the service to fail."
That minor timing flaw set off a chain reaction. DNS requests began returning empty results. Load balancers failed to connect to healthy nodes. Internal monitoring systems (many of which also relied on AWS's own DNS) started timing out.
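The class of timing bug Kretchmar describes can be illustrated with a deliberately simplified sketch. The names and logic below are hypothetical, not AWS's actual system: a writer publishes DNS plans while a janitor process deletes ones it believes are obsolete, and because the janitor acts on a stale snapshot rather than the live value, it can wipe a freshly published record, leaving resolvers with an empty answer.

```python
# Illustrative race condition in a DNS-style control plane.
# All names are hypothetical; this is a teaching sketch, not AWS's design.

records = {}  # authoritative store: hostname -> list of IPs

def publish(name, plan):
    """Writer process: install the latest plan for a name."""
    records[name] = plan

def cleanup(name, known_plan):
    """Janitor process: delete a plan it believes is obsolete.
    BUG: it trusts a snapshot taken earlier, not the live value,
    so it can delete a plan that was re-published in between."""
    if records.get(name) is not None and known_plan is not None:
        del records[name]  # wipes whatever is there NOW

def resolve(name):
    """Resolver: returns the plan, or an empty answer if none exists."""
    return records.get(name, [])

# Intended order: clean up the OLD plan, then publish the NEW one.
# The actual interleaving overlaps, and the stale cleanup fires last.
publish("db.internal", ["10.0.0.1"])   # old plan
snapshot = records.get("db.internal")  # janitor takes its snapshot here
publish("db.internal", ["10.0.0.2"])   # NEW plan lands first...
cleanup("db.internal", snapshot)       # ...then the stale cleanup deletes it
print(resolve("db.internal"))          # -> []  a blank DNS answer
```

The fix in systems like this is typically to make the delete conditional on the record still matching the snapshot (compare-and-swap), so a stale janitor loses the race harmlessly.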
Within minutes, thousands of companies were reporting partial or total outages. Collaboration platforms froze. Retail checkouts stalled. Cloud contact centers went silent.
Even businesses that didn't use AWS directly were hit indirectly through partners and SaaS vendors that did. Analysts later estimated billions in lost productivity and revenue worldwide. For teams responsible for customer experience and uptime, it was a crash course in resilience engineering: a discipline focused not on preventing every fault, but on ensuring systems can bend without breaking.
"We can talk about how companies guard against that sort of thing. But fundamentally, it shows how something small and technical can ripple into a major outage because of how dependent everything is on these shared systems."
Why It Matters: Shared Risk in the Cloud Era
Outages of this scale often start small, buried deep in automation logic or change-control processes. But as Kretchmar notes, IT resilience depends as much on how organizations respond as on how their systems are designed. Every second counts, every dependency matters, and every assumption about reliability is suddenly tested in real time.
The AWS incident forced thousands of leaders to confront an uncomfortable truth. In the age of hyperscale computing, a failure in one provider's code can quickly become a failure for everyone.
"Even robust providers can suffer from rare, systemic failures."
The very cloud networks that power today's innovation have also woven a single, fragile web of dependence. Three hyperscalers now host more than 70 percent of enterprise workloads worldwide. Despite endless discussion of flexibility and redundancy, most of the world's information, communication, and trade still run through only a handful of digital gateways.
That dependency is only growing. Companies have spent the last decade rushing toward the cloud for flexibility and speed. Still, few have invested as much in understanding how to maintain steady operations when the unthinkable happens.
"Cloud dependence has created shared risk; a single vendor issue can ripple through global operations. The challenge for every organization is to architect for failure, not just for uptime."
That phrase, "architect for failure," has become a rallying cry in the resilience engineering community. It means designing systems, processes, and teams that anticipate disruption, detect it early, and adapt in real time.
It also means recognizing that IT resilience isn't just the responsibility of infrastructure teams. The C-suite and boards now have to treat reliability the same way they approach cybersecurity: as a measurable business risk.
Resilience Engineering Lessons for IT & Cloud Architects
For IT leaders and cloud architects, the 2025 AWS outage offered a humbling checklist of what to rethink. Kretchmar broke it down into three clear pillars: architecture, governance, and preparation.
"There are several pillars to building resiliency. The first is architectural design, making sure your systems can withstand different types of failure, including things like race conditions," he explained.
"But architecture alone isn't enough. You also need mechanisms that prevent one small issue from escalating into a larger outage. That might include self-healing systems that detect when something's gone wrong and automatically mitigate it."
This is resilience engineering at its most practical: designing for failure, not perfection. Systems must expect turbulence. It's what separates mature infrastructure from wishful thinking. Netflix famously tests its production environment with deliberate failure injection through chaos engineering; Akamai builds distributed self-healing networks that re-route traffic around trouble before users ever notice.
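A toy version of that re-routing behavior might look like the following; the node names and boolean health probes are hypothetical stand-ins for real monitoring:

```python
# Minimal sketch of self-healing routing: unhealthy nodes are filtered
# out and traffic shifts to the remaining pool automatically.
# Node names and health states are hypothetical.

def healthy_nodes(nodes):
    """Filter the pool down to nodes whose health probe passes."""
    return [name for name, ok in nodes.items() if ok]

def route(request_id, nodes):
    """Pick a healthy node for a request; fail fast if none remain."""
    pool = healthy_nodes(nodes)
    if not pool:
        raise RuntimeError("no healthy nodes left: page a human")
    return pool[request_id % len(pool)]  # simple deterministic spread

nodes = {"edge-a": True, "edge-b": True, "edge-c": True}
assert route(0, nodes) == "edge-a"

nodes["edge-a"] = False              # a probe detects a failure...
assert route(0, nodes) == "edge-b"   # ...traffic shifts; users never notice
```

Real platforms layer DNS steering, anycast, and load-balancer health checks on top of this idea, but the core loop is the same: probe, exclude, re-route.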
Yet the human side of these systems is just as vital as the technology itself.
"Beyond the technology, you need solid change management and governance: reviewing systems regularly and ensuring best practices are consistently applied," said Kretchmar. "Incident management is also critical, so when something does go wrong, your team knows exactly how to respond."
Service disruptions will always happen, but disorder doesn't have to follow. The most prepared organizations schedule thorough change reviews, roll out updates in stages, and keep well-practiced rollback procedures ready to go.
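That staged-rollout discipline can be sketched in a few lines. The deploy, error-rate, and rollback callables here are hypothetical stand-ins for real deployment tooling:

```python
# Sketch of a staged rollout with automatic rollback. The callables are
# hypothetical stand-ins for real CI/CD and observability tooling.

STAGES = [1, 10, 50, 100]   # percent of the fleet per stage
ERROR_BUDGET = 0.01         # abort if the error rate exceeds 1%

def staged_rollout(deploy, error_rate, rollback):
    """Push a change in widening stages; roll back on the first regression."""
    for pct in STAGES:
        deploy(pct)
        if error_rate() > ERROR_BUDGET:
            rollback()
            return f"rolled back at {pct}%"
    return "rolled out to 100%"

# Simulated run: the change misbehaves once it reaches 50% of the fleet.
log = []
rates = {1: 0.001, 10: 0.002, 50: 0.08, 100: 0.0}
current = {"pct": 0}
result = staged_rollout(
    deploy=lambda pct: (current.update(pct=pct), log.append(f"deploy {pct}%")),
    error_rate=lambda: rates[current["pct"]],
    rollback=lambda: log.append("rollback"),
)
print(result)  # -> rolled back at 50%
```

The value of staging is exactly what the simulation shows: a defect that would have hit the whole fleet is caught while it still affects a fraction of it.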
"Scenario planning is invaluable," Kretchmar advised. "Run 'what if' exercises in peacetime, simulate major failures, identify gaps, and close them before they become problems."
Lessons for Security Leaders
When the cloud falters, security often takes the hit first. Logs stop updating. Alerts fall silent. The very tools meant to detect and contain threats can vanish in the same outage that caused the crisis. The 2025 AWS incident was no exception.
Kretchmar told us, "Resiliency really matters for security systems. It's crucial to probe your vendors on how they maintain reliability, not just their SLA numbers."
It's easy to assume that security tools, such as firewalls, monitoring systems, and identity platforms, are immune to the same risks that bring down business applications. They aren't. Most run on the same cloud backbones, governed by the same control planes.
That means the same race condition that knocked out DNS could just as easily silence an intrusion-detection feed or disable authentication for an entire workforce.
For Kretchmar, the difference between surviving and suffering in those moments comes down to diligence.
"Ask detailed questions: How do they roll out changes? How do they phase deployments to avoid breaking things? We've seen major incidents where updates were pushed too quickly and caused outages in security software."
He added:
"So, for security leaders, it's about due diligence. Understand your vendors' processes deeply and make sure their reliability practices match the criticality of their role."
Resilience Engineering Lessons for the C-Suite & Boards
For executives, the 2025 AWS outage was a pivotal moment in the boardroom. Overnight, service interruptions that began in data centers rippled into investor calls, customer support escalations, and front-page news. James Kretchmar's advice to the C-suite is disarmingly straightforward:
"Boards can approach this the same way they already think about cybersecurity risk."
That framing matters. Cybersecurity has long been viewed as a collective responsibility, backed by dedicated funding, regular reporting, and constant auditing. Cloud reliability and business continuity should be governed with that same seriousness, Kretchmar noted, before adding:
"Identify potential risks, understand your exposure, and ensure there's a clear plan to mitigate those risks. You don't need to prescribe technical solutions, just create the framework and keep the discussion active."
In other words, executives don't need to be cloud architects; they just need to ask the right questions:
- Where are our single points of failure?
- How do our vendors test their own IT resilience?
- When was our last real-world simulation of a full-scale outage?
The answers reveal how prepared a company truly is, and the questions should be asked again, regularly.
"Regularly reviewing and reassessing resilience helps keep everyone aligned and ensures it remains a top priority."
Reliability isn't built in crisis; it's shaped by culture and by leaders who value operational stability as much as innovation. Resilience engineering becomes an integral part of brand protection, safeguarding both customer trust and shareholder confidence.
Pragmatically, some executives are turning to partners who can shoulder some of that load. However, that doesn't entirely take the work away from leaders.
Lessons for Business & Strategy Leaders
Probably the biggest question for leaders to ask following the AWS outage is this: "Should business leaders be avoiding cloud concentration?"
Kretchmar believes that question sits at the heart of every modern strategy conversation.
"It's definitely worth considering, though the right approach depends on the use case," he said. "For workloads like virtual machines or object storage, multi-cloud makes sense. Designing with portable technologies allows you to switch clouds if one fails. The key is avoiding lock-in with proprietary features."
That flexibility is the essence of a multi-cloud strategy: the ability to move or replicate workloads across providers without rewriting everything from scratch. It's a major part of resilience engineering, but one that many struggle with. IDC estimates that more than 80 percent of global enterprises now use more than one cloud provider, yet only a fraction can shift production workloads seamlessly when disaster strikes.
"Technologies developed under the Cloud Native Computing Foundation (CNCF) are great examples," Kretchmar added. "They're open, portable, and supported across providers."
These open frameworks enable companies to build once and run anywhere, reducing their dependency on any single vendor's quirks or outages.
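One way to read that advice in code: keep the application behind a thin, provider-neutral interface so that switching clouds is a one-line change rather than a rewrite. In this sketch the two provider classes are in-memory stand-ins, not real cloud SDKs:

```python
# Sketch of provider-neutral object storage. The "providers" here are
# in-memory stand-ins for real cloud SDKs; the point is that the app
# only ever talks to the neutral interface.

class ObjectStore:
    """Provider-neutral interface the application codes against."""
    def put(self, key, data): raise NotImplementedError
    def get(self, key): raise NotImplementedError

class CloudA(ObjectStore):
    def __init__(self): self._data = {}
    def put(self, key, data): self._data[key] = data
    def get(self, key): return self._data.get(key)

class CloudB(ObjectStore):
    def __init__(self): self._data = {}
    def put(self, key, data): self._data[key] = data
    def get(self, key): return self._data.get(key)

def run_workload(store: ObjectStore):
    """The application: it neither knows nor cares which cloud it's on."""
    store.put("invoice-42", b"payload")
    return store.get("invoice-42")

# Failover is a one-line swap, not a rewrite:
assert run_workload(CloudA()) == b"payload"
assert run_workload(CloudB()) == b"payload"
```

The same pattern underlies CNCF-style portability: code against open interfaces (container runtimes, S3-compatible APIs, standard ingress), and the provider becomes a deployment detail.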
Still, Kretchmar cautioned against blind adoption.
"That said, there are exceptions. For example, with security solutions, having multiple overlapping systems can create more risk than resilience. Integration complexity can introduce its own failure points. So for security, it's often better to pick one robust solution and go deep with it."
That's the balance leaders now face: freedom versus simplicity, flexibility versus focus. The answer lies in business priorities. A retailer might value redundancy; a healthcare provider might value regulatory clarity.
"If you depend too much on one provider's unique services, you lose flexibility. For most workloads, using open, portable standards helps prevent that, though there are exceptions. But overall, it's a smart way to maintain control."
Looking Ahead: Resilience Engineering for the Future
Every major outage leaves behind two kinds of companies: those that rush to patch the problem and those that decide to change the way they think. James Kretchmar belongs firmly to the second camp. His final reflections aren't about AWS at all; they're about the discipline required to make resilience a habit, not a headline.
"It really comes down to consistent attention and investment. It's too easy to ignore reliability until something breaks. Just like with cybersecurity, we have to recognize reliability as a critical, ongoing commitment," Kretchmar said. "The growing complexity of systems is part of the challenge; complexity can be the enemy of reliability if not managed carefully."
But as Kretchmar says, it "can" be managed. Organizations just need to focus on it every day, not just after a crisis. "At Akamai, even with our strong track record, we've learned hard lessons along the way. Twenty years ago, we had an incident caused by a bad change, which led us to overhaul our systems to make them far more robust. Those investments have paid off ever since."
Sustainable IT resilience means accepting that the work is never done. Governance reviews, incident drills, and multi-region tests all form part of an ongoing cycle of improvement.
True cloud reliability, then, isn't just about failovers and backups. It's about culture. Teams that celebrate uptime, learn openly from mistakes, and build feedback loops into every deployment create systems that genuinely improve with time. Those that treat resilience as a box to tick tend to encounter the same failures again, albeit at a higher cost.
Resilience Engineering: A Shared Responsibility
The 2025 AWS outage reminded every CIO, CTO, and boardroom that resilience isn't something you buy; it's something you build.
James Kretchmar's reflections make one thing clear: resilience is everyone's business. From engineers writing deployment scripts to executives approving budgets, the ability to withstand disruption now defines an organization's credibility as much as its customer experience.
For Kretchmar, it all comes back to discipline and humility:
"When the cloud provider fails, you discover how much you truly depend on it. The question isn't if you'll have an outage, it's when, and how you'll respond."
Ultimately, engineering for resilience isn't a guarantee of perfection. It's a culture of readiness, tested in real-world pressure, and proven by every organization that chooses to learn before the lights go out again.