In the early hours of October 20, 2025, Amazon Web Services (AWS) experienced one of the most significant service disruptions of the year. The incident originated in the US-EAST-1 (Northern Virginia) region, one of the provider’s oldest and busiest, which hosts critical components of its global services.

For several hours, thousands of applications and online platforms — including Snapchat, Fortnite, Duolingo, Alexa, Coinbase, and Robinhood — suffered connection errors, latency issues, or partial outages.

AWS later confirmed an increase in errors and response times across several services tied to that region.

What Really Happened in Virginia

Technical analyses and AWS status reports indicate that the issue was linked to a failure in the internal DNS resolution system, one of the most critical elements of its infrastructure. The Domain Name System (DNS) translates domain names (such as api.mycompany.com) into IP addresses so that applications can communicate with each other.

When this system fails, servers may continue running, but they can no longer “find” one another: requests fail because domain names cannot be resolved. In this case, AWS’s post-event summary traced the outage to a defect in the automated DNS management for DynamoDB, which left the service’s regional endpoint without a valid DNS record.

As that service degraded, many applications could no longer resolve the names of their own instances or connect to their databases, even though the servers themselves remained operational.
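The failure mode described above can be illustrated with a minimal Python sketch. It is only an illustration, not a reproduction of AWS internals: the hostname `db.internal.example`, the address `10.0.0.5`, and the two stub resolvers are hypothetical, and the resolver is injectable so that the example runs without network access.

```python
import socket
from typing import Callable, List


def resolve(hostname: str, resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted unique IP addresses for hostname, or [] if resolution fails."""
    try:
        infos = resolver(hostname, None)
    except socket.gaierror:
        # The name could not be resolved; the target server may still be running.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def connection_status(hostname: str, resolver: Callable = socket.getaddrinfo) -> str:
    """Describe whether a client could even attempt to connect to hostname."""
    addresses = resolve(hostname, resolver)
    if not addresses:
        return f"{hostname}: unreachable, name did not resolve"
    return f"{hostname}: resolved to {addresses[0]}"


# Hypothetical stub resolvers standing in for a working and a failing DNS service.
def working_dns(host, port):
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.5", 0))]


def broken_dns(host, port):
    raise socket.gaierror(socket.EAI_NONAME, "Name or service not known")
```

With `working_dns` the client gets an address to connect to. With `broken_dns` the same, still-running database becomes unreachable, because the client never learns where to send its request.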

The result was a regional failure with global impact, as countless organizations rely on infrastructure hosted in Virginia to operate their services — even when their users connect from other continents. It was not a global AWS outage, nor a multi-region event, but rather a localized failure in a critical region that exposed how much dependency can concentrate in the cloud.

Designing for Continuity: The Role of Global Traffic Management

The cloud offers scalability and simplified management — but it does not eliminate architectural responsibility. Replicating servers within the same region (for example, across “Availability Zones”) does not guarantee continuity if the entire region becomes unavailable.

Building a truly robust infrastructure means designing for the complete loss of a region — and still being able to keep services online. This is where global traffic management, also known as Global Server Load Balancing (GSLB), becomes essential. Such a system operates above individual data centers or regions.

It continuously monitors multiple distributed endpoints and automatically redirects traffic to the one that remains available and responsive. If a region stops responding — as happened in Virginia — the load balancer can update public DNS records so that users are routed to another active environment.

In practice, this mechanism provides two fundamental benefits:

  • Automatic failover between regions, ensuring that an outage in one location does not interrupt global service.
  • A foundation for disaster recovery, since continuity no longer depends on manual actions or static configurations.
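The monitoring and failover behavior described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation of any particular product: the `EndpointState` class, the threshold of three consecutive failed probes, and the documentation-range IP addresses are all hypothetical, and the probe result is passed in rather than measured over the network.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EndpointState:
    """Health state of one regional endpoint (hypothetical model)."""
    ip: str
    healthy: bool = True
    failures: int = 0  # consecutive failed probes

    def record_probe(self, ok: bool, fail_threshold: int = 3) -> None:
        """Update the state after one health probe."""
        if ok:
            # A successful probe clears the failure count and restores the endpoint.
            self.failures = 0
            self.healthy = True
        else:
            self.failures += 1
            # Mark the endpoint down only after several consecutive failures,
            # so that a single lost probe does not trigger a failover.
            if self.failures >= fail_threshold:
                self.healthy = False


def choose_answer(endpoints: List[EndpointState]) -> Optional[str]:
    """Return the IP of the first healthy endpoint in priority order, or None."""
    for endpoint in endpoints:
        if endpoint.healthy:
            return endpoint.ip
    return None
```

In a real deployment the probe would typically be a TCP or HTTP check run on a schedule, and the value returned by `choose_answer` would be published as the public DNS answer. Requiring several consecutive failures is a common design choice that trades a slightly slower failover for fewer spurious switches.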

However, for this approach to be truly effective, the regions or data centers involved must be completely independent from each other. If both environments share the same control plane, internal DNS, or network services, a failure in that common layer could affect both simultaneously.

That’s why GSLB can only ensure real continuity when deployed between operationally isolated environments. In other words: GSLB would not have prevented the AWS outage, but it would have allowed organizations with independent regional architectures to keep their services running while the affected region recovered.
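One way to reason about operational independence is to list what each environment depends on and look for overlaps. The sketch below assumes a hand-maintained inventory; the environment names and dependency labels are hypothetical and serve only to illustrate the check.

```python
from collections import defaultdict
from typing import Dict, List, Set


def common_layers(environments: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Return each dependency used by more than one environment, with its users."""
    users = defaultdict(list)
    for env_name, dependencies in environments.items():
        for dependency in dependencies:
            users[dependency].append(env_name)
    # A dependency shared by two or more environments is a potential common failure point.
    return {dep: sorted(envs) for dep, envs in users.items() if len(envs) > 1}


# Hypothetical inventory: two sites that look separate but share one service.
inventory = {
    "site-a": {"control-plane-a", "internal-dns-a", "shared-identity-service"},
    "site-b": {"control-plane-b", "internal-dns-b", "shared-identity-service"},
}
```

Here `common_layers(inventory)` reports `shared-identity-service` as used by both sites. An empty result would suggest that, at least according to the inventory, failover between the two environments is not undermined by a shared layer.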

How SKUDONET Applies This Approach

SKUDONET Enterprise Edition integrates a Global Server Load Balancing (GSLB) system designed to maintain service availability across geographically distributed data centers or regions.

Operating at the DNS level, it continuously monitors the health of applications in each location. If one site becomes unavailable, it automatically updates DNS resolution to redirect users to another operational data center.

The GSLB can operate in active-passive mode, ensuring automatic recovery in Disaster Recovery scenarios, or in active-active mode, sharing traffic between multiple data centers to optimize latency and overall performance.
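The difference between the two modes can be sketched generically in Python. This is not SKUDONET's configuration syntax or API. The `dns_answers` function, the site names, and the addresses are hypothetical, and the sketch shows only which addresses a DNS-level balancer might hand out in each mode.

```python
from typing import List, Set, Tuple

Site = Tuple[str, str]  # (site name, IP address), listed in priority order


def dns_answers(mode: str, sites: List[Site], healthy: Set[str]) -> List[str]:
    """Return the IPs to publish for the given mode, considering only healthy sites."""
    up = [ip for name, ip in sites if name in healthy]
    if mode == "active-passive":
        # Only the highest-priority healthy site receives traffic.
        return up[:1]
    if mode == "active-active":
        # Every healthy site receives traffic; clients are spread across them.
        return up
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical pair of data centers.
sites = [("dc-primary", "192.0.2.10"), ("dc-secondary", "198.51.100.10")]
```

With both sites healthy, active-passive publishes only the primary address while active-active publishes both. If `dc-primary` fails, both modes publish only the secondary address, which is the failover behavior described above.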

Its design allows combining environments within the same provider or across different ones — as long as they remain operationally independent — thereby avoiding single points of failure.

In this way, SKUDONET provides an external control layer that strengthens high availability and service continuity strategies, even during severe regional disruptions.

📘 Technical reference: How Global Server Load Balancing works in SKUDONET

The AWS outage in Virginia showed that even the most mature infrastructures can experience critical regional failures. The lesson is not to avoid the cloud, but to design with failure in mind — assuming that any region can go offline at any time. Separating environments and managing traffic at a global level does not eliminate errors, but it allows business operations to continue when they occur.