
Architecting for Failure: Lessons from the AWS Outage

October 27, 2025 | 13 min read

On October 20th, AWS suffered a major outage in its us-east-1 region, causing significant disruption across multiple industries:

  • AWS-reliant Amazon services such as Chime and Ring were knocked completely offline
  • Multiple payment processors were impacted, leaving stores unable to accept card payments
  • Flights were delayed and cancelled, primarily in the US

The incident highlighted the underlying fragility of our modern, interconnected infrastructure. CNN estimated the total impact of the outage, which lasted only around one day, at “hundreds of billions of dollars” in direct losses and indirect productivity loss.

This has been covered extensively by industry experts, and Amazon has published a deep root cause analysis of its own (spoiler - it was DNS), so we will not add much value by opining on how AWS could have avoided the incident or recovered faster. What we do have thoughts on is how all affected parties could have ensured this outage did not become a complete outage of their own offerings.

Here at Ikigai, we do not architect around failure - we architect for failure. Outages, network partitions and disasters will happen, and should be treated as an inevitability. What matters is not if your system goes down, but how it handles it.

The non-negotiables: self-healing and loose coupling

We believe strongly in many architectural principles that enable velocity, reliability and effectiveness in software development. However, the two most relevant to building resilient systems are self-healing systems and loosely coupled systems.

Self-healing systems and graceful degradation

One of the most important and non-negotiable design properties of any modern system is not how well it avoids downtime, but how quickly and independently it recovers from it. Self-healing systems automatically detect when dependencies come back online and execute recovery actions such as rolling restarts or reconnection attempts without human intervention, minimising both the Recovery Time Objective (RTO) and maintenance overhead.
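As a minimal sketch of what this can look like in application code (the dependency and its client are simulated placeholders, not a specific library), a self-healing loop periodically checks health and reconnects with capped exponential backoff:

```typescript
// Minimal self-healing loop: detect an unhealthy dependency and reconnect
// with capped exponential backoff, with no human intervention.
type Connection = { healthy: boolean };

// Simulated flaky dependency: fails to connect roughly half the time.
async function connectToDependency(): Promise<Connection> {
  if (Math.random() < 0.5) throw new Error("connection refused");
  return { healthy: true };
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function maintainConnection(): Promise<void> {
  let conn: Connection | null = null;
  let attempt = 0;

  while (true) {
    if (conn === null || !conn.healthy) {
      try {
        conn = await connectToDependency();
        attempt = 0; // dependency is back: reset the backoff
        console.info("dependency connection (re)established");
      } catch {
        // Exponential backoff with a cap, so a recovering service is not hammered.
        const delayMs = Math.min(30_000, 1_000 * 2 ** attempt);
        attempt += 1;
        console.warn(`reconnect failed, retrying in ${delayMs} ms`);
        await sleep(delayMs);
        continue;
      }
    }
    await sleep(5_000); // periodic health-check interval
  }
}

maintainConnection();
```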

The other side of the self-healing coin is graceful degradation. Graceful degradation ensures systems continue to function, even at reduced capacity during partial failures, by prioritising core functionality, shedding non-essential features, providing informative user feedback, and implementing fallback mechanisms. This approach minimises data loss, controls performance impact, and empowers users, ultimately reducing frustration and lost revenue while building trust. Graceful degradation is partially achieved by our next topic - loose coupling.
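A small, hypothetical example of graceful degradation on a read path: if the live balance service is unavailable, fall back to the last cached value and tell the user it may be stale, rather than failing the whole request. The function and cache names below are illustrative only.

```typescript
type BalanceView = { amount: number; stale: boolean; message?: string };

// In-memory cache of the last successfully fetched balances (illustrative).
const balanceCache = new Map<string, number>();

// Placeholder for a real downstream call; here it always fails, as during an outage.
async function fetchLiveBalance(accountId: string): Promise<number> {
  throw new Error("balance service unavailable");
}

async function getBalance(accountId: string): Promise<BalanceView> {
  try {
    const amount = await fetchLiveBalance(accountId);
    balanceCache.set(accountId, amount); // keep the fallback fresh
    return { amount, stale: false };
  } catch {
    const cached = balanceCache.get(accountId);
    if (cached !== undefined) {
      // Degraded but useful: core functionality plus informative user feedback.
      return { amount: cached, stale: true, message: "Showing last known balance" };
    }
    // Nothing to fall back to: fail this feature only, with a clear message.
    throw new Error("Balance temporarily unavailable, please try again shortly");
  }
}

getBalance("acc-123").catch((err) => console.warn(err.message));
```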

Loose coupling

Loose Coupling is a fundamental architectural principle that significantly enhances the resilience and agility of complex systems. It dictates that components within a system, or even entire systems interacting with each other, should operate with minimal interdependencies. This design philosophy ensures that changes, failures, or issues in one part of the system have a limited and controlled impact on other parts.

Consider, for example, a scenario where downstream dependencies, such as third-party SaaS eKYC solutions, are integrated into an application. If these external services are hosted in a region that experiences an outage or a performance degradation, a system designed with loose coupling will immediately limit the blast radius of such an incident. This means that only functions directly reliant on those specific eKYC dependencies will be affected. The rest of the application, including its core functionalities and other integrated services, will continue to operate without interruption.

In contrast, a tightly coupled system would likely experience a cascading failure. A problem with the eKYC solution could bring down the entire application, as other components might be unable to function without direct and constant access to the impacted service. This highlights the critical importance of loose coupling in maintaining high availability and fault tolerance, especially in distributed systems that rely on numerous internal and external services.
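One common way to enforce this isolation in code is to wrap the eKYC adapter in a circuit breaker so that, during a provider outage, onboarding fails fast and degrades to a queued check while everything else carries on. The sketch below is a simplified illustration; all names and thresholds are assumptions rather than a prescribed implementation.

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures = 3, private resetAfterMs = 60_000) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    const open =
      this.failures >= this.maxFailures &&
      Date.now() - this.openedAt < this.resetAfterMs;
    if (open) throw new Error("eKYC circuit open: failing fast");

    try {
      const result = await fn();
      this.failures = 0; // provider is healthy again: close the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      this.openedAt = Date.now();
      throw err;
    }
  }
}

// Simulated third-party eKYC call; during an outage it simply throws.
async function callEkycProvider(customerId: string): Promise<"verified"> {
  throw new Error("eKYC provider region is down");
}

const ekycBreaker = new CircuitBreaker();

// Only the onboarding journey degrades: the check is marked pending and retried
// later, instead of the failure cascading into unrelated features.
async function verifyIdentity(customerId: string): Promise<"verified" | "pending"> {
  try {
    return await ekycBreaker.call(() => callEkycProvider(customerId));
  } catch {
    return "pending";
  }
}

verifyIdentity("customer-123").then((status) => console.info(`KYC status: ${status}`));
```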

By isolating functionalities and services, loose coupling not only mitigates the impact of failures but also facilitates independent development, deployment, and scaling of individual components. This fosters greater flexibility, allowing teams to iterate more quickly and adapt to evolving business requirements without fear of disrupting the entire ecosystem.

Best practices for high availability

Beyond the non-negotiable architectural principles outlined above, there are many specific actions, both small and large, that an organisation can take to limit the impact of infrastructure outages.

Conscious selection of cloud regions

The us-east-1 region did not go down by accident. A fault in the automated DNS management for DynamoDB's regional endpoint, a key underlying service used by many other AWS services, left dependent services unable to resolve and connect to DynamoDB, bringing them offline with it (it's always DNS). us-east-1 is famously the cutting-edge AWS region, receiving updates and features first. Region selection should be treated as an important decision, with multiple factors affecting the choice:

  1. Historical stability of the region and proximity to potential disruption causes
  2. Proximity to customers to minimise latency and network traffic costs
  3. Cost of the key resources you plan to use - cloud providers charge dramatically different rates across geographical regions

There should never be a “default” geographical location for your system - understand the choices and trade-offs you are making by selecting it!

Availability zone replication

While it may not provide a complete solution for widespread, region-wide downtime events, replicating within a single cloud region across multiple availability zones significantly mitigates the risk of the vast majority of downtime incidents impacting the overall availability and performance of a solution. Industry estimates suggest that over 90% of cloud downtime occurrences are confined to a single availability zone.

By strategically distributing critical components and data across physically distinct, isolated availability zones within the same cloud region, organisations can achieve a higher degree of fault tolerance. In the event of an outage in one availability zone, traffic can be automatically rerouted to healthy instances in other zones, ensuring continuous operation and minimising service disruption. This architectural approach is a fundamental principle of building highly available and resilient cloud-native applications. While it doesn't safeguard against region-wide failures, its effectiveness in preventing the most common forms of downtime makes it an indispensable practice for any organisation serious about maintaining high service uptime.
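As a rough illustration of the idea using the AWS CDK in TypeScript, the sketch below provisions a VPC spanning three availability zones and a load-balanced container service whose tasks are spread across them; the container image and sizing values are placeholders, and the same pattern applies with any infrastructure-as-code tooling.

```typescript
import { App, Stack } from "aws-cdk-lib";
import * as ec2 from "aws-cdk-lib/aws-ec2";
import * as ecs from "aws-cdk-lib/aws-ecs";
import * as ecsPatterns from "aws-cdk-lib/aws-ecs-patterns";

const app = new App();
const stack = new Stack(app, "MultiAzServiceStack");

// Subnets are created in up to three availability zones of the chosen region.
const vpc = new ec2.Vpc(stack, "Vpc", { maxAzs: 3 });
const cluster = new ecs.Cluster(stack, "Cluster", { vpc });

// The load balancer health-checks each task and only routes to healthy ones,
// while the ECS scheduler spreads tasks across the VPC's availability zones,
// so the loss of a single zone does not take the service down.
new ecsPatterns.ApplicationLoadBalancedFargateService(stack, "Service", {
  cluster,
  desiredCount: 3, // placeholder sizing: roughly one task per zone
  taskImageOptions: {
    image: ecs.ContainerImage.fromRegistry("amazon/amazon-ecs-sample"), // placeholder image
  },
});

app.synth();
```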

Multi-region replication

For critical applications, a multi-region strategy within a single cloud provider should be considered in order to achieve superior resilience and ensure business continuity. This approach involves replicating applications and their associated data across multiple, geographically distinct regions operated by the same cloud provider, proactively mitigating the risk of a catastrophic failure that impacts an entire region.

By distributing resources in this manner, businesses can maintain availability and fault tolerance even when an entire data center or region experiences a complete outage, whether due to natural disasters, widespread power failures, or major network disruptions. The inherent redundancy of a multi-region setup significantly reduces the blast radius of any incident, ensuring that while one region might be affected, others remain operational and capable of serving user requests.

The benefits extend beyond mere disaster recovery. A multi-region strategy enables a faster RTO and minimises data loss (a lower Recovery Point Objective, or RPO) by allowing seamless failover to an unaffected region. Furthermore, this approach inherently improves global application performance. By serving users from the closest available region, latency is dramatically reduced, leading to a more responsive and satisfying user experience. This geographical proximity also optimises data transfer and processing, which can be critical for applications with stringent performance requirements.

While highly effective, multi-region replication is not without its drawbacks. Implementing and maintaining a multi-region strategy introduces significant complexities, increased operational overhead, and higher infrastructure costs due to the duplication of resources across geographical locations. Furthermore, data synchronization and consistency across regions can introduce latency challenges and require careful architectural considerations to avoid performance degradation. The increased complexity in deployment, monitoring, and incident response also necessitates specialized skills and robust automation.

Cloud agnostic architecture and multi-cloud replication

Modern tech implementations often fall prey to inadvertent vendor lock-in by building on deeply embedded, vendor-specific technologies. Although cloud providers have recovered from every outage to date, in each case your RTO becomes deeply coupled to the cloud provider's RTO. When recovery time matters deeply, as it does for mission-critical banking systems, such lock-in quickly becomes a systemic risk. Even if we have faith in their ability to quickly restore service, cloud providers have a history of deplatforming their own customers, changing pricing with little warning, or removing key features and functionality.

A cloud-agnostic architecture paired with hexagonal architecture isolates any dependencies on vendor-specific services and makes them easy to replace, ensuring that moving to a different cloud provider becomes an infrastructure migration task, not a deep rebuild of core technologies.

Cloud-agnostic architecture means designing the system around open standards, open-source technologies, or like-for-like alternatives to common vendor-locked tooling. This means avoiding proprietary, vendor-specific technologies such as DynamoDB or Amazon SQS and selecting open, portable alternatives such as ScyllaDB or Apache Kafka. It does not mean avoiding managed services such as RDS or MSK - simply ensuring that like-for-like replacements are readily available with other providers.

Hexagonal architecture, or Ports and Adapters, decouples core business logic from external concerns (UIs, databases, APIs). It uses "ports" (interfaces) for interaction and "adapters" (implementations) to connect to specific technologies. This design promotes decoupling, testability, flexibility, and maintainability by isolating the core application from the technology choices surrounding it.
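The sketch below illustrates the combination for event publishing: the core application depends only on an EventPublisher port, with one adapter backed by Apache Kafka (via kafkajs) and another by Amazon SQS (via the AWS SDK). The broker address and queue URL are placeholders; the point is that swapping providers touches an adapter, never the business logic.

```typescript
import { Kafka } from "kafkajs";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

// The port: what the core domain needs, expressed without any vendor types.
interface EventPublisher {
  publish(topic: string, payload: object): Promise<void>;
}

// Adapter 1: Apache Kafka via kafkajs (broker address is a placeholder).
class KafkaEventPublisher implements EventPublisher {
  private producer = new Kafka({ clientId: "core-banking", brokers: ["localhost:9092"] }).producer();

  async publish(topic: string, payload: object): Promise<void> {
    await this.producer.connect();
    await this.producer.send({ topic, messages: [{ value: JSON.stringify(payload) }] });
  }
}

// Adapter 2: Amazon SQS via the AWS SDK v3 (queue URL is a placeholder).
class SqsEventPublisher implements EventPublisher {
  private client = new SQSClient({});

  async publish(topic: string, payload: object): Promise<void> {
    await this.client.send(
      new SendMessageCommand({
        QueueUrl: `https://sqs.eu-west-1.amazonaws.com/123456789012/${topic}`,
        MessageBody: JSON.stringify(payload),
      }),
    );
  }
}

// Core business logic only ever sees the port, never Kafka or SQS.
async function recordPayment(publisher: EventPublisher): Promise<void> {
  await publisher.publish("payments", { amount: 100, currency: "GBP" });
}

// Swapping providers is a one-line wiring change at the edge of the system.
recordPayment(new KafkaEventPublisher()).catch(console.error);
```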

The natural end result of utilising these patterns is the ability to deploy your applications across multiple cloud providers, allowing a single set of source code to operate seamlessly on two different infrastructures. This multi-cloud approach offers various strategies for resilience, including hot-cold deployment and the more advanced hot-hot approaches, each with its own set of trade-offs.

  • Hot-Cold Deployment: In this model, your application runs on a primary cloud provider (hot), while a replica of your infrastructure is maintained in a dormant or less active state on a secondary cloud provider (cold). In the event of an outage in the primary region, the cold environment can be activated, and traffic rerouted. This approach offers a balance between resilience and cost, as the secondary environment consumes fewer resources when inactive. However, it typically involves a longer RTO due to the need for activation and data synchronization.
  • Hot-Hot Deployment: This strategy involves running your application simultaneously and actively across multiple cloud providers. Traffic is distributed between these active environments, and if one provider experiences an outage, the remaining active environments continue to serve requests seamlessly. This approach provides the highest level of availability and the lowest RTO, making it suitable for mission-critical applications that cannot tolerate any downtime. The trade-off is significantly higher operational complexity and increased cost due to maintaining redundant active infrastructure.

While offering unparalleled resilience and flexibility, adopting a cloud-agnostic architecture introduces significant trade-offs in terms of cost and time-to-market. The complexities inherent in leveraging open-source technologies and non-vendor-locked solutions often translate into higher development effort and operational overhead. Teams must invest more time in integrating and managing diverse components, which can be more challenging than relying on the streamlined, often opinionated, offerings of a single cloud provider. This increased complexity can slow down initial development and deployment, thereby extending time-to-market. Therefore, a cloud-agnostic strategy should not be a default choice but rather a carefully considered decision, weighed against the specific availability requirements, budget constraints, and delivery timelines of each application.

Isolated stand-in systems

Even with the most robust multi-cloud or multi-region strategies, the complete elimination of downtime for all services remains an elusive goal. For scenarios where even minimal disruption is unacceptable, such as critical payment processing, an "isolated stand-in system" can be a powerful solution. These systems are deployed on completely independent, isolated infrastructure, often with a simpler, more resilient architecture, and are designed to offer a small subset of critical features during an outage of the main system.

A prime example of this approach is Monzo Bank's "stand-in bank" system. In the event of an outage impacting their primary banking platform, Monzo has a separate, highly resilient system that can process essential transactions like card payments. This stand-in system doesn't offer the full suite of banking features but ensures that customers can still make purchases and access their funds, minimizing the real-world impact of a major system failure. This approach acknowledges the inevitability of some level of downtime and proactively provides a fallback for the most critical user journeys.

Disaster recovery testing

Any disaster recovery plan, no matter how meticulously designed, is only as effective as the team's and systems' ability to execute it under pressure. Disaster recovery is not a static process; it is a complex "muscle" that requires constant training and practice. This means regular disaster recovery tests, war games, and chaos engineering exercises that simulate unplanned outages, both minor and major. These simulations help identify weaknesses, refine procedures, and ensure that both automated systems and human operators can respond effectively when real incidents occur. The absolute worst-case scenario for any organisation is to discover, right in the middle of an outage, that a hot-cold deployment estimated to come online within minutes actually takes hours to become ready.
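On a much smaller scale, the same muscle can be exercised in automated tests. The hypothetical sketch below, using Node's built-in test runner, injects a failing eKYC dependency, behaving as it would during a regional outage, and asserts that onboarding degrades to a queued check rather than failing outright.

```typescript
import { test } from "node:test";
import assert from "node:assert/strict";

type EkycCheck = (customerId: string) => Promise<"verified">;

// Simplified variant of the earlier onboarding sketch, with the eKYC dependency
// injected so a test can swap in a failing stub.
async function verifyIdentity(customerId: string, check: EkycCheck): Promise<"verified" | "pending"> {
  try {
    return await check(customerId);
  } catch {
    return "pending"; // degraded path: queue the check for a retry later
  }
}

test("onboarding degrades to 'pending' when the eKYC provider is down", async () => {
  // Fault injection: the dependency always throws, as it would during a regional outage.
  const failingCheck: EkycCheck = async () => {
    throw new Error("simulated regional outage");
  };

  assert.equal(await verifyIdentity("customer-123", failingCheck), "pending");
});
```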

The cost of fault resilience

Each of the aforementioned strategies for enhancing fault tolerance, from multi-region replication to isolated stand-in systems, comes with inherent costs. These costs are not merely financial; they also encompass increased operational complexity, greater development effort, and potential trade-offs in performance or feature richness. It is crucial to recognise that not every component or sub-system within a platform demands the same level of availability or resilience. A "one-size-fits-all" approach to fault tolerance can lead to unnecessary expenditure and over-engineering.

Instead, a judicious architectural approach requires a careful weighing of the need for each sub-system to be highly available against the cost of ensuring that availability. For instance, a payment processing service might warrant hot-hot multi-cloud replication due to its criticality, while a less frequently accessed analytics reporting module might be sufficiently resilient with simple availability zone replication or even a less robust failover strategy. By adhering to non-negotiable foundational practices like loose coupling and self-healing systems, organizations can then flexibly invest in enhanced resilience only for the most critical parts of their system, optimizing both cost and system effectiveness.

The other impact: team velocity

The discussion of system-level fault tolerance naturally leads us to another non-negotiable principle at Ikigai Digital: empowered teams and well-defined team topologies. Just as systems need to be designed to gracefully handle failures, teams must be structured and enabled to maintain productivity and adapt when critical tooling or dependencies become unavailable.

During the AWS outage, one widely reported issue was a complete breakdown of workplace productivity. Outages in Canva, Signal and many other tools led to entire teams putting down tools for the day. The root cause was deep dependence on centralised tooling, combined with teams that were not empowered to make their own decisions, plan Bs and workarounds.

Empowered teams are given the autonomy to manage and control their own workloads, timelines, and outcomes. This autonomy is crucial during periods of disruption. When a piece of workplace tooling goes offline, an empowered team can reactively shift its focus, prioritize alternative tasks, or even pivot to using different methods or tools without needing extensive top-down approval. This agility minimizes the "blast radius" of a tooling outage on overall team productivity, much like loose coupling limits the impact of a service failure.

Furthermore, the concept of team topologies provides a framework for organizing teams in a way that promotes clear communication, minimizes dependencies, and enhances overall flow. By clearly defining team responsibilities and interfaces, organizations can reduce the cognitive load on individual teams and allow them to operate more independently.

A strong central Paved Road Enablement team plays a vital role in supporting empowered teams. This team ensures that highly available, best-in-class tools and services are available out-of-the-box for all development teams. This "paved road" provides a reliable and efficient default path, but importantly, empowered teams retain the ability to deviate from this paved road where appropriate and necessary for their specific needs, fostering innovation while maintaining a baseline of operational excellence. This combination of empowered teams, clear topologies, and a robust enablement function creates a human-centric layer of fault tolerance, ensuring that even when systems falter, people can continue to work effectively.

Conclusion

The recurring reality of cloud outages, exemplified by the AWS us-east-1 incident, serves as a powerful reminder that robust fault tolerance is no longer a luxury but a fundamental necessity for modern organisations. Achieving this resilience demands a multi-faceted approach, extending beyond technical architecture to encompass organisational structure and team dynamics. By embracing principles like "architecting for failure," prioritising self-healing systems and loose coupling, strategically implementing multi-region and multi-cloud strategies, and diligently testing disaster recovery plans, businesses can significantly reduce their vulnerability to disruptions. Furthermore, fostering empowered, well-structured teams and ensuring the high availability of critical tooling are equally vital in maintaining productivity and agility during inevitable outages. Ultimately, a holistic and proactive commitment to fault tolerance, woven into both technical design and operational practices, is the most effective way to safeguard business continuity and navigate the inherent complexities of the digital landscape.

At Ikigai Digital, we believe resilience is not an optional extra. It is a competitive advantage. Outages like AWS us-east-1 expose just how fragile even the most advanced banks and fintechs can be when fault tolerance is not architected from the start.

Our team helps organisations design and build banks and fintechs that expect failure and recover from it gracefully. Whether it is through self healing architectures, multi region deployment strategies, or cloud agnostic design, we ensure your systems and teams can operate effectively even under pressure.

Let's start the conversation.
Connect with us to explore how your organisation can architect for failure and thrive through it.
