Engineering Recoverable Systems

Sketch illustration inspired by the movie Airplane! showing a man smiling while plugging a power cord into a computer system, symbolizing the misconception that disaster recovery is simply a matter of restoring power.

Previous articles in this series examined the foundational services, state management systems, and dependency relationships that make disaster recovery possible.

This article examines how those systems are engineered to tolerate failures, preserve state, and continue operating when components, services, or environments become unavailable.

Operating Through Disruption

Recovery rarely occurs all at once.

As systems, dependencies, and state are restored, organizations often operate through multiple recovery states that preserve essential capabilities while full functionality is re-established.

These operating states may involve reduced functionality, alternate processing paths, or increased reliance on manual procedures, and each remains in effect only until normal operations can be restored.

Deliberate Degradation

Deliberate degradation allows a system to operate with reduced capability while preserving essential functionality.

Examples include delayed reporting, simplified approval workflows, reduced validation procedures, alternate communication paths, and limited-function operating modes.

Fortification focuses on designing these degraded modes before they are needed.

Recovery focuses on activating them deliberately, understanding their constraints, and returning to normal operation when supporting systems are restored.
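A degraded mode that is designed in advance can be as simple as a table of which capabilities remain available in each operating state. The sketch below is illustrative only: the mode names, capability names, and the `Service` class are assumptions, not part of any particular system.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"


# Hypothetical capability table, decided before any disruption occurs.
CAPABILITIES = {
    Mode.NORMAL: {"accept_orders", "full_validation", "realtime_reporting"},
    Mode.DEGRADED: {"accept_orders"},  # only the essential capability survives
}


class Service:
    def __init__(self):
        self.mode = Mode.NORMAL

    def enter_degraded(self):
        # Activated deliberately during recovery, not as a side effect of failure.
        self.mode = Mode.DEGRADED

    def restore(self):
        # Return to normal operation once supporting systems are back.
        self.mode = Mode.NORMAL

    def allows(self, capability):
        return capability in CAPABILITIES[self.mode]
```

Keeping the table explicit makes the constraints of each mode visible to operators before the mode is activated.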

Manual Fallback

Manual fallback allows people and procedures to temporarily replace or augment automated system functions.

Examples include manual payment approval, manual customer support workflows, offline order capture, emergency access procedures, and operational command bridges.

Fortification focuses on defining procedures, roles, decision rights, evidence capture, and reconciliation methods. Recovery focuses on using those procedures to preserve useful work until automated systems return.
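Evidence capture and reconciliation can be sketched in a few lines. The example below assumes a hypothetical manual ledger of orders taken offline and a dictionary standing in for the restored order system; the field names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ManualLedger:
    """Hypothetical record of work captured by hand during an outage."""
    entries: list = field(default_factory=list)

    def capture(self, order_id, amount, approved_by):
        # Record who approved the work so the decision can be audited later.
        self.entries.append(
            {"order_id": order_id, "amount": amount, "approved_by": approved_by}
        )


def reconcile(ledger, system_orders):
    """Replay manual entries into the restored system, skipping duplicates."""
    applied, skipped = [], []
    for entry in ledger.entries:
        if entry["order_id"] in system_orders:
            skipped.append(entry["order_id"])  # already present, needs review
        else:
            system_orders[entry["order_id"]] = entry["amount"]
            applied.append(entry["order_id"])
    return applied, skipped
```

The skipped list matters as much as the applied list: it identifies work that reached the automated system by another path and must be reviewed rather than entered twice.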

These operating modes allow organizations to preserve capabilities even when the systems that normally provide them have not yet been fully restored.

How Availability Architectures Shape Disaster Recovery

Availability and recovery are closely related because both emerge from the same architectural decisions.

The mechanisms used to tolerate failures during normal operations often determine how systems behave during disruption and how recovery proceeds afterward.

As a result, the choice of availability architecture largely determines the shape of the disaster recovery architecture.

Single-Site High Availability

Single-site high availability focuses on maintaining operations during localized component failures.

Examples include clustered servers, redundant storage controllers, redundant network paths, load-balanced application services, and highly available virtualization platforms.

These architectures are designed to tolerate failures involving individual components while continuing to provide service.

Because redundancy exists within the same facility or fault domain, recovery is often automatic and invisible to users. Failed components are isolated, workloads shift to surviving resources, and normal operations continue.
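The isolate-and-shift behavior can be illustrated with a small round-robin pool that skips nodes marked as failed. This is a simplified sketch, not how any specific load balancer or cluster manager works; the `Pool` class and node names are assumptions.

```python
class Pool:
    """Round-robin selection over the healthy nodes of a single site."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._next = 0

    def mark_failed(self, node):
        # Isolate the failed component; surviving nodes absorb its work.
        self.healthy.discard(node)

    def pick(self):
        # Try each node at most once per call, skipping isolated ones.
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes left in this site")
```

The final `RuntimeError` marks the limit of this approach: once every node in the site is gone, nothing inside the pool can help.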

Single-site high availability improves resilience to component failures but provides limited protection against disruptions affecting the entire facility or operational environment.

Active-Passive Architectures

Active-passive architectures maintain a secondary capability that can assume responsibility when a primary capability becomes unavailable.

Examples include secondary data centers, standby databases, disaster recovery sites, and passive cloud environments.

Unlike single-site high availability, these architectures assume that an entire operational environment may become unavailable. They therefore either maintain a secondary environment capable of continuing operations or retain the ability to provision a software-defined secondary environment on demand.

Recovery occurs through the transfer of responsibility from the primary environment to the secondary environment. Depending on the architecture, this may involve automated failover, controlled activation procedures, or manual recovery processes.

Because only one environment actively processes work at a time, active-passive architectures simplify state management, recovery coordination, and operational governance.

Active-passive architectures trade additional infrastructure and recovery procedures for a clear and predictable recovery path.
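A transfer of responsibility can be sketched as a small controller that promotes the secondary after several consecutive missed health checks. The threshold, the environment names, and the decision to leave failback as a manual step are all assumptions made for illustration.

```python
class FailoverController:
    """Promote the secondary after `threshold` consecutive failed checks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.misses = 0
        self.active = "primary"

    def observe(self, primary_healthy):
        if self.active != "primary":
            # Failback is treated as a separate, deliberate procedure.
            return self.active
        # A single healthy check resets the count, avoiding failover on a blip.
        self.misses = 0 if primary_healthy else self.misses + 1
        if self.misses >= self.threshold:
            self.active = "secondary"
        return self.active
```

Only one environment is ever named as active, which is the property that keeps state management simple in this architecture.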

Active-Active Architectures

Active-active architectures distribute work across multiple operational environments simultaneously.

Examples include multi-site transaction processing systems, distributed applications, globally distributed service platforms, and geographically dispersed cloud workloads.

Unlike active-passive architectures, multiple environments actively participate in normal operations. Workloads may be distributed across locations for performance, scalability, availability, or resiliency purposes.

Because multiple environments are already processing work, recovery is often embedded within normal operations. When a disruption occurs, workloads can be redistributed to surviving environments without requiring a complete transfer of responsibility.

The tradeoff is that state must be coordinated continuously across participating environments. Data synchronization, consistency management, conflict resolution, and operational governance become significantly more complex as a result.

Active-active architectures reduce dependence on any single operational environment while increasing the engineering required to maintain a consistent view of state.
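The coordination cost becomes concrete when two sites update the same value independently. One well-known technique for a narrow class of state is the grow-only counter, a simple conflict-free replicated data type. The article does not prescribe it; the sketch below, with hypothetical site names, only shows what conflict-free merging looks like.

```python
def increment(counter, site, n=1):
    """Return a new counter in which `site` has recorded n more events."""
    updated = dict(counter)
    updated[site] = updated.get(site, 0) + n
    return updated


def merge(a, b):
    """Combine two replicas by keeping the highest count seen per site."""
    return {site: max(a.get(site, 0), b.get(site, 0)) for site in a.keys() | b.keys()}


def value(counter):
    return sum(counter.values())


# Two sites accept work independently, then exchange state.
east = increment({}, "east", 2)
west = increment({}, "west", 3)
merged = merge(east, west)
```

Because each site only writes its own slot and merging takes the maximum, replicas converge regardless of the order in which they exchange state. Most business state is not this forgiving, which is where the additional engineering effort goes.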

Multi-Region Architectures

Multi-region architectures distribute systems, services, and state across geographically separated regions.

Examples include cloud workloads deployed across multiple regions, geographically separated data centers, and globally distributed service platforms.

These architectures are designed with the assumption that regional disruptions can occur. For example, natural disasters, infrastructure failures, telecommunications outages, and cloud provider incidents may affect an entire region while leaving others unaffected.

Recovery is achieved by maintaining the ability to continue operations from unaffected regions. Depending on the architecture, workloads may fail over automatically, be redistributed dynamically, or be activated through controlled recovery procedures.

Geographic separation introduces additional considerations related to latency, state replication, consistency management, governance, and operational coordination.

Multi-region architectures extend recovery planning beyond individual systems and facilities to include regional fault domains and geographic risk.
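Region selection under disruption can be sketched as choosing the healthy region with the lowest latency. The region names, the latency figures, and the `choose_region` function are hypothetical; real routing decisions also weigh replication state and governance constraints.

```python
def choose_region(regions):
    """Pick the healthy region with the lowest observed latency."""
    healthy = [r for r in regions if r["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=lambda r: r["latency_ms"])["name"]


# Hypothetical snapshot: the closest region is currently disrupted.
regions = [
    {"name": "region-a", "healthy": False, "latency_ms": 20},
    {"name": "region-b", "healthy": True, "latency_ms": 85},
    {"name": "region-c", "healthy": True, "latency_ms": 140},
]
```

With this snapshot, traffic moves to `region-b`, at the cost of higher latency than the disrupted region would have provided.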

Engineering for Availability and Disaster Recovery

Availability and disaster recovery emerge from specific engineering characteristics embedded within systems and infrastructure.

These characteristics influence how failures are tolerated, how state is preserved, how disruptions propagate, and how recovery proceeds afterward.

While architectures provide structure, these characteristics determine how systems behave when subjected to disruption.

Redundancy

Redundancy provides alternate resources capable of performing the same function. Examples include multiple servers, network paths, storage systems, and service instances.

The purpose of redundancy is to allow useful work to continue when individual components become unavailable.

Redundancy preserves capability when components fail.
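In code, redundancy often appears as a list of interchangeable resources tried in order. The sketch below assumes each provider is a callable that raises `ConnectionError` when unavailable; that convention is an assumption of the example.

```python
def call_with_redundancy(providers, request):
    """Send the request to the first provider that responds."""
    last_error = None
    for provider in providers:
        try:
            return provider(request)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next resource
    raise RuntimeError("all redundant providers failed") from last_error
```

The caller sees one function call; which resource performed the work is an internal detail.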

Replication

Replication distributes state across multiple locations or systems. Examples include database replication, storage replication, directory synchronization, and distributed caches.

The purpose of replication is to preserve information and make it available beyond the failure of a single component or location.

Replication preserves state beyond the failure of a single system.
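How much state survives depends on how far the replica trails the primary. The sketch below models asynchronous replication with two in-memory lists; the `ReplicatedLog` class is hypothetical and ignores networking entirely.

```python
class ReplicatedLog:
    """Primary log with a single asynchronously updated replica."""

    def __init__(self):
        self.primary = []
        self.replica = []

    def write(self, record):
        # Writes are acknowledged by the primary before the replica sees them.
        self.primary.append(record)

    def ship(self, max_records=None):
        """Copy pending records to the replica, optionally only some of them."""
        pending = self.primary[len(self.replica):]
        if max_records is not None:
            pending = pending[:max_records]
        self.replica.extend(pending)

    def lag(self):
        # Records that would be lost if the primary failed right now.
        return len(self.primary) - len(self.replica)
```

The value returned by `lag()` is the amount of state at risk at any moment, which is why replication lag is worth monitoring.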

Isolation

Isolation limits the propagation of failures between systems. Examples include network segmentation, fault domains, availability zones, blast-radius boundaries, and workload separation.

The purpose of isolation is to prevent localized failures from becoming systemic failures.

Isolation constrains the scope of disruption.
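At the software level, one common pattern for limiting failure propagation is the circuit breaker, which stops calling a dependency after repeated failures. The article does not name this pattern; the sketch below is a simplified version with an assumed failure threshold and a manual reset.

```python
class CircuitBreaker:
    """Stop calling a dependency after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.max_failures

    def call(self, func, *args):
        if self.is_open:
            # Fail fast instead of letting callers pile up on a broken dependency.
            raise RuntimeError("circuit open: dependency isolated")
        try:
            result = func(*args)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # a success clears the consecutive-failure count
        return result

    def reset(self):
        # Called once the dependency has been verified as healthy again.
        self.failures = 0
```

Once the breaker opens, the failing dependency is no longer invoked, so its failure stops consuming resources in the systems that depend on it.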

Durability

Durability preserves completed work. Examples include transaction logs, checkpoints, persistent queues, write-ahead logging, and archival storage.

The purpose of durability is to ensure that information survives interruptions and can be recovered accurately afterward.

Durability preserves completed work across interruption and recovery.
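Write-ahead logging can be sketched with a plain file: each change is appended and flushed to disk before it is applied in memory, and the log is replayed on startup. The `DurableStore` class, the JSON line format, and the file path are assumptions for illustration.

```python
import json
import os


class DurableStore:
    """Key-value store that logs every change before applying it."""

    def __init__(self, path):
        self.path = path
        self.state = {}
        self._replay()

    def _replay(self):
        # Rebuild in-memory state from the log after a restart.
        if not os.path.exists(self.path):
            return
        with open(self.path, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    break  # a torn final record means that write never completed
                self.state[entry["key"]] = entry["value"]

    def put(self, key, value):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to persist before acknowledging
        self.state[key] = value
```

A second `DurableStore` opened on the same path recovers every change that was acknowledged before the interruption.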

Observability

Observability provides visibility into system behavior. Examples include monitoring, logging, tracing, telemetry, and health validation.

The purpose of observability is to understand system state, detect failures, and verify recovery outcomes.

Recovery without observability can quickly become guesswork.
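Verifying recovery outcomes can be sketched as a set of named health checks that must all pass before recovery is declared complete. The check names and the two functions below are hypothetical.

```python
def verify_recovery(checks):
    """Run each named check; a check that raises counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def is_recovered(results):
    # An empty result set proves nothing, so it does not count as recovered.
    return bool(results) and all(results.values())
```

Treating an empty set of checks as "not recovered" reflects the point above: the absence of evidence is not evidence of recovery.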

Notes:
1. Headline image generated by ChatGPT; inspired by Airplane! (1980).


