Previous articles in this series established that disaster recovery is fundamentally a systems engineering problem.
By the time disruption arrives, recoverability has largely already been decided. It is the cumulative result of decisions made long before the first failure occurs.
In practice, those decisions define the technical substrate from which recovery must occur.
Disaster recovery typically depends on a collection of foundational services, state management systems, and system-dependency relationships that enable useful work to resume.
Foundational Services That Enable Disaster Recovery
Most infrastructure and systems depend upon a relatively small set of foundational services that provide identity, access, trust, communication, and coordination.
As a result, the availability or restoration of these services is often the first determinant of survivability in a disaster recovery scenario.
Identity Services
Identity services establish who users, applications, and systems are and what they are permitted to do. Examples include Active Directory, Entra ID, Okta, and LDAP.
Resilience
Fortifying identity services typically involves redundancy, replication, privileged access controls, and protected administrative boundaries.
Recovery
Recovering identity services involves restoring authentication, authorization, service account functionality, and administrative access to dependent systems.
Identity services frequently serve as a prerequisite for both system operation and system recovery.
Name Resolution Services
Name resolution services enable systems to locate one another. Examples include DNS, internal naming services, and cloud-based service discovery mechanisms.
Resilience
Name resolution services achieve resilience through redundant authoritative servers, replication of naming data, geographic distribution, and fault isolation between dependent services.
Recovery
Recovery focuses on restoring authoritative naming data and the communication paths that allow systems to locate one another.
Name resolution failures often appear to be application or network failures even when the underlying systems remain operational.
Trust Services
Trust services establish the cryptographic relationships that allow systems to identify one another, verify authenticity, and communicate securely. Examples include certificate authorities, public key infrastructure (PKI), secrets management platforms, and key management services.
Resilience
Trust services achieve resilience through protected trust anchors, managed key lifecycles, secure storage of cryptographic material, and controlled distribution of certificates and secrets.
Recovery
Recovery focuses on restoring trust chains, certificate issuance, cryptographic keys, and secure communication between dependent systems.
Trust services enable systems to establish authenticated relationships with one another.
Time Services
Time services provide a consistent understanding of time across distributed systems. Examples include NTP servers and time synchronization hierarchies.
Resilience
Time services achieve resilience through multiple authoritative time sources, drift monitoring, redundant synchronization paths, and fault isolation between time hierarchies.
Recovery
Recovery focuses on restoring clock synchronization, validating time-dependent operations, and re-establishing a consistent temporal reference across dependent systems.
Accurate time provides the temporal consistency required for authentication, transaction ordering, logging, replication, and distributed coordination.
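Drift monitoring can be reduced to a simple comparison once clock offsets are known. The sketch below assumes the offsets have already been measured by some other means (for example, by an NTP client). The function name, the host names, and the 300-second default are illustrative assumptions. Five minutes is the default clock-skew tolerance in common Kerberos configurations, which is one reason time drift tends to surface as authentication failure.

```python
def hosts_out_of_tolerance(
    offsets_seconds: dict[str, float],
    tolerance_seconds: float = 300.0,  # assumed default; tune per environment
) -> list[str]:
    """Return hosts whose clock offset exceeds the tolerance, in name order.

    offsets_seconds maps a host name to its measured offset from the
    reference clock (positive or negative), in seconds.
    """
    return sorted(
        host
        for host, offset in offsets_seconds.items()
        if abs(offset) > tolerance_seconds
    )
```

During recovery, a check like this would run before time-dependent services such as authentication and replication are brought back online.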
Connectivity Services
Connectivity services provide the communication paths through which systems exchange information. Examples include routing infrastructure, switching platforms, firewalls, load balancers, and SD-WAN environments.
Resilience
Connectivity services achieve resilience through redundancy, path diversity, segmentation, and fault isolation.
Recovery
Recovery focuses on restoring communication paths, routing, and connectivity between dependent systems.
Connectivity services enable the communication required for coordination, state synchronization, and recovery.
Control Planes
Control planes manage, deploy, monitor, and coordinate the operation of other systems. Examples include cloud management platforms, virtualization environments, container orchestration systems, configuration management platforms, CI/CD systems, and observability platforms.
Resilience
Control planes achieve resilience through redundancy, privileged access protection, administrative isolation, and independent recovery procedures.
Recovery
Recovery focuses on restoring the management, automation, deployment, and observability capabilities required to recover dependent systems.
Unlike most foundational services, control planes are often both subjects of recovery and tools used to perform it.
State Is As Important As Infrastructure
An enterprise’s systems create, maintain, and exchange state while performing work. That state records where things are, what has happened to them, and what should happen next.
Taken together, this state forms the organization’s digital representation of reality. Examples include account balances, inventory levels, airline reservations, workflow progress, transaction histories, and operational configurations.
As a result, disaster recovery involves restoring not only infrastructure but also the state required for systems to resume useful work.
Understanding where state resides, how it is protected, and how it is restored is therefore a fundamental aspect of disaster recovery engineering.
Storage Systems
Storage systems preserve the persistent information upon which applications and stateful services depend. Examples include storage arrays, object stores, file systems, and cloud storage platforms.
Resilience
Storage systems achieve resilience through redundancy, replication, snapshots, integrity validation, and protection against corruption.
Recovery
Recovery requires restoring persistent data, validating integrity, and re-establishing access to dependent systems and services.
Storage systems provide the durable foundation upon which most organizational state resides.
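The integrity-validation step can be sketched as a comparison of restored files against checksums recorded at backup time. This is a minimal illustration under stated assumptions, not a description of any particular storage product: the function names `build_manifest` and `verify_restore` and the manifest layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare restored files under root with a manifest taken at backup time."""
    restored = build_manifest(root)
    return {
        # Files the manifest expects but the restore did not produce.
        "missing": sorted(set(manifest) - set(restored)),
        # Files present in both whose contents differ.
        "corrupt": sorted(
            name
            for name in manifest
            if name in restored and restored[name] != manifest[name]
        ),
        # Files the restore produced that the manifest does not know about.
        "unexpected": sorted(set(restored) - set(manifest)),
    }
```

A restore would be treated as validated only when all three lists come back empty. The manifest itself is state and needs to be protected separately from the data it describes.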
Databases
Databases are often the authoritative source of organizational state. They maintain customer records, transactions, inventory, financial information, operational configuration, and countless other forms of persistent data.
Resilience
Databases achieve resilience through replication, backup strategies, consistency validation, and protection against corruption.
Recovery
Recovery involves restoring data, validating integrity, reconciling inconsistencies, and re-establishing normal processing.
A database that is available but contains incorrect information can be more disruptive than a database that is unavailable.
Transaction Managers
Transaction managers coordinate state changes that must be completed reliably across multiple operations, systems, or services. Examples include payment processing, order fulfillment, reservation systems, and financial transfers.
Their purpose is to ensure that completed work remains completed and incomplete work can be identified and resolved appropriately.
Resilience
Transaction managers achieve resilience through durable transaction logs, checkpointing, redundancy, and recovery mechanisms.
Recovery
Recovery focuses on determining which transactions completed successfully, which did not, and how unfinished work can be resolved without introducing duplication or inconsistency.
Transaction managers preserve correctness during failure.
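To make the recovery question concrete, the sketch below replays a durable transaction log and sorts each transaction by its last known outcome. It is a simplified illustration: the `LogRecord` type and the three record types (`BEGIN`, `COMMIT`, `ABORT`) are assumptions made for the example, and real transaction managers record considerably more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    txn_id: str
    event: str  # "BEGIN", "COMMIT", or "ABORT" (assumed record types)


def classify_transactions(log: list[LogRecord]) -> dict[str, list[str]]:
    """Replay the log in order and group transactions by final outcome."""
    outcome: dict[str, str] = {}
    for record in log:
        if record.event == "BEGIN":
            # A transaction is in doubt until a terminal record is seen.
            outcome.setdefault(record.txn_id, "in_doubt")
        elif record.event == "COMMIT":
            outcome[record.txn_id] = "committed"
        elif record.event == "ABORT":
            outcome[record.txn_id] = "aborted"
        else:
            raise ValueError(f"unknown log event: {record.event!r}")

    result: dict[str, list[str]] = {"committed": [], "aborted": [], "in_doubt": []}
    for txn_id, state in outcome.items():
        result[state].append(txn_id)
    return result
```

Transactions that land in `in_doubt` are the unfinished work: each one has to be either completed or rolled back before normal processing resumes.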
Message Queues
Message queues allow systems to exchange work asynchronously. Rather than requiring every system to be available at the same moment, queues preserve work until dependent systems are able to process it.
Resilience
Message queues achieve resilience through persistence, replication, ordering guarantees, and retention policies.
Recovery
Recovery focuses on restoring queue state, replaying work where appropriate, preventing message loss, and avoiding duplicate processing.
Message queues allow systems to continue accepting work even when downstream systems are unavailable.
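Replaying work without duplicate processing usually depends on some form of idempotency. One common approach, sketched below, is to track the identifiers of messages that have already been handled and skip them on replay. The `replay` function, its parameters, and the message shape are hypothetical. The sketch also keeps the processed IDs in memory, whereas a real system would need to store them durably.

```python
from typing import Callable, Iterable


def replay(
    messages: Iterable[tuple[str, dict]],
    processed_ids: set[str],
    handler: Callable[[dict], None],
) -> int:
    """Replay (message_id, payload) pairs, skipping IDs already processed.

    Returns the number of messages actually handled in this call.
    """
    handled = 0
    for message_id, payload in messages:
        if message_id in processed_ids:
            continue  # duplicate delivery: the work was already done
        handler(payload)
        # Record the ID only after the handler succeeds, so a failure
        # mid-handler leads to a retry rather than a lost message.
        processed_ids.add(message_id)
        handled += 1
    return handled
```

The ordering matters: recording the ID before handling risks losing a message, while recording it after handling risks repeating one. The handler should therefore tolerate an occasional repeat.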
Workflow and Process Engines
Workflow systems coordinate long-running business activities whose execution spans multiple participants, systems, and points in time.
Unlike transactions that complete in seconds, workflows may span hours, days, weeks, or even longer.
Resilience
Workflow systems achieve resilience through durable state storage, checkpointing, and process tracking.
Recovery
Recovery focuses on resuming work from known points without restarting entire processes.
Workflow systems preserve process state across extended business activities, enabling long-running processes to survive interruptions without losing progress.
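Resuming from a known point can be illustrated with a small checkpointing loop. The sketch below assumes a workflow is an ordered list of named steps and that progress is recorded in a JSON file after each step. The `run_workflow` function and the file format are illustrative assumptions, not the design of any particular workflow engine.

```python
import json
from pathlib import Path
from typing import Callable


def run_workflow(
    steps: list[tuple[str, Callable[[], None]]],
    checkpoint_file: Path,
) -> list[str]:
    """Run named steps in order, skipping those already checkpointed.

    Returns the names of the steps executed during this call.
    """
    done: list[str] = []
    if checkpoint_file.exists():
        done = json.loads(checkpoint_file.read_text())

    executed: list[str] = []
    for name, action in steps:
        if name in done:
            continue  # completed before the interruption
        action()
        done.append(name)
        # Persist progress after every step so a later failure
        # resumes here instead of restarting the whole process.
        checkpoint_file.write_text(json.dumps(done))
        executed.append(name)
    return executed
```

If a step raises an exception, the checkpoint file still reflects every step that finished before it, so calling `run_workflow` again picks up at the failed step. A crash between a step finishing and its checkpoint being written would repeat that step, so steps should be safe to run twice.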
Recovery Follows System Dependencies
Systems are rarely recovered as individual components. They are recovered as collections of interdependent capabilities that rely on one another to function.
As a result, recovery typically follows the same dependency relationships that support normal operations.
Dependency Chains
Dependency relationships often form chains in which one capability relies upon another.
For example, a typical dependency chain might look like this:
- Identity services depend on network connectivity.
- Applications depend on identity services.
- Business processes depend on applications.
State often depends on databases, transaction managers, queues, and storage systems that must be restored before useful work can resume.
Dependency relationships are not always obvious:
An application may depend on DNS, Active Directory, certificates, storage, transaction logs, external services, and multiple databases.
Many of these dependencies may not appear in operational documentation (e.g., runbooks) because they are simply present, and therefore taken for granted, during routine operations.
Disaster recovery activities frequently expose these relationships, forcing organizations to rebuild them explicitly as part of a recovery effort.
Recovery Sequencing
These dependencies create a natural recovery sequence.
Foundational capabilities are restored first, followed by the services that depend upon them, and finally the business capabilities those services enable.
The exact sequence varies by environment, but the principle remains consistent: systems are recovered according to dependency relationships rather than organizational ownership or technology domains.
A typical enterprise recovery sequence may include:
1. Network connectivity and communications
2. Identity and access services
3. Trust and certificate services
4. Storage platforms
5. Databases and persistent state
6. Transaction and messaging systems
7. Applications and integration services
8. Business processes and user access
Each stage enables the stages that follow.
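The idea that sequence emerges from dependencies can be shown with a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) to group systems into recovery waves, where everything in a wave depends only on earlier waves. The `recovery_waves` function and the example dependency map are hypothetical. A real map would come from the organization's own architecture model.

```python
from graphlib import TopologicalSorter


def recovery_waves(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group systems into waves that can be recovered in order.

    depends_on maps each system to the systems it requires.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves: list[list[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything now unblocked
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical dependency map, for illustration only.
EXAMPLE = {
    "identity": {"network"},
    "trust": {"network", "identity"},
    "storage": {"network"},
    "database": {"storage", "identity"},
    "application": {"database", "identity", "trust"},
    "business process": {"application"},
}

# recovery_waves(EXAMPLE) returns:
# [["network"], ["identity", "storage"], ["database", "trust"],
#  ["application"], ["business process"]]
```

Systems within a wave can be recovered in parallel. A `CycleError` is itself a useful finding: it signals a circular dependency that has to be broken by hand during recovery.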
Disaster recovery sequencing emerges from system structure rather than organizational charts, support teams, or technology domains, whether we like it or not.
External Dependencies
Not all dependencies reside within an organization’s control.
Many business capabilities depend on external services such as payment networks, cloud providers, SaaS platforms, telecommunications services, and partner integrations.
These services may influence recovery outcomes without being directly recoverable by the organization.
Disaster recovery planning must therefore account for both internal and external dependency chains.
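One way to account for external dependency chains is to walk a capability's dependencies transitively and note which of them fall outside the organization's control. The sketch below assumes a dependency map of the form "system → systems it requires" plus a known set of external services. The function name and all example names are hypothetical.

```python
def external_exposure(
    depends_on: dict[str, set[str]],
    external: set[str],
    capability: str,
) -> set[str]:
    """Return the external services a capability relies on, directly or indirectly."""
    seen: set[str] = set()
    stack = [capability]
    while stack:
        node = stack.pop()
        for dep in depends_on.get(node, ()):
            if dep not in seen:  # also guards against circular dependencies
                seen.add(dep)
                stack.append(dep)
    return seen & external
```

For example, if a hypothetical `checkout` capability requires an order service, which in turn requires a payment network and a database hosted on cloud storage, the function reports both the payment network and the cloud storage as external exposure, even though `checkout` names neither directly.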
Recoverability Engineering
Earlier in the Architecture Blueprints series, "Constructing the Baseline Architecture Blueprint" established the structural model used to describe enterprise systems and their relationships. That same dependency model forms the foundation upon which disaster recovery depends.
The next challenge is understanding how organizations engineer those systems to tolerate failures, preserve state, and continue operating, even while recovery activities remain underway.
We explore those engineering considerations in "Engineering Recoverable Systems."
Notes:
1. Headline image generated by ChatGPT; inspired by Airplane! (1980).
