The Systems That Make Recovery Possible: Foundational Services, State, and Dependencies

Previous articles in this series established that disaster recovery is fundamentally a systems engineering problem.

By the time disruption arrives, recoverability has largely already been decided. It is the cumulative result of decisions made long before the first failure occurs.

In practice, those decisions define the technical substrate from which recovery must occur.

Disaster recovery typically depends on a collection of foundational services, state management systems, and system-dependency relationships that enable useful work to resume.

Foundational Services That Enable Disaster Recovery

Most infrastructure and systems depend upon a relatively small set of foundational services that provide identity, access, trust, communication, and coordination.

As a result, the availability or restoration of these services is often the first determinant of survivability in a disaster recovery scenario.

Identity Services

Identity services establish who users, applications, and systems are and what they are permitted to do. Examples include Active Directory, Entra ID, Okta, and LDAP.

Resilience

Fortifying identity services typically involves redundancy, replication, privileged access controls, and protected administrative boundaries.

Recovery

Recovering identity services involves restoring authentication, authorization, service account functionality, and administrative access to dependent systems.

Identity services frequently serve as a prerequisite for both system operation and system recovery.

Name Resolution Services

Name resolution services enable systems to locate one another. Examples include DNS, internal naming services, and cloud-based service discovery mechanisms.

Resilience

Name resolution services achieve resilience through redundant authoritative servers, replication of naming data, geographic distribution, and fault isolation between dependent services.

Recovery

Recovery focuses on restoring authoritative naming data and the communication paths that allow systems to locate one another.

Name resolution failures often appear to be application or network failures even when the underlying systems remain operational.
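One way to tell these cases apart during triage is to test resolution and reachability as separate steps. The sketch below is a minimal illustration rather than a diagnostic tool. The `classify` function and its injected `resolve` and `connect` callables are hypothetical names introduced here so the logic can be exercised without a live network; `system_resolve` and `system_connect` are thin wrappers over Python's standard `socket` module.

```python
import socket
from typing import Callable


def classify(host: str, port: int,
             resolve: Callable[[str], str],
             connect: Callable[[str, int], None]) -> str:
    """Report which layer failed: 'name-resolution', 'connectivity', or 'ok'."""
    try:
        address = resolve(host)
    except OSError:  # socket.gaierror is a subclass of OSError
        return "name-resolution"
    try:
        connect(address, port)
    except OSError:  # refused, unreachable, or timed out
        return "connectivity"
    return "ok"


def system_resolve(host: str) -> str:
    """Resolve a host name to an IPv4 address using the system resolver."""
    return socket.gethostbyname(host)


def system_connect(address: str, port: int) -> None:
    """Open and immediately close a TCP connection to prove reachability."""
    with socket.create_connection((address, port), timeout=3):
        pass
```

Calling `classify(host, port, system_resolve, system_connect)` against a service that resolves but refuses connections points at the network or the service itself, while a "name-resolution" result points at the naming layer even though the application may report a generic connection error.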

Trust Services

Trust services establish the cryptographic relationships that allow systems to identify one another, verify authenticity, and communicate securely. Examples include certificate authorities, public key infrastructure (PKI), secrets management platforms, and key management services.

Resilience

Trust services achieve resilience through protected trust anchors, managed key lifecycles, secure storage of cryptographic material, and controlled distribution of certificates and secrets.

Recovery

Recovery focuses on restoring trust chains, certificate issuance, cryptographic keys, and secure communication between dependent systems.

Trust services enable systems to establish authenticated relationships with one another.

Time Services

Time services provide a consistent understanding of time across distributed systems. Examples include NTP servers and time synchronization hierarchies.

Resilience

Time services achieve resilience through multiple authoritative time sources, drift monitoring, redundant synchronization paths, and fault isolation between time hierarchies.

Recovery

Recovery focuses on restoring clock synchronization, validating time-dependent operations, and re-establishing a consistent temporal reference across dependent systems.

Accurate time provides the temporal consistency required for authentication, transaction ordering, logging, replication, and distributed coordination.
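Validating time after a recovery can be as simple as comparing each host's reported clock against a trusted reference and flagging anything beyond a tolerance. The sketch below assumes the timestamps have already been collected by some other means; the function names and host names are hypothetical. The five-minute tolerance in the usage note reflects the common default clock-skew allowance in Kerberos, but the right threshold depends on the environment.

```python
from datetime import datetime, timedelta


def clock_offsets(reference: datetime,
                  reported: dict[str, datetime]) -> dict[str, float]:
    """Offset of each host's clock from the reference, in seconds."""
    return {host: (stamp - reference).total_seconds()
            for host, stamp in reported.items()}


def hosts_out_of_tolerance(reference: datetime,
                           reported: dict[str, datetime],
                           tolerance: timedelta) -> list[str]:
    """Hosts whose absolute offset strictly exceeds the tolerance."""
    limit = tolerance.total_seconds()
    offsets = clock_offsets(reference, reported)
    return sorted(host for host, offset in offsets.items()
                  if abs(offset) > limit)
```

With a tolerance of `timedelta(minutes=5)`, a restored host whose clock is six minutes behind would be flagged before it begins failing authentication for reasons that look unrelated to time.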

Connectivity Services

Connectivity services provide the communication paths through which systems exchange information. Examples include routing infrastructure, switching platforms, firewalls, load balancers, and SD-WAN environments.

Resilience

Connectivity services achieve resilience through redundancy, path diversity, segmentation, and fault isolation.

Recovery

Recovery focuses on restoring communication paths, routing, and connectivity between dependent systems.

Connectivity services enable the communication required for coordination, state synchronization, and recovery.

Control Planes

Control planes manage, deploy, monitor, and coordinate the operation of other systems. Examples include cloud management platforms, virtualization environments, container orchestration systems, configuration management platforms, CI/CD systems, and observability platforms.

Resilience

Control planes achieve resilience through redundancy, privileged access protection, administrative isolation, and independent recovery procedures.

Recovery

Recovery focuses on restoring the management, automation, deployment, and observability capabilities required to recover dependent systems.

Unlike most foundational services, control planes are often both subjects of recovery and tools used to perform it.

State Is As Important As Infrastructure

An enterprise’s systems create, maintain, and exchange state while performing work. That state records where things are, what has happened to them, and what should happen next.

Taken together, this state forms the organization’s digital representation of reality. Account balances, inventory levels, airline reservations, workflow progress, transaction histories, and operational configurations are all examples.

As a result, disaster recovery involves restoring not only infrastructure but also the state required for systems to resume useful work.

Understanding where state resides, how it is protected, and how it is restored is therefore a fundamental aspect of disaster recovery engineering.

Storage Systems

Storage systems preserve the persistent information upon which applications and stateful services depend. Examples include storage arrays, object stores, file systems, and cloud storage platforms.

Resilience

Storage systems achieve resilience through redundancy, replication, snapshots, integrity validation, and protection against corruption.

Recovery

Recovery requires restoring persistent data, validating integrity, and re-establishing access to dependent systems and services.

Storage systems provide the durable foundation upon which most organizational state resides.
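Integrity validation after a restore is often implemented by comparing content hashes against a manifest captured while the data was known to be good. The following is a minimal sketch of that idea for a directory tree, assuming SHA-256 is an acceptable digest; the function names are hypothetical and real storage platforms typically provide their own verification mechanisms.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {path.relative_to(root).as_posix(): sha256_of(path)
            for path in sorted(root.rglob("*")) if path.is_file()}


def verify_restore(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the manifest entries that are missing or altered under root."""
    current = build_manifest(root)
    return sorted(name for name, digest in manifest.items()
                  if current.get(name) != digest)
```

An empty result from `verify_restore` means every file recorded in the manifest is present with matching content. It does not detect extra files, and it is only as trustworthy as the manifest itself, which should be stored separately from the data it describes.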

Databases

Databases are often the authoritative source of organizational state. They maintain customer records, transactions, inventory, financial information, operational configuration, and countless other forms of persistent data.

Resilience

Databases achieve resilience through replication, backup strategies, consistency validation, and protection against corruption.

Recovery

Recovery involves restoring data, validating integrity, reconciling inconsistencies, and re-establishing normal processing.

A database that is available but contains incorrect information can be more disruptive than a database that is unavailable.

Transaction Managers

Transaction managers coordinate state changes that must be completed reliably across multiple operations, systems, or services. Examples include payment processing, order fulfillment, reservation systems, and financial transfers.

Their purpose is to ensure that completed work remains completed and incomplete work can be identified and resolved appropriately.

Resilience

Transaction managers achieve resilience through durable transaction logs, checkpointing, redundancy, and recovery mechanisms.

Recovery

Recovery focuses on determining which transactions completed successfully, which did not, and how unfinished work can be resolved without introducing duplication or inconsistency.

Transaction managers preserve correctness during failure.
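The core of that determination can be illustrated by scanning a durable log and sorting transactions by their last recorded outcome. The sketch below assumes a deliberately simplified log with only BEGIN, COMMIT, and ABORT events; the `LogRecord` type and `classify_transactions` function are hypothetical, and real transaction managers use considerably richer protocols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    txn_id: str
    event: str  # one of "BEGIN", "COMMIT", "ABORT" in this simplified model


def classify_transactions(log: list[LogRecord]) -> dict[str, list[str]]:
    """Group transaction IDs by outcome after replaying the log in order."""
    state: dict[str, str] = {}
    for record in log:
        if record.event == "BEGIN":
            # A transaction with no later outcome record remains in doubt.
            state.setdefault(record.txn_id, "in-doubt")
        elif record.event == "COMMIT":
            state[record.txn_id] = "committed"
        elif record.event == "ABORT":
            state[record.txn_id] = "aborted"
    result: dict[str, list[str]] = {"committed": [], "aborted": [], "in-doubt": []}
    for txn_id, status in state.items():
        result[status].append(txn_id)
    return result
```

The "in-doubt" group is the work that recovery must resolve explicitly: each of those transactions began but has no recorded outcome, so it must be either completed or rolled back before normal processing resumes.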

Message Queues

Message queues allow systems to exchange work asynchronously. Rather than requiring every system to be available at the same moment, queues preserve work until dependent systems are able to process it.

Resilience

Message queues achieve resilience through persistence, replication, ordering guarantees, and retention policies.

Recovery

Recovery focuses on restoring queue state, replaying work where appropriate, preventing message loss, and avoiding duplicate processing.

Message queues allow systems to continue accepting work even when downstream systems are unavailable.
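Replaying a queue after recovery tends to redeliver some messages that were already handled, so consumers are commonly made idempotent by tracking message identifiers. A minimal sketch of that pattern follows. The `IdempotentConsumer` class is hypothetical, and it keeps its record of processed IDs in memory purely for illustration; in a real system that record would need to be durable and updated together with the handler's side effects.

```python
from typing import Any, Callable


class IdempotentConsumer:
    """Apply each message at most once, keyed by message ID."""

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self._handler = handler
        self._processed: set[str] = set()  # assumption: in-memory only

    def deliver(self, message_id: str, payload: Any) -> bool:
        """Process the message unless it was seen before.

        Returns True if the handler ran, False for a duplicate.
        """
        if message_id in self._processed:
            return False
        self._handler(payload)
        # Record the ID only after the handler succeeds, so a failed
        # attempt can be retried on the next delivery.
        self._processed.add(message_id)
        return True
```

With this in place, replaying the full queue after a restore is safe from the consumer's perspective: messages handled before the failure are skipped, and only outstanding work is applied.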

Workflow and Process Engines

Workflow systems coordinate long-running business activities whose execution spans multiple participants, systems, and points in time.

Unlike transactions that complete in seconds, workflows may span hours, days, weeks, or even longer.

Resilience

Workflow systems achieve resilience through durable state storage, checkpointing, and process tracking.

Recovery

Recovery focuses on resuming work from known points without restarting entire processes.

Workflow systems preserve process state across extended business activities, enabling long-running processes to survive interruptions without losing progress.
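The idea of resuming from a known point can be sketched with a checkpoint file that records each completed step. This is a simplified illustration, not how any particular workflow engine works: the `run_workflow` function, the JSON checkpoint format, and the step names in the usage note are all assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Callable


def run_workflow(steps: list[tuple[str, Callable[[], None]]],
                 checkpoint: Path) -> list[str]:
    """Run the steps not yet recorded in the checkpoint file.

    Returns the names of the steps executed during this call.
    """
    done: list[str] = (json.loads(checkpoint.read_text())
                       if checkpoint.exists() else [])
    executed: list[str] = []
    for name, action in steps:
        if name in done:
            continue  # completed before the interruption
        action()  # an exception here leaves the checkpoint unchanged
        done.append(name)
        checkpoint.write_text(json.dumps(done))  # persist progress per step
        executed.append(name)
    return executed
```

If a hypothetical three-step process of "reserve", "charge", and "ship" fails during "charge", a second call with the same checkpoint skips "reserve" and continues from "charge". A production implementation would also need atomic checkpoint writes and steps that tolerate being retried.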

Recovery Follows System Dependencies

Systems are rarely recovered as individual components. They are recovered as collections of interdependent capabilities that rely on one another to function.

As a result, recovery typically follows the same dependency relationships that support normal operations.

Dependency Chains

Dependency relationships often form chains in which one capability relies upon another.

For example, a typical dependency chain might look like this:

Identity services depend on network connectivity.
Applications depend on identity services.
Business processes depend on applications.

State often depends on databases, transaction managers, queues, and storage systems that must be restored before useful work can resume.

Dependency relationships are not always obvious:

An application may depend on DNS, Active Directory, certificates, storage, transaction logs, external services, and multiple databases.

Many of these dependencies may not appear in operational documentation (e.g., runbooks) because they are ambient during routine operations: always present and therefore taken for granted.

Disaster recovery activities frequently expose these relationships, forcing organizations to rebuild them explicitly as part of a recovery effort.
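Making those relationships explicit ahead of time can start with a simple map of direct dependencies, from which the full set of indirect ones is derived. The sketch below walks such a map to list everything a service ultimately relies on. The `transitive_dependencies` function and the service names in the usage note are hypothetical.

```python
def transitive_dependencies(service: str,
                            depends_on: dict[str, list[str]]) -> set[str]:
    """Return every direct and indirect dependency of `service`."""
    seen: set[str] = set()
    stack = list(depends_on.get(service, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(depends_on.get(current, ()))
    return seen
```

Given a map in which a hypothetical "order-app" lists only "identity" and "orders-db" as direct dependencies, the result also includes whatever those two rely on, such as DNS, storage, and network connectivity. Those indirect entries are the ones most likely to be missing from documentation.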

Recovery Sequencing

These dependencies create a natural recovery sequence.

Foundational capabilities are restored first, followed by the services that depend upon them, and finally the business capabilities those services enable.

The exact sequence varies by environment, but the principle remains consistent: systems are recovered according to dependency relationships rather than organizational ownership or technology domains.

A typical enterprise recovery sequence may include:

Network connectivity and communications
Identity and access services
Trust and certificate services
Storage platforms
Databases and persistent state
Transaction and messaging systems
Applications and integration services
Business processes and user access

Each stage enables the stages that follow.
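A sequence like this can be derived from a dependency map rather than maintained by hand. The following sketch uses a topological sort from Python's standard `graphlib` module (3.9+) to group components into stages, where everything in a stage can be recovered once the earlier stages are complete. The `recovery_stages` function and the component names in the usage note are hypothetical.

```python
from graphlib import TopologicalSorter


def recovery_stages(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group components into ordered stages based on their dependencies.

    `depends_on` maps each component to the components it requires.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    stages: list[list[str]] = []
    while sorter.is_active():
        # Everything returned here has all of its requirements satisfied.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages
```

For a map in which identity and storage require the network, the database requires storage and identity, and an application requires the database and PKI, the first stage contains only the network and the last contains only the application. Components within a stage have no dependencies on one another and could, in principle, be recovered in parallel.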

Whether we like it or not, disaster recovery sequencing emerges from system structure, not from organizational charts or support team boundaries.

External Dependencies

Not all dependencies reside within an organization’s control.

Many business capabilities depend on external services such as payment networks, cloud providers, SaaS platforms, telecommunications services, and partner integrations.

These services may influence recovery outcomes without being directly recoverable by the organization.

Disaster recovery planning must therefore account for both internal and external dependency chains.

Recoverability Engineering

Earlier in the Architecture Blueprints series, Constructing the Baseline Architecture Blueprint established the structural model used to describe enterprise systems and their relationships. That same dependency model forms the foundation upon which disaster recovery depends.

The next challenge is understanding how organizations engineer those systems to tolerate failures, preserve state, and continue operating, even while recovery activities remain underway.

We explore those engineering considerations in Engineering Recoverable Systems.

The Computer Is Going to Do Something – Join an ongoing, practical examination of technology strategy, enterprise architecture, systems engineering, and technology operations.

Notes:
1. Headline image generated by ChatGPT; inspired by Airplane! (1980).
