What Business Continuity and Disaster Recovery Must Preserve

Years ago, I helped leaders at a large bank update their company’s business continuity strategy. To them, safeguarding operations was not a contingency plan. It was another way of honoring the trust their customers placed in them.

This unique company, legendary for customer service, also functioned with a command-and-control sensibility that shaped its culture, its operational approach, and the way it engineered systems.

Those same instincts governed its approach to staying resilient when it mattered most.

A Culture Built for Business Continuity

The company maintained a lights-out secondary facility in a remote location, close enough for quick access by air yet distant enough to survive regional events with confidence.

Its plan relied on flying an on-call team there in roughly three hours to bring the site fully into operation after a disaster. That was a remarkable readiness posture, considering that about half of that time was spent in the air.

They also treated operational readiness and system recovery as more than plans that live only in documentation. They trained for it and refined it over time.

As a result, they could regularly demonstrate operational continuity with a level of discipline that is rare in the commercial sector. But even well-rehearsed systems rest on assumptions.

When the Assumptions Failed in a Business Continuity Plan

In September 2001, when U.S. airspace was closed to civilian traffic for days, the organization discovered a critical breakdown rooted in an implicit assumption within the plan: The secondary facility and the operational backup teams were ready, as always, but the company’s planes were not going anywhere.

As a result, the teams responsible for activating the recovery site would not have been able to get there quickly had it been necessary. In effect, a disciplined plan built to handle major incidents proved insufficient, revealing a gap between the plan’s internal logic and the conditions imposed by the real world.

But the lesson here is not about a single plan. It is about the limits of planning itself.

Defining the Limits of Resilience in Business Continuity and Disaster Recovery

Business continuity and disaster recovery strategies promise structure in a landscape that is often defined by uncertainty.

It is important to define what you can, but also to approach planning with a practicality that matches your organization’s risk posture and operating priorities. Recognizing limits and constraints, along with a little humility, also helps a great deal.

A modern airliner is useful as a metaphor when approaching limitations. Everything has been engineered for operational continuity (e.g., engines, hydraulic systems, layered flight computers).

The two wings, however, are not there for redundancy. They are structural, and a plane cannot fly without both. Modern digital systems have their own structural dependencies, though they are often less visible.

Failure at Internet Scale: Cloud Outages and Cascading Risk

Recently, there were three internet-scale disruptions caused by content delivery network and cloud outages at Cloudflare, AWS, and Azure.

In the Cloudflare and Azure incidents, small internal configuration or metadata changes, whether made by people or by automation, caused outsized systemic failures that propagated globally. I am not exactly sure of the root cause of the AWS outage.

All three outages showed a familiar risk in modern cloud ecosystems: a failure in one layer (e.g., DNS, control plane, routing logic) can cascade through many dependent services. Often, infrastructure and observability tools themselves share dependency chains with production traffic.

Because hyperscale clouds and content delivery networks underpin so much of the internet, outages often ripple far beyond the provider’s own platforms, impacting SaaS applications, enterprise platforms, consumer services, and even other cloud providers.
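One way to make that cascade visible is to walk a dependency map and list everything that sits downstream of a single failed layer. The following is a minimal sketch, not a description of any provider's architecture: the service names and the DEPENDS_ON map are hypothetical, chosen only to show how monitoring and a status page can end up inside the same blast radius as production traffic.

```python
from collections import deque

# Hypothetical dependency map (an assumption for illustration):
# each service lists the components it depends on directly.
DEPENDS_ON = {
    "dns": [],
    "control_plane": ["dns"],
    "cdn": ["dns", "control_plane"],
    "monitoring": ["cdn"],
    "checkout_api": ["cdn"],
    "status_page": ["cdn", "monitoring"],
    "batch_reports": [],
}


def blast_radius(depends_on, failed):
    """Return every service that depends on `failed`, directly or transitively."""
    # Invert the map so we can ask "who depends on this component?"
    dependents = {}
    for service, deps in depends_on.items():
        dependents.setdefault(service, set())
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed component.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


if __name__ == "__main__":
    print(sorted(blast_radius(DEPENDS_ON, "dns")))
```

In this made-up map, a DNS failure reaches the monitoring and status page entries as well as the customer-facing ones, while the isolated batch job is untouched. A real inventory would be far larger and harder to keep current, which is part of the point.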

The appeal of planning for these types of outages is understandable. Accommodating them through system engineering is another story. An expensive one.

This distinction helps clarify what business continuity and disaster recovery are actually responsible for so that we can situate and properly address them strategically, architecturally, and operationally.

What We Mean by Business Continuity and Disaster Recovery

I think it is best to draw on some industry-standard definitions as we get started with this article series, and there are, of course, many to choose from.

I prefer simplicity over comprehensiveness in definitions, avoiding more normative, complicated interpretations where possible, and NIST leads this time:

Business Continuity is the ability to maintain essential functions during and after a disruptive event. [NIST SP 800-34]

Disaster Recovery is the process of restoring IT operations after a disruption. [NIST SP 800-34]

Those distinctions matter most when systems fail completely.

What Business Continuity Actually Preserves When Systems Fail

To illustrate the differences with a pure-play tech company example (and a little hyperbole), consider what would happen to a hyperscaler, a business where technology is the business, in the event of a total global outage.

This disruption means the product is unavailable, customer environments and workloads are offline, trust and contractual exposure are at risk, and revenue generation from consumption-based services is a complicated matter. But the company does not cease to exist. At least not yet.

At this point, where service delivery is completely interrupted, business continuity is about preserving the enterprise as a going concern, managing customer communication and obligations, maintaining legal, regulatory, and financial operations, coordinating recovery efforts at organizational scale, and preventing permanent loss of customers, data, or the enterprise itself.

This is where disaster recovery takes on a more specific and constrained meaning.

What Disaster Recovery Is Responsible For

The disaster recovery approach determines how systems are restored, in what order, with what service levels, and under what constraints.
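The "in what order" part of that sentence can be sketched as a small ordering problem. The example below is an illustration under stated assumptions, not a recommended method: the system names, their dependencies, and the rto_hours targets are all hypothetical. It groups systems into recovery waves so that nothing is restored before the things it needs, and within a wave it lists the tightest recovery target first. It uses graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical inventory (an assumption for illustration): what each system
# needs before it can be restored, and its target recovery time in hours.
SYSTEMS = {
    "identity": {"needs": [], "rto_hours": 1},
    "database": {"needs": [], "rto_hours": 2},
    "payments": {"needs": ["identity", "database"], "rto_hours": 4},
    "customer_portal": {"needs": ["identity", "payments"], "rto_hours": 8},
    "analytics": {"needs": ["database"], "rto_hours": 48},
}


def recovery_waves(systems):
    """Group systems into waves; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(
        {name: spec["needs"] for name, spec in systems.items()}
    )
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    waves = []
    while sorter.is_active():
        # Everything whose dependencies are already restored is ready now.
        ready = sorted(
            sorter.get_ready(),
            key=lambda name: (systems[name]["rto_hours"], name),
        )
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(recovery_waves(SYSTEMS), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

The ordering is mechanical once the inputs exist. Deciding which systems belong in the inventory, and what recovery target each deserves, is the part the code cannot do, which is where business continuity has to set the terms.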

If disaster recovery fails, the company may fail. But disaster recovery still does not define what survival means, at least in theory.

Succinctly, business continuity defines how the enterprise survives disruption, while disaster recovery defines how technology is restored to support that survival.

To do that, you have to take a hard look at what matters, and, within that, make decisions, prioritize, invest, and, as a matter of practice, operate.

How Purpose Becomes Strategic, Concrete, and Operational in Business Continuity, Disaster Recovery, and Cyber Recovery

This article is the first in a five-part series. The next, Defining Survival: How First Principles Shape Business Continuity and Disaster Recovery, explores how recovery starts with purpose and policy, not technology.

The Computer Is Going to Do Something
Join me in an ongoing, practical examination of enterprise architecture, systems engineering, and technology operations.

Notes:
1. Headline image generated by Gemini and ChatGPT; inspired by the “Bordering States” visual from The Simpsons Movie (2007).
