Early in the COVID pandemic, organizations rapidly transformed offices into software-based workplaces. In the process, something quietly revealing was happening inside the major public cloud platforms.
This new operating model depended on remote access, including virtual desktops and collaboration platforms, while also stressing the infrastructure required to sustain it.
But the cloud platforms did not simply scale. Capacity was constrained, and providers managed those limits through techniques like throttling, which occasionally reduced service availability to zero, especially in smaller global regions.
But nothing was broken. No contracts were violated.
The cloud platforms were behaving exactly as designed. For those who remember the early days of AWS, this moment recalled a time when capacity reservations were about availability, not cost.
What became visible was not scale, but the limits of how availability had been understood.
When Availability Assumptions Break Under Stress
The cloud platforms were not failing. What failed were assumptions customers had made about what survival means once cloud-based systems are under stress.
Those assumptions were often architectural. Many organizations implicitly equate availability with resilience, and resilience with business continuity and disaster recovery.
In practice, though, availability is neither binary nor evenly distributed. Systems survive disruptions through specific availability patterns, such as single-site high availability, active-passive failover, active-active operation, or deliberate degradation.
Each of these patterns carries different entailments for business continuity and disaster recovery, whether they are acknowledged or not.
Understanding what this means requires stepping back from architectural patterns and examining the conditions that make survival possible in the first place.
First Principles for Business Continuity and Disaster Recovery
In the initial article in this series, What Must Survive: A Practical Look at Business Continuity in a World That Never Stops Failing, we focused on outcomes. What matters here are the conditions those outcomes depend on, what might be called first principles.
In business continuity and disaster recovery, first principles define survival. They are not plans, practices, or models. Rather, they expose what must be true for survival to be possible.
These principles do not predict success or failure. They define the space in which business continuity and disaster recovery become credible and reliable strategies, which should, of course, lead to success.
Business Continuity: First Principles of Survival
Business continuity is not, at its core, about systems. It is about whether the organization itself can persist when conditions no longer resemble normal operations.
When structure, information, and coordination degrade, business continuity is measured by the organization’s ability to continue acting with intent.
Business continuity defines the conditions that allow an organization to function in the absence of normalcy. It focuses on preserving decision-making, authority, and essential obligations while disruption is unfolding, before stabilization or restoration can be assumed.
These conditions can be understood as three first principles that define the boundaries of an organization’s business continuity strategy:
1. Business continuity preserves the minimum viable business, not the full business.
In a crisis, completeness is optional. Survival is not.
Business continuity exists to sustain the smallest set of outcomes required for viability, including financial, legal, and operational concerns. Everything else is negotiable, deferrable, or expendable, though that is easier said than done.
A business continuity plan that assumes all functions must remain available misunderstands disruption. And defining the minimum viable business is often the hardest part of planning for large organizations.
2. Business continuity assumes people, facilities, and suppliers fail alongside technology.
Business continuity plans that focus primarily on technology systems are incomplete by definition.
Real disruptions affect staffing availability, physical access, logistics, vendors, regulators, and customers simultaneously. COVID, for instance, did not simply increase system demand. It removed people from offices, scrambled childcare, constrained travel, and altered customer behavior almost overnight.
Business continuity should treat human, organizational, and third-party dependencies as first-class failure domains, not as footnotes or vague assumptions.
3. Business continuity decisions are time-bound and degradable.
What the business must preserve changes over time.
The minimum viable business at hour one is not necessarily the same as at day three, week two, or month one. Processes may shift from automated to manual. Cybersecurity controls may need to be relaxed temporarily and accounted for differently than under normal operations. Precision may be traded for speed, then reclaimed later.
Business continuity is not binary. It is staged, conditional, and likely imperfect. Plans that do not define how the business degrades over time are likely not sufficient business continuity strategies.
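One way to make staged degradation explicit is to write the stages down as data. The following is a minimal sketch, not a planning tool: the stage labels, time thresholds, and business functions are all hypothetical, chosen only to show a minimum viable business that changes as a disruption ages.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Stage:
    label: str
    starts_at: timedelta              # elapsed time since the disruption began
    must_preserve: frozenset          # outcomes that must keep working in this stage
    accepted_degradations: tuple      # compromises explicitly allowed in this stage


# Hypothetical stages; a real plan would derive these from the business itself.
STAGES = (
    Stage("first hours", timedelta(0),
          frozenset({"payroll", "emergency_communications"}),
          ("manual order entry",)),
    Stage("day three onward", timedelta(days=3),
          frozenset({"payroll", "emergency_communications", "customer_support"}),
          ("reduced support hours",)),
    Stage("week two onward", timedelta(days=14),
          frozenset({"payroll", "emergency_communications",
                     "customer_support", "invoicing"}),
          ()),
)


def current_stage(elapsed: timedelta, stages=STAGES) -> Stage:
    """Return the latest stage whose start time has already been reached."""
    eligible = [s for s in stages if s.starts_at <= elapsed]
    if not eligible:
        raise ValueError("elapsed time precedes the first defined stage")
    return max(eligible, key=lambda s: s.starts_at)
```

Calling `current_stage(timedelta(hours=1))` returns the "first hours" stage, while `current_stage(timedelta(days=20))` returns the stage that adds invoicing back. The point of the structure is that each stage names both what must survive and which degradations are accepted, so neither is left implicit.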
Whereas business continuity governs organizational survival during disruptions, disaster recovery governs how technology systems return to trusted operation afterward.
Disaster Recovery: First Principles of System Restoration
Disaster recovery is often described as restoration after failure, but that description is incomplete. It really concerns how systems return to trusted operation following disruptions, whether through restoration or architectures designed to recover in flight.
In either case, disaster recovery governs how technical capabilities re-establish integrity, order, and availability after normal operation has already broken down.
Disaster recovery therefore addresses recovery under constraint and design intent. It defines how systems are restored, resumed, or continued in the correct sequence, within acceptable timeframes, and with known levels of trust.
The effectiveness of disaster recovery is measured not only by speed, but also by whether recovered systems reliably support the surviving organization.
These requirements can be understood, again, as three first principles that define the conditions under which disaster recovery can succeed:
1. Disaster recovery is ordered, not parallel.
Systems do not recover all at once, regardless of platform maturity or automation sophistication. Recovery follows a dependency-aware sequence across infrastructure, identity, data, and applications.
During recovery events, many organizations discover that even when some systems and resources become available, they aren’t necessarily the right ones.
For example, even in active-active environments that span multiple sites, sequencing still applies. Control planes, identity systems, data replication paths, and consistency models impose order whether acknowledged or not.
Availability without recoverability can prove fragile. It is easy to fall into a trap where, for instance, active-active systems remain nominally available while losing the ability to authenticate users, provision sessions, or reconcile state because the restoration of critical infrastructure was not sequenced properly.
Disaster recovery is governed less by raw speed than by correct order. Recovering the wrong system first can often lengthen recovery across everything that follows.
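Dependency-aware sequencing of this kind can be sketched as a topological sort. The example below is a minimal illustration using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The system names and their dependencies are hypothetical, and a real recovery plan would also have to account for partial failures and for checks between waves.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each system lists what must be restored before it.
DEPENDS_ON = {
    "network": set(),
    "identity": {"network"},
    "database": {"network", "identity"},
    "collaboration": {"identity"},
    "virtual_desktops": {"identity", "database"},
}


def recovery_waves(depends_on):
    """Group systems into ordered waves; systems within a wave have no
    unmet dependencies on each other and could be restored together."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(recovery_waves(DEPENDS_ON), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

With this map, the network comes first, identity second, the database and collaboration services third, and virtual desktops last. A circular dependency surfaces as an error during planning rather than as a stall in the middle of a recovery.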
2. Disaster recovery prioritizes data integrity over recovery speed.
A fast disaster recovery approach that produces inconsistent, corrupted, or unreconciled data is probably more of a deferred failure than an exercise in recovery.
Under pressure, teams are tempted to prioritize system availability over correctness. History shows this tradeoff rarely ends well. Data errors introduced during disaster recovery tend to surface later, when context has faded and remediation is costly.
This tradeoff is most acute in multi-site and active-active architectures, where latency, replication lag, and conflict resolution mechanisms quietly determine whether recovery preserves truth or merely preserves motion.
Any disaster recovery approach should treat data as a preserved system of record for truth, not merely another component to be brought online.
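As a rough illustration of putting integrity ahead of speed, the sketch below refuses to promote a replica unless two checks pass. Everything here is an assumption made for the example: the `ReplicaState` type, the lag threshold, and the idea that both sides can produce a key/value snapshot at the same checkpoint with string keys. Real replication systems expose this information in their own ways.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaState:
    records: dict        # key -> value snapshot taken at an agreed checkpoint
    lag_seconds: float   # how far the replica trailed the system of record


def digest(records: dict) -> str:
    """SHA-256 digest of a snapshot, independent of insertion order."""
    hasher = hashlib.sha256()
    for key in sorted(records):
        hasher.update(repr((key, records[key])).encode("utf-8"))
    return hasher.hexdigest()


def integrity_gate(checkpoint_digest: str, replica: ReplicaState,
                   max_lag_seconds: float) -> list:
    """Return the reasons not to promote the replica; empty means it passed."""
    problems = []
    if replica.lag_seconds > max_lag_seconds:
        problems.append(
            f"replication lag {replica.lag_seconds}s exceeds {max_lag_seconds}s")
    if digest(replica.records) != checkpoint_digest:
        problems.append("replica content does not match the checkpoint digest")
    return problems
```

A caller would promote the replica only when `integrity_gate(...)` returns an empty list, and would otherwise hold the failover and reconcile first. The design choice is that a slower, verified recovery is preferred to a fast one that carries unreconciled data forward.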
3. Disaster recovery requires verification, not just automation.
Automation is essential to modern disaster recovery, but automation alone is insufficient.
Infrastructure can often recover quickly, but that does not mean the systems that depend on it are usable, coherent, or safe upon recovery. Under stressed conditions, which is precisely when disaster recovery is typically invoked, control planes may throttle, APIs may fail, and dependencies may behave unpredictably.
Effective disaster recovery requires observability, checkpoints, and explicit validation. A recovery approach that cannot be verified is often indistinguishable from failure itself.
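Explicit validation can be as simple as a set of named checks that must all pass before a recovery is declared complete. The sketch below is one minimal way to express that. The check names are hypothetical, and each check is assumed to be a zero-argument callable that returns True or False or raises an exception.

```python
from typing import Callable, Dict


def verify_recovery(checks: Dict[str, Callable[[], bool]]) -> Dict[str, str]:
    """Run every check and record pass, fail, or the error it raised.

    An exception is recorded rather than propagated, so one broken
    dependency does not hide the results of the remaining checks.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "pass" if check() else "fail"
        except Exception as exc:
            results[name] = f"error: {exc}"
    return results


def is_verified(results: Dict[str, str]) -> bool:
    """Verified only if at least one check ran and every check passed."""
    return bool(results) and all(outcome == "pass" for outcome in results.values())
```

Note that an empty set of checks is treated as unverified. That mirrors the argument above: a recovery that has not been checked cannot be distinguished from one that failed.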
Why First Principles Matter for Business Continuity and Disaster Recovery
The cloud capacity constraints observed during the early stages of COVID were not anomalies. They reflected the ineluctable limits imposed by first principles that had always been present, acknowledged or not.
And organizations that define business continuity and disaster recovery in these terms tend to understand survival as an architectural problem long before they are forced to test it.
Next: Failure Modes and Recovery Paths in Complex Systems
Defining strategy and first principles is only the beginning. The harder work lies in understanding how systems actually break under stress and how disaster recovery unfolds across architectural and technical structures.
That requires examining failure modes, dependency chains, control planes, and recovery paths as they exist in reality, not merely in diagrams. It also requires understanding how availability architectures behave when they are no longer well provisioned, well coordinated, or only politely stressed.
An upcoming article in this series, How Systems Break and How They Recover: The Architecture Behind Business Continuity and Disaster Recovery, will address moving from a defined business continuity strategy to making sure an organization’s systems are capable of achieving it.
But before we do that, we need to establish a base method for expressing architecture in a foundational manner, which we will begin defining in the next article, Why Enterprise Architecture Requires a Baseline Blueprint.
The Computer Is Going to Do Something
Join me in an ongoing, practical examination of enterprise architecture, systems engineering, and technology operations.
Notes:
1. Headline image generated by Gemini and ChatGPT.
