Module 55: High Availability and Resilience

CCSP Domain 5: Cloud Security Operations, Section B

The CCSP exam tests availability as a security objective, not just an operational concern. When the exam presents a resilience question, it is asking you to protect the availability leg of the CIA triad. Every architectural decision about redundancy, failover, and recovery is a security decision.

Availability as a Security Objective

Many candidates think of availability as an operations topic. The CCSP exam disagrees. Availability is one-third of the CIA triad, and attacks against availability — denial of service, ransomware, data destruction — are security incidents. The exam expects you to apply security thinking to availability architecture.

The exam tests availability through two lenses: preventing downtime through architectural resilience, and recovering from downtime through business continuity and disaster recovery.

Cloud Availability Concepts

Availability Zones

Cloud providers organize infrastructure into availability zones: physically separate facilities within a region, typically with independent power, cooling, and networking so that a failure in one zone does not take down another. The exam tests whether you distribute workloads across multiple availability zones to survive single-zone failures. Deploying to a single zone creates a single point of failure that the exam considers unacceptable for production workloads.
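A minimal sketch of the single-zone check, assuming a hypothetical inventory that maps instance IDs to zone names (the IDs and zone names are illustrative, not any provider's API):

```python
from collections import Counter


def instances_per_zone(placements: dict[str, str]) -> Counter:
    """Count instances per zone. Keys are hypothetical instance IDs."""
    return Counter(placements.values())


def survives_single_zone_failure(placements: dict[str, str]) -> bool:
    """True when losing any one zone still leaves at least one instance running."""
    return len(instances_per_zone(placements)) >= 2


# Illustrative inventories, not real deployments.
single_zone = {"web-1": "zone-a", "web-2": "zone-a"}
multi_zone = {"web-1": "zone-a", "web-2": "zone-b", "web-3": "zone-c"}

print(survives_single_zone_failure(single_zone))  # False
print(survives_single_zone_failure(multi_zone))   # True
```

Two instances in the same zone add capacity but not zone-level resilience, which is why the check counts distinct zones rather than instances.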

Regions

Regions are geographically separate collections of availability zones. The exam tests multi-region deployment for disaster recovery and for serving users with low latency across geographies. The exam also tests whether you consider data sovereignty when selecting regions — some jurisdictions restrict where data can be stored.
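Data sovereignty can be treated as a filter applied before any region is chosen for disaster recovery. The sketch below assumes a hypothetical region catalogue and jurisdiction labels; real region names and legal boundaries will differ.

```python
# Hypothetical catalogue: region name -> jurisdiction where the region is located.
REGION_JURISDICTION = {
    "region-eu-1": "EU",
    "region-eu-2": "EU",
    "region-us-1": "US",
    "region-ap-1": "SG",
}


def eligible_dr_regions(primary: str, allowed_jurisdictions: set[str]) -> list[str]:
    """Return regions other than the primary whose jurisdiction is permitted."""
    return sorted(
        region
        for region, jurisdiction in REGION_JURISDICTION.items()
        if region != primary and jurisdiction in allowed_jurisdictions
    )


# Data that must stay in the EU can only fail over to another EU region.
print(eligible_dr_regions("region-eu-1", {"EU"}))  # ['region-eu-2']
```

An empty result signals that the sovereignty requirement rules out multi-region recovery with the current catalogue, which is a business and legal question rather than something the technical team should work around.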

Fault Tolerance vs. High Availability

The exam distinguishes between these concepts. High availability means the system keeps operating with minimal downtime, but may see a brief interruption during failover. Fault tolerance means the system keeps operating with zero interruption because redundant components take over transparently. Fault tolerance is more expensive and complex than high availability because the redundant components must already be running when a failure occurs.
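"Minimal downtime" becomes concrete once an availability percentage is converted into minutes per year. The sketch below is simple arithmetic, not an exam formula; the redundancy calculation assumes replicas fail independently and that failover itself is instantaneous, which real systems only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def annual_downtime_minutes(availability: float) -> float:
    """Expected downtime per year for an availability fraction such as 0.999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(availability: float, replicas: int) -> float:
    """Availability of N redundant replicas, assuming independent failures."""
    return 1.0 - (1.0 - availability) ** replicas


print(round(annual_downtime_minutes(0.999), 1))   # 525.6
print(round(parallel_availability(0.99, 2), 6))   # 0.9999
```

Under those assumptions, two replicas that are each available 99% of the time combine to roughly 99.99%, which is the arithmetic behind spreading workloads across zones.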

Recovery Objectives

The exam tests two critical metrics that drive availability architecture:

Recovery Time Objective (RTO): The maximum acceptable time to restore service after a disruption. Shorter RTOs require more investment in automated failover and standby resources.
Recovery Point Objective (RPO): The maximum acceptable data loss measured in time. An RPO of one hour means you can lose up to one hour of data. Shorter RPOs require more frequent backups or real-time replication.
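The two objectives can be checked separately, as in this minimal sketch. It assumes periodic backups, where a failure just before the next backup loses almost one full interval of data; the RPO and RTO values are illustrative, and in practice they come from the business.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, the worst case loses one full backup interval."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True when the worst-case data loss stays within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(estimated_recovery: timedelta, rto: timedelta) -> bool:
    """True when the estimated time to restore service stays within the RTO."""
    return estimated_recovery <= rto


rpo = timedelta(hours=1)  # illustrative business-defined objective
rto = timedelta(hours=4)  # illustrative business-defined objective

print(meets_rpo(timedelta(hours=24), rpo))     # False: nightly backups can lose a day
print(meets_rpo(timedelta(minutes=15), rpo))   # True
print(meets_rto(timedelta(hours=2), rto))      # True
```

RPO constrains how often data is copied, while RTO constrains how quickly service comes back, so a design can satisfy one and still fail the other.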

Exam trap: RTO and RPO are business decisions, not technical decisions. The exam expects the business to define acceptable RTO and RPO, and the technical team to architect solutions that meet those requirements. If a question asks who defines RTO and RPO, the answer is business stakeholders, not IT.

Cloud Resilience Patterns

Active-active: Workloads run simultaneously in multiple locations. Traffic is distributed across all locations. Provides the lowest RTO and highest availability but is the most complex and expensive pattern.
Active-passive: The primary site handles all traffic. The standby site is ready to take over. Provides fast failover, but the standby site consumes resources without serving traffic.
Pilot light: Minimal core infrastructure runs in the DR site. Full environment is spun up during failover. Lower cost but longer RTO.
Backup and restore: Data is backed up to a secondary location. In a disaster, infrastructure is rebuilt and data is restored. Lowest cost but longest RTO.

The exam tests your ability to match the resilience pattern to the business requirements. A critical financial system with an RTO of 15 minutes needs active-active or active-passive. A development environment can use backup and restore.
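Matching a pattern to a requirement can be sketched as picking the cheapest pattern whose achievable RTO fits the business-defined RTO. The RTO figures and cost ranks below are illustrative assumptions chosen only to preserve the ordering described above; they are not vendor or exam figures.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Pattern:
    name: str
    typical_rto: timedelta  # assumed for illustration only
    relative_cost: int      # 1 = cheapest, 4 = most expensive


PATTERNS = [
    Pattern("backup and restore", timedelta(hours=24), 1),
    Pattern("pilot light", timedelta(hours=4), 2),
    Pattern("active-passive", timedelta(minutes=15), 3),
    Pattern("active-active", timedelta(minutes=1), 4),
]


def cheapest_pattern_for(rto: timedelta) -> Pattern:
    """Return the lowest-cost pattern whose assumed RTO meets the requirement."""
    candidates = [p for p in PATTERNS if p.typical_rto <= rto]
    if not candidates:
        raise ValueError("no listed pattern meets the required RTO")
    return min(candidates, key=lambda p: p.relative_cost)


print(cheapest_pattern_for(timedelta(minutes=15)).name)  # active-passive
print(cheapest_pattern_for(timedelta(days=2)).name)      # backup and restore
```

The selection starts from the requirement and works toward cost, which mirrors the exam's expectation that business objectives drive the architecture rather than the reverse.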

Backup Security

The exam treats backups as a security concern. Backup data must be encrypted at rest and in transit. Backup access must be controlled with credentials that are separate from production, so that a compromised production account cannot also destroy the backups. The exam increasingly tests immutable backups (backups that cannot be modified or deleted for a defined retention period) as protection against ransomware.
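These requirements can be expressed as a small policy check. The sketch below uses a hypothetical BackupPolicy record; the field names are assumptions for illustration and do not correspond to any provider's configuration schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupPolicy:
    encrypted_at_rest: bool
    encrypted_in_transit: bool
    backup_credential_id: str       # hypothetical identifier
    production_credential_id: str   # hypothetical identifier
    immutable_retention_days: int   # 0 means backups can be altered at any time


def backup_policy_findings(policy: BackupPolicy) -> list[str]:
    """Return one finding per backup security requirement that is not met."""
    findings = []
    if not policy.encrypted_at_rest:
        findings.append("backups are not encrypted at rest")
    if not policy.encrypted_in_transit:
        findings.append("backups are not encrypted in transit")
    if policy.backup_credential_id == policy.production_credential_id:
        findings.append("backup access shares credentials with production")
    if policy.immutable_retention_days <= 0:
        findings.append("backups are not immutable for any retention period")
    return findings


weak = BackupPolicy(True, False, "svc-prod", "svc-prod", 0)
strong = BackupPolicy(True, True, "svc-backup", "svc-prod", 30)

print(len(backup_policy_findings(weak)))  # 3
print(backup_policy_findings(strong))     # []
```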

Testing Resilience

The exam expects organizations to test their resilience architecture regularly. Tabletop exercises, failover testing, and chaos engineering are all valid approaches. The exam tests whether you verify that recovery actually works before you need it.
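One narrow part of verifying that recovery works is confirming that restored data matches what was backed up. The sketch below assumes a digest was recorded at backup time and compares it with the restored bytes using Python's standard hashlib module; the sample data is illustrative. A matching digest shows the data is intact, but it does not show that the application starts or that the RTO was met.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def restore_matches_source(recorded_digest: str, restored_data: bytes) -> bool:
    """True when the restored bytes hash to the digest recorded at backup time."""
    return sha256_hex(restored_data) == recorded_digest


source_data = b"customer ledger, snapshot 1"  # illustrative payload
recorded_digest = sha256_hex(source_data)      # stored alongside the backup

print(restore_matches_source(recorded_digest, source_data))           # True
print(restore_matches_source(recorded_digest, b"customer ledger, s"))  # False
```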

Common Exam Traps

Single availability zone: Never acceptable for production. Always distribute across zones.
Untested backups: A backup that has never been tested is not a backup. The exam expects regular restore testing.
Technical RTO/RPO decisions: Business defines the objectives. Technology implements them.
Ignoring data sovereignty: Multi-region deployment must respect jurisdictional data requirements.

Key Takeaways for the Exam

Availability is a security objective within the CIA triad. Cloud resilience uses availability zones and regions for redundancy. RTO and RPO are business-defined metrics that drive technical architecture. Resilience patterns range from backup-and-restore (cheap, slow) to active-active (expensive, fast). Backups must be secured, encrypted, and tested. Testing resilience is as important as building it.