To ensure data integrity, availability, and fast recovery, our AWS RDS setup uses a multi-layered backup strategy built on Snapshots, Point-in-Time Recovery (PITR), and Multi-AZ replication. This approach is designed to let us restore data efficiently after accidental deletion, system failures, or AWS infrastructure outages.
Backup & Recovery Components
We utilize three key AWS backup mechanisms, each serving a specific purpose:
1. Snapshots (Manual & Automated)
- What It Does: Captures the storage volume of the entire DB instance (not just individual databases) at a specific moment.
- Use Case: Used for disaster recovery, long-term backups, and cross-account copies.
- How It Works:
- We take daily automated snapshots of our production databases.
- Additional snapshots are created before major changes or deployments.
- Snapshots are retained for 30 days. Restoring one creates a new DB instance rather than overwriting the existing one.
- Limitations: Snapshots capture only the state at the time they are taken. Changes after the snapshot starts are not included.
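The pre-deployment snapshot step described above could be scripted roughly as follows. This is a minimal sketch, not our actual tooling: it assumes `rds_client` is a boto3 RDS client (`boto3.client("rds")`) passed in by the caller, and the function names, the `reason` tag, and the naming scheme are hypothetical.

```python
from datetime import datetime, timezone


def snapshot_identifier(instance_id: str, reason: str, now: datetime) -> str:
    """Build a name such as 'orders-db-pre-deploy-20240115-093000' (hypothetical scheme)."""
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{instance_id}-{reason}-{stamp}"


def create_pre_change_snapshot(rds_client, instance_id: str, reason: str = "pre-deploy") -> str:
    """Request a manual snapshot and return its identifier.

    rds_client is assumed to be boto3.client("rds") or a compatible object.
    The call only starts the snapshot; it does not wait for completion.
    """
    snapshot_id = snapshot_identifier(instance_id, reason, datetime.now(timezone.utc))
    rds_client.create_db_snapshot(
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceIdentifier=instance_id,
        Tags=[{"Key": "reason", "Value": reason}],
    )
    return snapshot_id
```

Passing the client in as an argument keeps the function easy to exercise with a stub before it is pointed at a real account. A deployment script would typically wait for the snapshot to reach the `available` state before proceeding with the change.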
2. Point-in-Time Recovery (PITR)
- What It Does: Allows restoration of an RDS instance to any second within the backup retention window (up to 35 days), up to the latest restorable time, which is typically within the last five minutes.
- Use Case: Recovers from human errors like accidental deletions, bad queries, or unwanted schema changes.
- How It Works:
- PITR combines the daily automated backups with transaction logs that RDS continuously uploads to Amazon S3.
- If a mistake happens, we can create a new RDS instance from a specific second before the error.
- The recovered instance is then used to extract and restore lost data.
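- Limitations: PITR depends on automated backups. By default these are removed when the RDS instance is deleted, so PITR is no longer possible afterwards unless the automated backups were explicitly retained at deletion time.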
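The PITR workflow above (pick a second before the error, restore to a new instance) might look like this in code. It is a sketch under assumptions: `rds_client` is taken to be a boto3 RDS client supplied by the caller, the function names are hypothetical, and the lower bound of the restore window is approximated as "now minus the retention period" rather than read from AWS.

```python
from datetime import datetime, timedelta, timezone


def validate_restore_time(restore_time, latest_restorable, retention_days, now):
    """Reject restore times outside the (approximate) PITR window.

    All datetimes must be timezone-aware; mixing naive and aware values
    raises TypeError on comparison.
    """
    if restore_time > latest_restorable:
        raise ValueError("restore time is later than the latest restorable time")
    if restore_time < now - timedelta(days=retention_days):
        raise ValueError("restore time is older than the backup retention window")


def restore_before_error(rds_client, source_id: str, target_id: str, restore_time: datetime) -> str:
    """Create a NEW instance (target_id) holding the state of source_id at restore_time."""
    instance = rds_client.describe_db_instances(DBInstanceIdentifier=source_id)["DBInstances"][0]
    validate_restore_time(
        restore_time,
        instance["LatestRestorableTime"],
        instance["BackupRetentionPeriod"],
        datetime.now(timezone.utc),
    )
    rds_client.restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier=source_id,
        TargetDBInstanceIdentifier=target_id,
        RestoreTime=restore_time,
    )
    return target_id
```

The source instance is left untouched. The lost rows or schema objects are then copied from the recovered instance back into production, after which the recovered instance can be removed.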
3. Multi-AZ Replication (High Availability, Not a Backup)
- What It Does: Creates a standby replica of the database in a separate Availability Zone (AZ) for failover protection.
- Use Case: Ensures high availability, minimizing downtime if the primary instance fails.
- How It Works:
- AWS synchronously replicates data between the primary and standby instances.
- If hardware fails in one AZ, AWS automatically promotes the standby.
- For most database engines, snapshots are taken from the standby, which reduces the performance impact on production.
- Limitations: Multi-AZ does not protect against data corruption or accidental deletions, because every change, including a destructive one, is synchronously replicated to the standby.
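Automatic failover is only useful if it has been exercised. One possible way to run a failover drill is sketched below: it assumes `rds_client` is a boto3 RDS client supplied by the caller, and the function name is hypothetical. It refuses to act on an instance that does not report Multi-AZ.

```python
def force_failover(rds_client, instance_id: str) -> bool:
    """Reboot a Multi-AZ instance with failover so the standby is promoted.

    Returns False without rebooting if the instance is not Multi-AZ.
    rds_client is assumed to be boto3.client("rds") or a compatible object.
    """
    instance = rds_client.describe_db_instances(DBInstanceIdentifier=instance_id)["DBInstances"][0]
    if not instance.get("MultiAZ", False):
        return False
    rds_client.reboot_db_instance(DBInstanceIdentifier=instance_id, ForceFailover=True)
    return True
```

A drill like this causes a brief interruption while the standby takes over, so it belongs in a maintenance window or a staging environment.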
Why This Strategy Works
By combining Snapshots, PITR, and Multi-AZ, we ensure:
- Protection from accidental data loss (PITR recovers deleted or changed data).
- Disaster recovery readiness (Snapshots provide full backups for major failures).
- High availability (Multi-AZ prevents downtime from hardware failures).
- Minimal performance impact (Snapshots run on the standby instance in Multi-AZ setups).
This multi-layered approach balances speed, reliability, and cost-effectiveness, keeping our systems resilient against the most common failure modes: human error, instance failure, and the loss of an Availability Zone.
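Because each layer depends on an instance setting, a periodic configuration check can confirm the strategy is actually in effect. The sketch below is illustrative only: `audit_instance` is a hypothetical helper, and it expects one entry from the `DBInstances` list returned by the boto3 call `describe_db_instances`. The 30-day default mirrors the retention period stated above.

```python
from typing import List


def audit_instance(instance: dict, min_retention_days: int = 30) -> List[str]:
    """Return a list of findings; an empty list means all checks passed."""
    findings = []
    # A retention period of 0 means automated backups (and therefore PITR) are off.
    retention = instance.get("BackupRetentionPeriod", 0)
    if retention == 0:
        findings.append("automated backups are disabled, so PITR is unavailable")
    elif retention < min_retention_days:
        findings.append(
            f"backup retention is {retention} days, expected at least {min_retention_days}"
        )
    if not instance.get("MultiAZ", False):
        findings.append("Multi-AZ is disabled")
    return findings
```

For example, `audit_instance({"BackupRetentionPeriod": 7, "MultiAZ": True})` reports only the short retention period, while an instance with 30-day retention and Multi-AZ enabled produces no findings.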