5.4 Architecture Best Practices and Design Principles
Key Takeaways
- Design for failure: assume everything will fail and architect for automatic recovery using Multi-AZ, Auto Scaling, and health checks.
- Loose coupling means components interact through well-defined interfaces (APIs, queues) so failures do not cascade.
- Horizontal scaling (adding more instances) is preferred over vertical scaling (getting a bigger instance) for reliability.
- Serverless architectures remove the operational burden of managing servers and scale automatically with demand.
- The AWS Well-Architected Tool helps you review workloads against best practices from the six pillars.
Design for Failure
"Everything fails, all the time." — Werner Vogels, CTO of Amazon
This philosophy drives AWS architecture best practices. Build systems that expect and handle failures gracefully.
Strategies for Fault Tolerance
| Strategy | Implementation |
|---|---|
| Multi-AZ Deployment | Deploy resources across multiple AZs for resilience |
| Auto Scaling | Automatically replace failed instances |
| Health Checks | ELB and Route 53 detect unhealthy resources and route traffic away from them |
| Backup and Recovery | Automated backups, snapshots, cross-Region replication |
| Decoupled Architecture | Use SQS/SNS so component failures do not cascade |
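To make the health-check row concrete, here is a minimal sketch of routing traffic away from an unhealthy target. It is a plain-Python simulation, not the ELB or Route 53 API: the `Target` class, the `route` function, and the `web-*` and `az-*` names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """A hypothetical backend, e.g. an instance in one Availability Zone."""
    name: str
    az: str
    healthy: bool = True


def healthy_targets(targets):
    """Return only the targets that passed their most recent health check."""
    return [t for t in targets if t.healthy]


def route(targets, request_id):
    """Pick a healthy target for a request (simple modulo round-robin)."""
    candidates = healthy_targets(targets)
    if not candidates:
        raise RuntimeError("no healthy targets available")
    return candidates[request_id % len(candidates)]


fleet = [
    Target("web-1", "az-a"),
    Target("web-2", "az-b"),
    Target("web-3", "az-c"),
]

# Simulate a failure in one AZ: traffic keeps flowing to the others.
fleet[0].healthy = False
served_by = {route(fleet, i).name for i in range(10)}
```

After `web-1` is marked unhealthy, every request lands on one of the two remaining targets, which is the behavior the table attributes to health checks combined with a Multi-AZ deployment.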
Loose Coupling
Loose coupling means components interact through well-defined interfaces (APIs, message queues, event buses) rather than direct dependencies. If one component fails, others continue to function.
Tightly Coupled vs. Loosely Coupled
| Tightly Coupled | Loosely Coupled |
|---|---|
| Components directly call each other | Components communicate via queues or APIs |
| Failure in one = failure in all | Failure in one is isolated |
| Scaling requires scaling all components | Components scale independently |
| Changes require coordination | Components can be updated independently |
AWS services for loose coupling:
- Amazon SQS — Message queues between components
- Amazon SNS — Pub/sub notifications
- Amazon EventBridge — Event-driven architecture
- AWS Step Functions — Workflow orchestration
- Amazon API Gateway — API interface between components
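The effect of putting a queue between two components can be sketched with Python's standard-library `queue` module. This is only a stand-in for a managed queue such as Amazon SQS and does not use the SQS API; the `producer` and `consumer` functions and the order IDs are hypothetical.

```python
import queue

# Stand-in for a message queue such as Amazon SQS (this is NOT the SQS API).
orders = queue.Queue()


def producer(order_ids):
    """The front end only enqueues work; it never calls the worker directly."""
    for order_id in order_ids:
        orders.put(order_id)


def consumer(max_messages):
    """The worker drains up to max_messages messages whenever it is running."""
    processed = []
    for _ in range(max_messages):
        try:
            processed.append(orders.get_nowait())
        except queue.Empty:
            break
    return processed


# The worker is "down" while the producer keeps accepting orders.
producer([101, 102, 103])
backlog = orders.qsize()  # messages wait in the queue; nothing is lost

# The worker recovers and catches up independently of the producer.
handled = consumer(max_messages=10)
```

Because the producer depends only on the queue, a worker outage shows up as a growing backlog rather than as a failed request, and the two sides can be scaled or updated separately.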
Elasticity and Scalability
Horizontal vs. Vertical Scaling
| | Horizontal Scaling (Scale Out/In) | Vertical Scaling (Scale Up/Down) |
|---|---|---|
| Method | Add/remove more instances | Get a bigger/smaller instance |
| Limit | Virtually unlimited | Limited by largest instance type |
| Downtime | None (add instances behind load balancer) | Often requires restart |
| Availability | Better (multiple instances = no SPOF) | Single point of failure |
| Example | 4 x t3.large instead of 1 x m5.4xlarge | Upgrade from t3.large to m5.4xlarge |
On the Exam: AWS generally favors horizontal scaling because it is more resilient (no single point of failure) and offers virtually unlimited scaling. Questions about scalability and availability usually point toward horizontal scaling + load balancing.
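A rough calculation shows why several instances are more available than one. The sketch below assumes instances fail independently and uses an illustrative 99% per-instance availability; neither assumption is an AWS figure or SLA, and `fleet_availability` is a hypothetical helper.

```python
def fleet_availability(instance_availability, count):
    """Probability that at least one of `count` independent instances is up."""
    return 1 - (1 - instance_availability) ** count


# Assumed, illustrative per-instance availability of 99%.
single = fleet_availability(0.99, 1)
four = fleet_availability(0.99, 4)
```

Under these assumptions, the chance that all four instances are down at once is 0.01 to the fourth power, so `four` is far closer to 1 than `single`. Real failures are often correlated (for example, within one AZ), which is why instances are also spread across AZs.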
Serverless Architecture
Serverless means the servers still exist, but you do not provision or manage them. AWS handles provisioning, scaling, patching, and availability.
Common Serverless Architecture
| Component | Service |
|---|---|
| API endpoint | API Gateway |
| Business logic | Lambda |
| Data storage | DynamoDB |
| File storage | S3 |
| Authentication | Cognito |
| Notifications | SNS |
| Workflow | Step Functions |
Benefits of serverless:
- No server management
- Automatic scaling
- Pay-per-use pricing
- Built-in high availability
- Reduced operational overhead
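As a sketch of the "business logic" row, the function below follows the usual shape of a Python Lambda handler behind an API Gateway proxy integration: it receives an `event` dictionary and returns a dictionary with `statusCode` and `body`. The greeting logic and the `name` query parameter are hypothetical, and the example is invoked locally rather than deployed.

```python
import json


def handler(event, context):
    """Lambda-style entry point for an API Gateway proxy request."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local invocation with a hand-built event; no AWS account is involved.
response = handler({"queryStringParameters": {"name": "exam"}}, None)
```

The code contains no server setup, port binding, or scaling logic; in a serverless design those concerns are left to the platform.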
High Availability Patterns
Multi-AZ Architecture
Deploy resources across multiple AZs within a Region:
- ELB distributes traffic across AZs
- EC2 instances in multiple AZs
- RDS Multi-AZ for database failover
- S3 automatically stores data across AZs
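The idea of spreading EC2 instances across AZs can be illustrated with a small placement simulation. The AZ names and the `spread_across_azs` helper are hypothetical; in practice a service such as EC2 Auto Scaling handles this balancing for you.

```python
from collections import Counter
from itertools import cycle


def spread_across_azs(instance_count, azs):
    """Assign instances to AZs round-robin so the fleet is evenly spread."""
    az_cycle = cycle(azs)
    return [next(az_cycle) for _ in range(instance_count)]


placement = spread_across_azs(6, ["az-a", "az-b", "az-c"])
per_az = Counter(placement)

# If one AZ becomes unavailable, the instances in the others keep serving.
survivors = [az for az in placement if az != "az-a"]
```

With six instances in three AZs, losing one AZ removes two instances and leaves four running, so the application stays up at reduced capacity.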
Multi-Region Architecture
For the highest level of fault tolerance and disaster recovery:
- Route 53 failover routing between Regions
- S3 Cross-Region Replication
- DynamoDB Global Tables
- Aurora Global Database
Disaster Recovery Strategies
| Strategy | RTO/RPO | Cost | Description |
|---|---|---|---|
| Backup & Restore | Hours | $ | Back up data, restore when needed |
| Pilot Light | Tens of minutes | $$ | Core infrastructure running, scale up when needed |
| Warm Standby | Minutes | $$$ | Scaled-down version running, scale up when needed |
| Multi-Site Active/Active | Near zero | $$$$ | Full production in multiple Regions |
On the Exam: Know the four DR strategies and their tradeoffs. Backup & Restore is cheapest but slowest. Multi-Site is fastest but most expensive.
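The cost-versus-speed tradeoff can be expressed as a small lookup: choose the cheapest strategy that still meets a required recovery time. The minute values below are assumptions picked only to match the table's rough ordering (hours, tens of minutes, minutes, near zero); they are not AWS figures, and `cheapest_strategy` is a hypothetical helper.

```python
# Illustrative worst-case recovery times in minutes. These numbers are
# assumptions chosen to match the table's rough ordering, not AWS figures.
STRATEGIES = [
    ("Backup & Restore", 24 * 60),
    ("Pilot Light", 60),
    ("Warm Standby", 10),
    ("Multi-Site Active/Active", 1),
]  # ordered cheapest first


def cheapest_strategy(required_rto_minutes):
    """Return the cheapest strategy whose assumed RTO meets the requirement."""
    for name, rto_minutes in STRATEGIES:
        if rto_minutes <= required_rto_minutes:
            return name
    return None  # no listed strategy is fast enough
```

For example, under these assumed numbers a workload that can tolerate 30 minutes of downtime maps to Warm Standby, while one that can tolerate two days maps to Backup & Restore.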
Key Architecture Questions for the Exam
| Question Pattern | Think About |
|---|---|
| "Most highly available" | Multi-AZ, Auto Scaling, ELB |
| "Most cost-effective" | Right-sizing, Savings Plans, Spot, serverless |
| "Most secure" | Least privilege, encryption, private subnets |
| "Decouple components" | SQS, SNS, EventBridge |
| "Reduce operational overhead" | Managed services, serverless |
| "Global low latency" | CloudFront, Route 53 latency routing, Global Accelerator |
Which of the following is an example of loose coupling in cloud architecture?
Why does AWS generally recommend horizontal scaling over vertical scaling?
Which disaster recovery strategy provides the LOWEST cost but the LONGEST recovery time?
A company wants to build a REST API with no server management, automatic scaling, and pay-per-request pricing. Which combination of services should they use?
Which TWO strategies help achieve high availability for a web application on AWS? (Select TWO)