AWS OpenSearch: Multi-AZ with Standby - Practical Trade-offs & Design Guidance

Introduction

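Amazon OpenSearch Service's Multi-AZ with Standby option improves resilience to Availability Zone (AZ) failures. In real-world deployments, it also brings strict configuration requirements, higher cost, and extra operational complexity.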

Strict Configuration Requirements

- Requires 3 Availability Zones

- At least 2 replica shards per index, so every shard has 3 copies

- Shard copies must be distributed evenly across the AZs

- Result: roughly 3x the primary storage footprint, with matching cost overhead
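The requirements above can be expressed as a simple pre-flight check. This is a minimal sketch: `DomainPlan` and its fields are hypothetical names, not part of any AWS API. The check that the data node count divides evenly across the three AZs is an assumption derived from the even-distribution requirement, not a quoted AWS rule.

```python
from dataclasses import dataclass


@dataclass
class DomainPlan:
    """Hypothetical description of a planned domain (illustrative fields only)."""
    availability_zones: int
    data_nodes: int
    replicas: int  # replica count per index; total copies = replicas + 1


def check_standby_plan(plan: DomainPlan) -> list[str]:
    """Return a list of problems; an empty list means these checks pass."""
    problems = []
    if plan.availability_zones != 3:
        problems.append("Multi-AZ with Standby needs 3 AZs")
    if plan.replicas < 2:
        problems.append("need at least 2 replicas so each AZ can hold a copy")
    # Assumption: nodes must split evenly so each AZ holds a full copy.
    if plan.data_nodes % 3 != 0:
        problems.append("data nodes do not divide evenly across 3 AZs")
    return problems


def storage_multiplier(replicas: int) -> int:
    """Each shard is stored once as a primary plus once per replica."""
    return replicas + 1


if __name__ == "__main__":
    print(check_standby_plan(DomainPlan(3, 6, 2)))  # []
    print(storage_multiplier(2))                    # 3
```

With 2 replicas, the multiplier of 3 is where the storage and cost overhead comes from.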

Capacity Planning Challenges

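- The standby AZ is not idle: it holds a full copy of the data and is billed like the active AZs

- Capacity must be over-provisioned, because only two of the three AZs serve traffic

- During failover, the active nodes must absorb the load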
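The over-provisioning arithmetic can be sketched as follows, assuming that two active AZs must carry peak load on their own and that the standby AZ mirrors their node count. The load unit, the per-node capacity, and the 70% utilization target are hypothetical placeholders. Replace them with figures from your own load tests.

```python
import math


def nodes_per_az(peak_load: float, per_node_capacity: float,
                 target_utilization: float = 0.7, active_azs: int = 2) -> int:
    """Nodes needed in each AZ so the active AZs alone carry peak load.

    peak_load and per_node_capacity share a (hypothetical) unit,
    e.g. requests per second measured in your own tests.
    """
    usable_per_node = per_node_capacity * target_utilization
    active_nodes = math.ceil(peak_load / usable_per_node)
    # Round up so active nodes split evenly across the active AZs.
    return math.ceil(active_nodes / active_azs)


def total_data_nodes(per_az: int, total_azs: int = 3) -> int:
    """The standby AZ mirrors each active AZ, so provision the same count there."""
    return per_az * total_azs


if __name__ == "__main__":
    per_az = nodes_per_az(10_000, 1_000)
    print(per_az, total_data_nodes(per_az))  # 8 24
```

In this example, 15 nodes' worth of active capacity is needed, but the AZ layout rounds that up to 16 active nodes, and the mirrored standby AZ brings the total to 24.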

Shard Distribution Complexity

- Uneven shard distribution creates hotspots: a few nodes serve a disproportionate share of requests

- Those hotspots hurt performance and stability, most visibly during failover
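One way to spot distribution problems is to map each shard to the AZs holding its copies. Shards with two copies in the same AZ, or AZs that carry far more copies than the others, can then be flagged. This is a minimal sketch. The input mapping is hypothetical: you might build it from the `_cat/shards` API joined with the AZ of each node, but how you obtain node-to-AZ information is left open here.

```python
from collections import Counter


def check_distribution(copies: dict[str, list[str]],
                       zones: list[str]) -> tuple[list[str], Counter]:
    """copies maps a shard id to the AZ of each of its copies.

    Returns shards whose copies share an AZ, plus copies per AZ.
    """
    colocated = [s for s, azs in copies.items() if len(set(azs)) != len(azs)]
    per_az = Counter({z: 0 for z in zones})  # include AZs with zero copies
    per_az.update(az for azs in copies.values() for az in azs)
    return colocated, per_az


def skew(per_az: Counter) -> float:
    """Busiest AZ divided by quietest AZ; 1.0 means perfectly even."""
    lowest = min(per_az.values())
    return float("inf") if lowest == 0 else max(per_az.values()) / lowest


if __name__ == "__main__":
    zones = ["az-a", "az-b", "az-c"]  # hypothetical AZ names
    copies = {
        "idx-0": ["az-a", "az-b", "az-c"],
        "idx-1": ["az-a", "az-a", "az-b"],  # two copies in az-a
    }
    bad, counts = check_distribution(copies, zones)
    print(bad, skew(counts))  # ['idx-1'] 3.0
```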

Failover Behavior

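- Promoting standby nodes takes time, so failover is not instantaneous

- Expect temporary latency and throughput degradation while traffic shifts

- Re-replicating shards to restore full copy counts adds network, disk, and CPU overhead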
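Two rough estimates help quantify failover impact. The first is how much average load rises on surviving nodes when some nodes drop out. The second is how long re-copying an AZ's worth of shard data takes at a sustained transfer rate. This is a back-of-the-envelope sketch. The throughput figure is a hypothetical placeholder, and real recovery speed depends on cluster settings, instance types, and concurrent traffic.

```python
def load_per_node(total_load: float, active_nodes: int, nodes_lost: int = 0) -> float:
    """Average load on each surviving active node."""
    remaining = active_nodes - nodes_lost
    if remaining <= 0:
        raise ValueError("no active nodes left to serve traffic")
    return total_load / remaining


def recovery_hours(lost_copy_gb: float, recovery_mb_per_s: float) -> float:
    """Hours to re-copy lost shard data at a sustained (hypothetical) rate."""
    seconds = lost_copy_gb * 1024 / recovery_mb_per_s
    return seconds / 3600


if __name__ == "__main__":
    print(load_per_node(12_000, 12))     # 1000.0 before failure
    print(load_per_node(12_000, 12, 4))  # 1500.0 until capacity is restored
    print(recovery_hours(900, 256))      # 1.0
```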

Cost vs Value

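- Higher infrastructure cost: a full AZ of data nodes sits in standby

- Increased storage (three copies of every shard) and cross-AZ data transfer for replication

- Justify the spend against the measurable business impact of downtime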
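The justification can be framed as a simple break-even comparison. If the downtime cost that standby is expected to avoid each year exceeds its extra annual cost, the option pays for itself. This is a minimal sketch. All inputs are hypothetical estimates you would supply yourself, and it ignores harder-to-price factors such as reputation or contractual penalties.

```python
def standby_worth_it(extra_cost_per_year: float,
                     downtime_hours_avoided_per_year: float,
                     cost_per_downtime_hour: float) -> bool:
    """True if the expected downtime cost avoided exceeds the extra spend."""
    avoided = downtime_hours_avoided_per_year * cost_per_downtime_hour
    return avoided > extra_cost_per_year


def break_even_hourly_cost(extra_cost_per_year: float,
                           downtime_hours_avoided_per_year: float) -> float:
    """Downtime cost per hour at which standby exactly pays for itself."""
    return extra_cost_per_year / downtime_hours_avoided_per_year


if __name__ == "__main__":
    print(standby_worth_it(60_000, 4, 20_000))  # True
    print(break_even_hourly_cost(60_000, 4))    # 15000.0
```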

When to Use Standby

- Mission-critical applications

- Strict availability SLAs

- Workloads where downtime directly costs revenue

When Not to Use Standby

- Internal tools

- Analytics workloads that can tolerate brief interruptions

- Non-critical environments such as development and test
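The two lists above can be condensed into a rough rule of thumb. This is a hypothetical helper, not an AWS tool. It recommends standby for production workloads that show at least one "when to use" signal, and otherwise falls back to a simpler setup. In this sketch, a critical signal takes precedence over workload type, so an internal tool with a strict SLA would still qualify. That is a judgment call you may want to change.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload profile; fields mirror the criteria above."""
    mission_critical: bool
    strict_sla: bool
    downtime_costs_revenue: bool
    production: bool = True  # False for dev/test and other non-critical envs


def recommend_standby(w: Workload) -> bool:
    # Non-critical environments rarely justify the extra cost.
    if not w.production:
        return False
    # Any "when to use" signal tips the balance toward standby.
    return w.mission_critical or w.strict_sla or w.downtime_costs_revenue


if __name__ == "__main__":
    checkout = Workload(True, True, True)
    dashboard = Workload(False, False, False)  # e.g. internal analytics
    print(recommend_standby(checkout), recommend_standby(dashboard))  # True False
```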

Final Thought

Aim for right-sized resilience, not maximum complexity.