Multi-AZ vs. Multi-Region in the Cloud

March 19, 2024

When architecting your application’s infrastructure in the cloud for better uptime and business continuity, you will likely face the question of using multiple availability zones (AZ) vs. multiple cloud regions vs. both multiple regions and multiple AZs per region. In this post, we will discuss how these options compare and how to use them to achieve your objectives. The information is backed by 7 years of our experience at FlashGrid enabling mission-critical Oracle databases on AWS and Azure clouds for organizations of various types and sizes.

Fast forward to the summary: Use multi-AZ for high availability and uptime SLA of 99.99% or higher. Multi-AZ can also protect you against data center outages or “local disasters”, which happen more often than major disasters. Use multiple regions for protection against a regional cloud service outage or a major disaster. The table below sums it up.

	Multi-AZ	Multi-Region
Active-Active HA	Yes	No
Active-Passive HA	Yes	No
Protection against Data Center Outage	Yes	Yes
Protection against Regional Cloud Service Outage	No	Yes
Protection against a Local Disaster	Yes	Yes
Recovery from a Major Disaster	No	Yes
Latency	<1 ms	10-100 ms
Data Replication	Synchronous	Asynchronous
RPO	Zero	Non-Zero

To learn more, read on.

Regions and Availability Zones

AWS, Azure, and Google clouds have multiple independent regions. The fact that the regions are operated independently provides protection against most cloud outage types. Typically, there are long distances between regions, which makes the use of multiple regions suitable for protection against even major disasters. However, this comes at the cost of high latencies, which makes the use of synchronous data replication between regions impractical.

Each region is partitioned into three or more availability zones (except for a few Azure regions that do not yet have AZs). Each availability zone consists of one or more discrete data centers housed in separate facilities, each with redundant power, networking, and connectivity. Each availability zone is physically separate, so local disasters like fires or flooding would affect one AZ only.

Although availability zones within a region are geographically isolated from each other, they have direct low-latency network connectivity between them, making them the more practical choice for synchronous data replication.

High Availability and Fault Tolerance

A highly available architecture ensures that the application service continues running with minimal or no downtime when any of its parts is taken offline for maintenance or fails unexpectedly. Keeping the services up through planned maintenance events is achievable through redundancy or failover of resources located in the same data center or availability zone. However, spreading the resources across availability zones is highly desirable for better fault tolerance.

With “traditional” self-built and self-managed data centers, implementing and maintaining an Active-Active or even Active-Passive HA architecture spanning multiple data centers could be prohibitively complex and expensive. But today, by leveraging cloud availability zones and Infrastructure-as-Code (IaC), it is easy to implement such HA architectures using out-of-the-box blueprints and templates.

The uptime SLAs are getting stricter with more critical services now delivered directly to consumers through online and mobile apps. If one hour of unplanned downtime is unacceptable for your service then you probably need uptime SLA of 99.99% or higher, which requires the use of multiple availability zones. AWS, Azure and GCP all have uptime SLAs of 99.9% for a single VM and 99.99% for a pair of VMs spread across two AZs.

An uptime SLA of 99.999%, which is only 5 minutes of downtime per year, is achievable by combining the multi-AZ approach, double redundancy, active-active HA, and strict maintenance and change management processes. The most challenging may be configuring the active-active HA in the database tier. In the case of Oracle Database, Oracle RAC clustering technology can be used for active-active database HA across availability zones.

Data Center Outages

A data center outage is characterized by a significant percentage of resources located in the same data center failing simultaneously. Possible reasons for such outages range from software configuration errors from cloud providers to backbone switch failures and weather events causing temporary power and cooling outages.

At FlashGrid, we’ve seen that the most frequent cause has been storage service outages lasting several hours. FlashGrid customers with multi-AZ clustering got through the outages with minimal or zero disruption. After the outage, some unlucky customers with single-AZ architecture in the affected data centers decided to switch to multi-AZ.

Other notable examples of cloud data center outages:

Cooling failures affected data centers of Google Cloud for 14 hours and Oracle Cloud for 22 hours in the UK on July 19, 2022, due to extremely hot weather. Only one availability zone was affected in Google Cloud: the scope of impact in Oracle Cloud is less clear.
A lightning storm in Sydney on August 30, 2023, affected the power supply of the cooling system in the data center shared by Azure Cloud and Oracle Cloud. In Azure Cloud, only one of the three availability zones was affected. Unfortunately, Oracle Cloud customers had only this availability zone (or “availability domain” in Oracle Cloud’s terminology) in that region.

There are alternative mechanisms for reducing the chances of being affected by a data center outage, such as spread placement groups in AWS or availability sets in Azure. However, they are limited in their ability to protect against some failures and thus are not a proper substitute for a multi-AZ architecture.

Regional Cloud Service Outages

Some of the cloud services, such as Amazon S3, are “regional” (vs. “zonal”), which means the distribution of the service resources across availability zones is managed by the cloud provider, and not by the customer. A regional cloud service typically has sufficient redundancy and fault tolerance to protect it against most failures. However, outages of a regional cloud service may still happen and have happened in the past; most are caused by software or configuration errors.

Any critical dependencies of a “zonal” resource (e.g. a database or an application instance located in a particular AZ) on regional services should be eliminated or minimized.

If your application stack has a critical dependency on a regional service, it may be a good idea to plan for a failover to a different region. Having the entire application stack in a “warm” state in a different region is very helpful for reducing the RTO (Recovery Time Objective).

A crucial difference between this scenario and a major disaster affecting the entire region is that a service outage will likely be resolved within hours and without data loss or infrastructure loss. This means you can switch back to the previously affected region without having to rebuild your data or infrastructure.

Local Disasters affecting a single AZ

A local disaster like fire or flooding may take a data center offline for a long time or even destroy it completely.

While a multi-AZ configuration is not strictly a disaster recovery solution, it is an important part of protecting against local disasters. Local disasters are more frequent than major disasters affecting an entire region, making protection against them even more valuable.

Let’s consider an extreme case where an entire availability zone is destroyed by a fire. A multi-AZ configuration will allow you to keep your application stack running in the other unaffected zones until you are ready to redeploy the affected resources to a new availability zone or do a graceful switch-over of the entire stack to another region — with no data loss.

Major Disasters affecting 2+ AZs or an entire Region

Because availability zones are geographically separate, a disaster that takes down two or more AZs simultaneously for an extended period of time must be of extremely high magnitude: a major earthquake, tsunami, or war. Thankfully, events like these are rare, and none have yet impacted AWS, Azure, or GCP. Nevertheless, a Disaster Recovery (DR) plan must be in place to account for eventualities affecting a whole region.

The length of downtime an organization can afford during a major regional disaster will dictate the DR plan: from restoring all data and infrastructure from scratch in a different region to keeping a full replica of the infrastructure and a “warm” copy of the data in a second region.

Latencies between AZs

We have seen that the number one concern about a multi-AZ setup for database clusters (and Oracle RAC clusters specifically) is the extra latency between availability zones. Although it is true that latency between AZs is significantly higher compared to latencies within the same AZ, they are still low enough for synchronous replication or data mirroring. This includes very latency-sensitive operations, such as Cache Fusion in Oracle RAC database clusters.

Network latency between availability zones usually varies between 0.3 and 2 ms (milliseconds) depending on the region and specific pairs of AZs, but in most cases, it is less than 1 ms. This makes the multi-AZ deployments compliant with Oracle guidelines for extended-distance database clusters. You can also find and use the AZ pair with the least latency; for example, in the AWS us-west-2 region, usw2-az1 and usw2-az3 have a 0.3 ms latency between them. To learn how to identify inter-AZ latencies and set up the best configuration, read this article from FlashGrid’s knowledge base.

Latencies between regions

In most cases, latencies between regions are in the order of 10 to 100 ms (milliseconds) — prohibitively high for synchronous replication or mirroring of data. We cannot wait for a block of data to arrive in another region before “committing” a write; otherwise, writes would be too slow. Instead, asynchronous replication must be used between regions, which means the most recent data changes will not be in the secondary replica if the primary replica fails.

This makes the cross-region replication unsuitable protection against types of failures that are relatively frequent. Most cannot afford the cost of losing transactions so often.

However, when protecting against extremely rare major disasters that could abruptly take down an entire region, asynchronous DR replication has an acceptably small risk of losing the most recent transactions: it’s a reasonable tradeoff.

Achieving zero RPO (Recovery Point Objective)

Zero RPO means no data is lost in case an unexpected failure or a disaster hits. It is desirable and may be required in many cases, especially for transactional databases. The reason is simple – you don’t need to worry about how much and what type of data is lost. Dealing with even a small amount of data loss may lead to costly complications.

Achieving zero RPO is one of the foremost advantages of a multi-AZ configuration. A well-designed multi-AZ architecture protects against most infrastructure failure types and has a latency low enough for synchronous replication or data mirroring.

The cost of inter-AZ traffic

The cost of traffic within a single AZ is zero, but the traffic between AZs is not free. The price is typically USD 0.02 per GB (gigabyte). With large amounts of data transferred between AZs the cost of the traffic can become high. However, with a well-designed architecture and proper management of resources, the inter-AZ traffic costs can usually be minimized.

A good starting point is first to estimate the costs; here is an example: let’s assume a 100 TB database with 30% updates over 1 year plus 30% growth over 1 year. This means there are 60 TB of total data changes. Then, there is a multiplication factor, typically between 2 and 6, because of data-change duplication, such as write to the data file plus to the online (redo) log file. You can measure the precise multiplication factor in a test setup, but in this example, we will assume the multiplication factor is 3, which results in a total of 180 TB of data transferred between AZs over 1 year. Therefore, the yearly estimated price of traffic would be USD 3,600 (= 180,000 GB x 0.02 USD/GB). This cost may seem substantial, but it is orders of magnitude lower than the total costs of running a 100 TB database.

An important pitfall to avoid with inter-AZ traffic is the cost of continuously running write-intensive “test” workloads in a multi-AZ configuration because it would generate very large amounts of traffic. You can either limit the duration of such tests or use a single-AZ setup for the test environment.

Conclusion

Availability zones and multi-AZ configurations in the public clouds enable a new level of resiliency and fault-tolerance at a much lower cost than traditional self-managed “on-premises” data centers. The opportunity to leverage multi-AZ architectures for achieving better uptime SLA is a major reason to migrate to the cloud. Multi-AZ must be the default approach to high availability, fault tolerance, and local disaster protection when architecting mission-critical application stacks. Multi-region architectures and DR strategies can be layered on top of multi-AZ and will depend on specific business requirements and objectives.

To learn more about multi-AZ and multi-region configurations and discover how FlashGrid Cluster increases the uptime of Oracle databases, including Oracle RAC, on AWS, Azure, and GCP, contact us today.

External resources

Multi-AZ vs. Multi-Region in the Cloud

Share with

Regions and Availability Zones

High Availability and Fault Tolerance

Data Center Outages

Regional Cloud Service Outages

Local Disasters affecting a single AZ

Major Disasters affecting 2+ AZs or an entire Region

Latencies between AZs

Latencies between regions

Achieving zero RPO (Recovery Point Objective)

The cost of inter-AZ traffic

Conclusion

Share with

Subscribe to our updates

Subscribe to our updates

Related news

Escala 24×7 ensures non-stop operation of a money transfer application with FlashGrid Cluster and Oracle RAC database

FlashGrid now supports RHEL 9 and Oracle Linux 9

Leading SaaS vendor in the SCM space achieves High Availability on AWS with Oracle 19c SEHA and FlashGrid Cluster

External resources

FlashGrid comes 73rd in FT’s America’s Fastest Growing Companies 2023

Oracle databases, even RAC, are supported on AWS, Azure, and GCP

Gain 3x higher Oracle RAC database performance on Amazon EC2 paying less

Establishing Amazon EC2 infrastructure for Oracle RAC database migrations

Considerations for migrating heavy IO workloads to Azure, including Oracle

Related videos

1-minute introduction to FlashGrid

Oracle RAC on AWS with FlashGrid – 1-minute introduction

Oracle RAC on Azure with FlashGrid – 1-minute introduction

Your request was sent successfully

Something is wrong!

Launch FlashGrid Cluster in your AWS account now

Launch FlashGrid Cluster in your Azure account now

Launch FlashGrid Cluster in your GCP account now

Iniciar FlashGrid Cluster ahora, en tu cuenta AWS

Iniciar FlashGrid Cluster ahora, en tu cuenta Azure

Iniciar FlashGrid Cluster ahora, en tu cuenta GCP

Contact Us