Azure Storage Redundancy Explained

2018/12/28

Leveraging commercial cloud services allows increased data redundancy without having to expand and maintain your own infrastructure. A new cloud user can become overwhelmed and confused by the data redundancy options available, especially since each Cloud Service Provider (CSP) tends to use its own terminology. In this post I try to describe the data redundancy methods available in Microsoft Azure in simple terms.

Locally-Redundant Storage

The simplest method, Locally-Redundant Storage (LRS), stores three copies of data in a single data center in a Region. Picture it as three separate racks of storage gear in a single data center. It provides 11 9’s (99.999999999%) of durability for your data over a given year, quite an impressive number! But what exactly does this mean for your data? Simply put, it’s how safe your data is from unrecoverable loss due to equipment or other failures. The failure modes aren’t fully enumerated, but we can assume that a drive, a piece of hardware, or even an entire rack can fail without your data being lost. However, were an entire Azure data center (or just the storage scale unit) to suffer a catastrophic failure, your data could be lost. We can imagine what such scenarios would entail, and Azure engineers have factored in mitigation methods in assigning a durability of 11 9’s to LRS.
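To make the 11 9’s figure more concrete, here is a small arithmetic sketch. It assumes the durability number is a per-object probability of surviving one year and that losses are independent, which is a simplification; the object count is a hypothetical workload, not an Azure figure.

```python
from fractions import Fraction


def annual_loss_probability(nines: int) -> Fraction:
    """Per-object chance of loss in one year for a durability of `nines` nines."""
    return Fraction(1, 10 ** nines)


def expected_objects_lost_per_year(object_count: int, nines: int) -> Fraction:
    """Expected objects lost per year, assuming independent per-object losses."""
    return object_count * annual_loss_probability(nines)


# Hypothetical workload: 10 million objects stored with LRS (11 nines).
lost_per_year = expected_objects_lost_per_year(10_000_000, 11)
years_per_lost_object = 1 / lost_per_year

print(f"Expected objects lost per year: {float(lost_per_year)}")
print(f"Average years between single-object losses: {years_per_lost_object}")
```

Under those assumptions, ten million objects would lose one object roughly every 10,000 years on average. Exact fractions are used instead of floats because subtracting a number this close to 1 from 1 loses precision.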


Important note: Do not confuse durability with availability. Durability describes the safety of the data: it is not lost to equipment failure or degradation. Availability deals with the uptime of the data: you can access it when requested. Your data can be unavailable and still durable. For example, if a power outage or fiber cut rendered an entire data center offline, your data would be unavailable, as you would not be able to download it. However, it would still be durable, with the hard drives safely nestled in their cages, storing your data for when the center comes back online.


Zone-Redundant Storage

Zone-Redundant Storage (ZRS) provides 12 9’s of data durability. It utilizes an Azure concept called Availability Zones. Azure availability zones are physically separate data centers within a single region. Imagine the region as a campus of three or more buildings, each a data center with separate power, cooling, and network. Each building would be a separate availability zone, so your three data copies would live in three separate buildings. Everything LRS protects against is still covered, and we now also mitigate the risk of a catastrophic failure of an entire building. It would take a regional event to destroy the data. Think devastating hurricane or orbital bombardment.

Data replication for both LRS and ZRS is synchronous and transparent to the user. A write is reported as successful only after all copies have been written, with no user involvement. Failures that impact a single copy of the data go unnoticed by the user. (And remember…we’re only talking about durability, not availability.)

Geo-Redundant Storage

Geo-Redundant Storage (GRS) mitigates the impact of a regional event by replicating your data to an entirely different region. Regions in Azure are deterministically paired, so you know which secondary region will be used. As with the other methods, you get three copies of data in the primary region, plus three more copies in the secondary region. The extra money you pay for GRS results in 16 9’s of durability.
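As a rough illustration of what "deterministically paired" means, the lookup below maps a primary region to its secondary. The three pairs listed are only a partial, illustrative sample; the Azure documentation holds the authoritative and current list, and the function name is hypothetical.

```python
# Partial, illustrative sample of Azure region pairs. Consult the Azure
# documentation for the authoritative list before relying on any pairing.
REGION_PAIRS = {
    "East US": "West US",
    "East US 2": "Central US",
    "North Europe": "West Europe",
}


def secondary_region(primary: str) -> str:
    """Return the paired region for `primary` (pairs work in both directions)."""
    for region_a, region_b in REGION_PAIRS.items():
        if primary == region_a:
            return region_b
        if primary == region_b:
            return region_a
    raise KeyError(f"No pair listed for region: {primary}")


print(secondary_region("North Europe"))
```

The point is that the secondary is a fixed property of the primary region, not something chosen per storage account.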

How do I access this second copy of data, you ask? Well…the question is a bit irrelevant to the topic. We’re only talking about durability of your data, meaning safety from permanent loss. To answer the question: you don’t access the secondary copy until Azure engineers say you do. According to the docs, Azure engineers will work to return the primary region to operational status first. Failing that, they’ll make the call on when to bring data from the secondary region online. It’s cloud…you don’t have to maintain your own infrastructure, but you do have to rely on someone else to maintain theirs.

Azure does offer a way to access the data in the secondary region: Read-Access Geo-Redundant Storage (RA-GRS). For a little more money you can have read-only access to the data copy in the second region. This could allow for limited operations through a failure of a primary region, or maybe you just want to run some heavy reporting without impacting performance on your production storage account.
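With RA-GRS, the read-only copy is reached through a separate secondary endpoint, which is the account name with a "-secondary" suffix. The sketch below builds both blob endpoints; it assumes the public Azure cloud domain, and the account name and function name are hypothetical.

```python
def blob_endpoints(account_name: str) -> dict:
    """Build primary and read-only secondary blob endpoints for an RA-GRS account.

    Assumes the public Azure cloud domain (blob.core.windows.net).
    """
    domain = "blob.core.windows.net"
    return {
        "primary": f"https://{account_name}.{domain}",
        # Reads only; writes still go to the primary endpoint.
        "secondary": f"https://{account_name}-secondary.{domain}",
    }


# "examplestorage" is a made-up account name used for illustration.
endpoints = blob_endpoints("examplestorage")
print(endpoints["primary"])
print(endpoints["secondary"])
```

A reporting job could be pointed at the secondary endpoint while production traffic keeps using the primary.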

In GRS the local data replication still occurs synchronously, but replication to the secondary region is asynchronous and on Azure’s schedule. Reading the same data from the primary and secondary regions can therefore return different results, and if a catastrophe strikes the primary region before replication completes, the most recent writes can be lost.
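The difference between the two replication styles can be sketched with a toy in-memory model. This is not how Azure implements replication; the class and method names are hypothetical, and the secondary region is collapsed into a single dictionary for brevity even though it also holds three copies.

```python
from collections import deque


class GeoReplicatedStore:
    """Toy model: synchronous local copies, asynchronous copy to a secondary."""

    def __init__(self):
        self.primary_replicas = [{}, {}, {}]  # three local copies
        self.secondary = {}                   # simplified secondary region
        self._pending = deque()               # writes not yet geo-replicated

    def write(self, key, value):
        # Synchronous part: every local replica is updated before returning.
        for replica in self.primary_replicas:
            replica[key] = value
        # Asynchronous part: the write is only queued for the secondary.
        self._pending.append((key, value))

    def replicate_to_secondary(self):
        # Stands in for the background job that runs on the provider's schedule.
        while self._pending:
            key, value = self._pending.popleft()
            self.secondary[key] = value

    def read_primary(self, key):
        return self.primary_replicas[0].get(key)

    def read_secondary(self, key):
        return self.secondary.get(key)

    def unreplicated_keys(self):
        # Writes that would be lost if the primary region vanished right now.
        return {key for key, _ in self._pending}


store = GeoReplicatedStore()
store.write("report.csv", "v1")
store.replicate_to_secondary()
store.write("report.csv", "v2")  # not yet copied to the secondary

print(store.read_primary("report.csv"))    # v2
print(store.read_secondary("report.csv"))  # v1
print(store.unreplicated_keys())           # {'report.csv'}
```

Until the background replication runs again, the secondary serves the older version, and the queued write is exactly the data at risk if the primary region is lost.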

Conclusion

The number of data copies isn’t the major consideration here. As long as there are multiple copies, the exact number is overshadowed by the other unknowns of the Azure architecture and physical layout. We assume racks are spaced to survive a single forklift mishap. We assume procurement and logistics ensure that the storage devices don’t all come from the same lot harboring the same catastrophic defect. There are many factors to consider in durability at the scale of a commercial cloud provider. An article by The Register discusses the lack of a standard across the industry for durability calculation.

Rather than fixating on the durability figures, understand what each method means and learn some of the provider-specific caveats (such as replication time). That will help you determine which method is right for your data based on criticality and policy. I hope this explanation has helped condense or clarify the available documentation on the subject, at least for those new to cloud.