
Commit 7c7b2f4

Merge pull request #1 from cloud-tinkerers/asutliff/ha-dr-article
asutliff/ha dr article
2 parents 511ed27 + 1c23bb1 commit 7c7b2f4

5 files changed: +96 −1 lines changed

content/articles/do-you-need-certifications.md

Lines changed: 1 addition & 1 deletion

@@ -89,4 +89,4 @@ Of course, there's nothing to stop you from doing the above and gaining a new ce

# The Takeaway

You may not feel like you got a solid answer from this article. But unfortunately this is a murky area. There are absolutely certifications out there that employers are looking for and will pay off for you in the long run. But there are also ones that may not benefit you in your situation, and the time/money spent on them would be better used elsewhere. Don't get sucked into believing that a new certificate will cure what ails ya. Remind yourself that certain companies make a lot of money from certifications, and they may be selling you a solution to a problem that they have marketed to you.
Lines changed: 95 additions & 0 deletions

@@ -0,0 +1,95 @@
---
title: "Azure Storage HA/DR, Private Networking, and Client Caveats"
date: 2025-12-26T14:00:00-05:00
language: en
featured_image: ../assets/images/posts/storage-ha-private-networking-and-client-caveats.png
summary: How I set up disaster recovery for Azure Data Lake across two unpaired regions using Terraform
description: How I set up disaster recovery for Azure Data Lake across two unpaired regions using Terraform
author: Andrew
authorimage: ../assets/images/global/icon.webp
categories: career
tags: Career, Cloud, Azure, Terraform, Infrastructure as Code, Disaster Recovery
---
**As a DevOps/Platform Engineer,** one of the questions stakeholders will ask at some point is, "What is our plan for disaster recovery?" As I approached a year in with this particular client, it was unsurprising to hear this question brought up again. At just two years into my own Cloud and DevOps journey, this was my trial-by-fire moment. The other Architect-level resource had just rolled off the project, and it was on me to design and implement full failover of the client's Azure big data analytics stack.
## The Stack

**A quick background** on the tooling we were using for this client. Data gets ingested into a single Azure Data Lake from multiple sources: live streaming data, logs, product shipment data, and so on. From there, a combination of automated and manual data translation jobs run in Azure Data Factory and Azure Databricks - Data Factory was originally used to orchestrate jobs in Databricks, but as Databricks rolled out new automation features it was slowly retired. These jobs record changes in "delta tables" and store those tables in a standard medallion architecture, in folders labeled as such - bronze, silver, and gold - within the data lake's storage containers. Each instance of this stack was deployed across three environments for the data engineers - dev, test, and prod - with a fourth environment dedicated solely to infrastructure development. We had a fifth environment as well, but it hosted data governance with Azure Purview and is irrelevant to this article.
## The Requirements

**Our team had no strict SLAs to meet,** since this project supported a smaller group of about 20 engineers across two teams. As long as the data remained accessible to those who needed to consume it, the ETL pipelines could go down for a bit and no one internally would be affected by an outage. Additionally, I was accommodating the client's increasingly tight budget constraints, and knew they would be more than happy with a "good-enough" solution rather than something that could withstand a nuclear war and keep ticking.

After a bit of back-and-forth, the client and I decided on a hot/warm approach with an active standby. The Azure Data Lake had already been set up with GRS (geo-redundant storage) and did not include a hot standby instance. The rest of the stack was deployed one-to-one in a secondary Azure region. A complete secondary standby may seem costly at first, but it satisfied the budget requirements as well. Most of the stack uses SaaS or PaaS products on Azure: the cost of Azure Databricks depends on DBU and compute consumption, Data Factory doesn't charge until a pipeline runs, and empty Key Vaults cost nothing. For the nominal cost of some pre-wired networking and bootstrapping secrets into some of those Vaults, we were able to keep costs at about $100 a month. These were reduced further when the client started using Data Factory's and Databricks' built-in ingestion mechanisms instead of running Self-Hosted Integration Runtimes (SHIRs).
## The Biggest Challenge

**In consulting, one of the biggest challenges** is adhering to unique customer requirements. Azure uses paired regions for the high availability of many of its resources. If one region goes down for some reason, the underlying data remains accessible from the second region, albeit with some degradation to the data management plane. The client's main Azure deployment region is East US. The paired region for East US is West US, but the client's backup Azure deployment region is North Central US.

This raised some complications. For starters, everything ran behind Azure Private Link private endpoints, which meant I couldn't rely on the mechanisms Azure uses for automatic failover. However, there were some saving graces: a resource in Azure can have multiple private endpoints attached to it, and an endpoint can live in a different region than the resource it attaches to. Step one complete - create a secondary endpoint in North Central US and attach it to the existing data lake.
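A minimal Terraform sketch of that cross-region endpoint might look like the following. All names, the resource group, and the subnet reference are hypothetical, not the client's actual configuration; the endpoint sits in North Central US while pointing at the East US storage account:

```hcl
# Secondary private endpoint in North Central US, attached to the
# existing data lake storage account deployed in East US. Private
# endpoints can live in a different region than their target resource.
resource "azurerm_private_endpoint" "datalake_secondary" {
  name                = "pe-datalake-northcentralus" # hypothetical name
  resource_group_name = azurerm_resource_group.secondary.name
  location            = "northcentralus"
  subnet_id           = azurerm_subnet.secondary_endpoints.id

  private_service_connection {
    name                           = "datalake-secondary"
    private_connection_resource_id = azurerm_storage_account.datalake[0].id # East US account
    subresource_names              = ["dfs"] # Data Lake Gen2 endpoint
    is_manual_connection           = false
  }
}
```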
But what about DNS? How do we route traffic away from the default paired region during a failover? And how does Azure Data Lake's DNS failover for GRS even work under the hood?
## A CNAME on a CNAME

**Azure Storage has a globally unique public FQDN** for each resource created on its public cloud; this is one of the reasons an Azure Storage account needs a globally unique name. The FQDN comes up as `<storage_account_name>.blob.core.windows.net`. However, this isn't the A record for the storage account, but rather a CNAME record that points to the host's A record, `blob.<regionally_specific_storage_stamp>.store.core.windows.net`. On top of this, adding Private Link inserts another CNAME between `<storage_account_name>.blob.core.windows.net` and `blob.<regionally_specific_storage_stamp>.store.core.windows.net`. The full chart looks something like this:

| **Seq** | **Name** | **Type** | **Record Value** |
| --- | --- | --- | --- |
| 1 | `<storage_account_name>.blob.core.windows.net` | CNAME | `<storage_account_name>.privatelink.blob.core.windows.net` |
| 2 | `<storage_account_name>.privatelink.blob.core.windows.net` | CNAME | `blob.<regionally_specific_storage_stamp>.store.core.windows.net` |
| 3 | `blob.<regionally_specific_storage_stamp>.store.core.windows.net` | HOST (A) | `<IP Address>` |

chart from: [dmauser - Private Link/Endpoint DNS Integration Resources](https://github.com/dmauser/PrivateLink/blob/master/DNS-Integration-Scenarios/README.md)
**When a failover happens,** the region-specific storage stamp in the A record is rewritten in the background to point at the new regional endpoint for the storage account. The critical part here is that this rewrite to the new region stamp happens after any CNAME rewrites. The good news: that meant I didn't have to worry about routing DNS queries from North Central US to West US, and I had free rein to design the DNS failover. The bad news: now I was on the hook for designing the DNS failover.
## Design

**I had two design options,** depending on which part of the infrastructure needed to be pre-configured: either use a single DNS zone for all the private endpoints, or link each private endpoint to a unique private DNS zone in its respective region.

| DNS Zone | Pros | Cons | Failover Steps |
| --- | --- | --- | --- |
| Global | Easier post-failover steps | Not all DNS can be pre-configured; more difficult to set up in Terraform | Update DNS record to point at the secondary endpoint |
| Multi-regional | Infrastructure can be pre-configured; faster/automatic failover for Azure-internal resources | More challenging failover steps for non-Azure traffic | External traffic: re-route to the secondary region with Traffic Manager or by updating routing tables. Internal traffic: dependent on the status of other secondary resources |
**Option 1 - Single Global Private DNS Zone**

With a single Private DNS zone, the client could continue to use their existing DNS forwarding methods. A second private endpoint is created in North Central US and linked back to the storage account in East US. The drawback of this setup is that the DNS record for the new private endpoint cannot be set up in advance, as it is connected to the same Private DNS zone and has the same FQDN as the existing endpoint.

![Global/ Common Private DNS Zone Design](../assets/images/posts/Cross\ Region\ Data\ Lake\ Failover - Global\ DNS.png)
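A sketch of the global variant in Terraform, assuming hypothetical variable names: one shared `privatelink` zone, linked to the VNets in both regions so either side resolves the same record set.

```hcl
# One global privatelink zone shared by both regions. Only one record
# can exist per FQDN, so the secondary endpoint's record is written
# (or updated) at failover time rather than pre-configured.
resource "azurerm_private_dns_zone" "blob" {
  name                = "privatelink.blob.core.windows.net"
  resource_group_name = var.dns_resource_group_name # hypothetical variable
}

# Linking the zone to both regions' VNets lets either region resolve
# whichever endpoint the single record currently points at.
resource "azurerm_private_dns_zone_virtual_network_link" "regions" {
  for_each              = { eastus = var.primary_vnet_id, northcentralus = var.secondary_vnet_id }
  name                  = "link-${each.key}"
  resource_group_name   = var.dns_resource_group_name
  private_dns_zone_name = azurerm_private_dns_zone.blob.name
  virtual_network_id    = each.value
}
```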
**Option 2 - Regional Private DNS Zones**

With option two, two DNS zones are set up and can route queries independently of each other. This has a few benefits: the infrastructure can be a true active/active or active/standby since everything is already set up, DNS records don't have to be updated during a failover, and standby resources in the secondary region are automatically routed through the second endpoint. However, this brings in a new challenge: external traffic now has a point of failure, since there is no global DNS resolution to redirect it at failover. Traffic needs to be redirected from East US to North Central US, either with Azure Traffic Manager pointing to an ingress point for automatic routing changes, or by updating routing tables for on-premises-to-Azure traffic.

![Regional Private DNS Zone Design](../assets/images/posts/Cross\ Region\ Data\ Lake\ Failover - Regional\ DNS.png)
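The regional variant can be sketched the same way, again with hypothetical variable names. Azure allows two Private DNS zones with the same name as long as they live in different resource groups, so each region keeps its own copy of the `privatelink` zone linked only to its local VNet:

```hcl
# One privatelink zone per region, each in its own resource group.
# Each zone holds its own record for the same FQDN, so both endpoints
# can be registered in advance and each region resolves locally.
resource "azurerm_private_dns_zone" "blob_regional" {
  for_each            = { eastus = var.primary_rg_name, northcentralus = var.secondary_rg_name }
  name                = "privatelink.blob.core.windows.net"
  resource_group_name = each.value
}

# Link each zone only to its own region's VNet.
resource "azurerm_private_dns_zone_virtual_network_link" "regional" {
  for_each              = { eastus = var.primary_vnet_id, northcentralus = var.secondary_vnet_id }
  name                  = "link-${each.key}"
  resource_group_name   = azurerm_private_dns_zone.blob_regional[each.key].resource_group_name
  private_dns_zone_name = azurerm_private_dns_zone.blob_regional[each.key].name
  virtual_network_id    = each.value
}
```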
I presented these two options to the client, expecting that they would take the single global DNS zone option. The existing DNS zone for this storage account already used that design, making the first option easier to slot in. Additionally, this option posed the lowest cost burden.

You can read more about HA/DR considerations for Azure Private Link in Adam Stewart's whitepaper here, or watch his accompanying YouTube video here.
---

## Terraform

**Let's talk about the code.** On top of the design challenges, the Terraform code had to be refactored so as not to redeploy - and lose any existing data in - the data lakes already deployed in the lower environments. The DNS configuration for Private Link also needed to change in a way that creates the new private endpoint without re-creating the DNS configuration.

Since we already had our environment names passed in through our deployment pipelines, I decided on a `count = lower(var.environment) != "disasterrecovery" ? 1 : 0` ternary expression to gate the re-deployment of the data lake. If I were to build an out-of-the-box module for this in the future, I would add a boolean feature-flag variable akin to `is_disaster_recovery` for more flexibility. I then added a `moved` block to account for this change. Frustratingly, this broke every single resource that depended on the data lake resource block. Due to how Terraform builds its dependency tree, it's easy to run into issues with computed fields that depend on other resource blocks. The workaround was not adding another `count` field, but adding a ternary condition based on the length of the `azurerm_storage_account.datalake` resource set created by the earlier `count` expression.
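That gating, together with the `moved` block, can be sketched like this (the storage account's other arguments are elided; only the `count` expression and resource address come from this article):

```hcl
# Deploy the data lake in every environment except disaster recovery,
# which attaches to the primary region's existing account instead.
resource "azurerm_storage_account" "datalake" {
  count = lower(var.environment) != "disasterrecovery" ? 1 : 0

  # ...existing storage account arguments...
}

# Adding count changes the resource address from a single instance to
# an indexed one; moving state avoids destroying and re-creating it.
moved {
  from = azurerm_storage_account.datalake
  to   = azurerm_storage_account.datalake[0]
}
```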
**Example:** a customer-managed encryption key that depends on the data lake.
```hcl
resource "azurerm_storage_account_customer_managed_key" "datalake" {
  count = length(azurerm_storage_account.datalake) > 0 ? 1 : 0

  storage_account_id = length(azurerm_storage_account.datalake) > 0 ? azurerm_storage_account.datalake[0].id : null

  key_vault_id = azurerm_key_vault.this.id
  key_name     = length(azurerm_storage_account.datalake) > 0 ? azurerm_storage_account.datalake[0].name : null
}
```
To wrap everything up, I added a dynamic block for the DNS configuration of the Blob and Data Lake DFS private endpoint connections, based on the same ternary condition as the data lake.
```hcl
resource "azurerm_private_endpoint" "datalake_endpoint" {
  name                = "example"
  resource_group_name = var.resource_group_name
  location            = var.location
  subnet_id           = var.subnet_id

  private_service_connection {
    # <removed for brevity>
  }

  dynamic "private_dns_zone_group" {
    for_each = lower(var.environment) != "disasterrecovery" ? [1] : []
    content {
      name                 = "default"
      private_dns_zone_ids = var.networking.private_dns_zone_ids
    }
  }
}
```
With these changes, I was able to add the disaster recovery environment to our deployment pipeline and enable the secondary region without making changes to the existing environments.
