Social Dolphin Services
SDS · Field notes

I didn't buy that a storm took down Azure. I was half right.

The weather didn't take down the cloud. A cascade did, and your own architecture decided how much it hurt.

Type: Field note
Date: 29 May 2026
Audience: Founders and operators

This morning a headline crossed my feed: a thunderstorm knocked out a Microsoft Azure region. My first reaction was skepticism. A hyperscale region is supposed to be the most over-engineered building most of us will ever depend on, and "a storm did it" sounded like the lazy version of the story.

So I read the status page instead of the headline. I was half right. The weather did not take down the cloud. A cascade did, and the storm was only the thing that started it.

The storm exposed two separate weaknesses at once: the limits of what that specific building was designed to ride through, and the architecture assumptions a lot of customers had quietly made, chiefly that this region would never have a day this bad. The second one is the part you can actually do something about.

What Microsoft actually said

Starting at 04:27 UTC on May 29, 2026, a severe thunderstorm caused widespread utility power loss across multiple West US 2 datacenter facilities at the same time. Not one building. Multiple, simultaneously.

The backup generators started as designed. Then the chain came apart. During the transition to sustained generator power, a subset of generators could not synchronize to the sudden facility load fast enough. Others started and then shut themselves down on thermal protection, because the cooling systems were degraded by the same power disruption. Partial generation, rising temperatures, hardware protecting itself by powering off. Microsoft's own words: the conditions "exceeded the resiliency designed for this particular failure scenario."

By the time it stabilized, Availability Zone 2 had recovered and was running normally. The residual pain was concentrated in Availability Zones 1 and 3, where a couple of storage stamps were still validating. Services caught in it included Azure SQL, the MySQL and PostgreSQL flexible servers, Functions, Kubernetes Service, Storage, Redis, Databricks, and Application Insights.

Read that carefully and the "redundancy failed" framing gets more honest, not less. The redundancy did not vanish. One zone rode it out. The simultaneous hit across multiple facilities was just bigger than the failure mode the region was built to absorb.

Why a storm can still do this

A hyperscale datacenter is not magic. It is a very complicated power plant with a lot of batteries, a lot of generators, and a lot of cooling, all of which have to hand off to each other cleanly under stress. The marketing about availability hides how messy those handoffs can be.

Two failure modes show up again and again, and both showed up here.

Power transitions are the danger zone

Most big incidents do not happen on steady utility power or steady generator power. They happen in the transfer between them, when load spikes and the synchronization gear has to behave perfectly at the worst possible moment. West US 2 in February of this year is the cleaner example: generators started, but a cascading failure in a control system blocked the automated transfer of load from utility to generator entirely. Same region, different link in the same chain.

Cooling is a hard dependency

Once cooling degrades while you are running on imperfect generator power, temperatures climb, thermal protections trip, and you lose compute and network even when gross power is technically present. That is exactly what took the generators offline in this event. It is the same shape as the January West US 2 incident, where cooling recovery lagged behind power, and the West Europe thermal event last November.

None of this is unique to Azure. At this scale, outages are almost never one breaker flipping. They are cascades: grid event, then a power-transition problem, then cooling, then hardware protecting itself, then a multi-service outage. "Five nines" describes the steady state. It does not describe the chain.
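The gap between the steady-state number and the chain is easy to put in minutes. A minimal sketch of the arithmetic follows. The function names are mine, and the four-hour outage is a hypothetical figure chosen for illustration, not the duration of this incident.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year that an availability target allows."""
    if not 0.0 < availability < 1.0:
        raise ValueError("availability must be strictly between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def years_of_budget_used(outage_minutes: float, availability: float) -> float:
    """How many years of downtime budget a single outage consumes."""
    return outage_minutes / downtime_budget_minutes(availability)


if __name__ == "__main__":
    for target in (0.999, 0.9999, 0.99999):
        print(f"{target} allows {downtime_budget_minutes(target):.2f} min/year")

    hypothetical_outage = 4 * 60  # assumed 4-hour outage, for illustration only
    used = years_of_budget_used(hypothetical_outage, 0.99999)
    print(f"A {hypothetical_outage}-minute outage uses {used:.1f} years of five-nines budget")
```

Five nines works out to roughly 5.26 minutes a year, so a single cascade lasting a few hours spends decades of that budget in one morning.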

The honest turn

So was it the storm, or something deeper? Both, and the split is the whole point.

The storm was the trigger. The root causes were the physical resiliency envelope of that region, which was not built to absorb this exact combination at once, and the architecture choices of the customers who got hurt worst. The ones who felt this hardest had concentrated critical workloads inside a single region with no cross-region failover, because their application quietly assumed that region would never have a morning like this.

Microsoft told customers, in the middle of it, to consider failing over to a paired region. The customers who could do that in minutes had a very different day than the ones discovering at 04:27 UTC that they never actually had that path.

The trade nobody says out loud

Single-region is not always a mistake. It is usually a cost decision that nobody wrote down as a decision.

Active-active across regions is expensive and annoying. You pay for duplicated capacity, you fight data-consistency problems, and you maintain a failover path you hope to never use. Plenty of workloads genuinely do not need it. An internal tool that can be down for an afternoon is fine on one region. A marketing site is fine. The honest move is not "always go multi-region." The honest move is to make it a decision instead of an accident.

That decision has two numbers attached. How much downtime can this service tolerate before it costs you something real, in dollars or trust? That is your RTO. How much data can you afford to lose if a region goes dark mid-write? That is your RPO. If you have never said those two numbers out loud for your important services, you have not chosen single-region. You have defaulted into it, and you will find out your real tolerance during the outage instead of before it.
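One way to turn those two numbers into a written decision is to record them per service and check them against what a single-region setup can actually deliver. This is a rough sketch, not a methodology. The `Tolerance` type, the `review_single_region` function, the service names, and every duration below are hypothetical, and it assumes the simplest case: no failover path, so downtime equals the region outage and data at risk equals the time since the last off-region backup.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tolerance:
    service: str
    rto: Optional[timedelta]  # longest tolerable downtime; None means never stated
    rpo: Optional[timedelta]  # largest tolerable window of lost data; None means never stated


def review_single_region(
    t: Tolerance,
    assumed_outage: timedelta,
    backup_interval: timedelta,
) -> str:
    """Classify a single-region service that has no cross-region failover.

    Assumption: with no failover, downtime equals the region outage, and
    data at risk equals the time since the last off-region backup.
    """
    if t.rto is None or t.rpo is None:
        return "defaulted"  # nobody ever said the numbers out loud
    if t.rto >= assumed_outage and t.rpo >= backup_interval:
        return "deliberate"  # single-region fits the stated tolerance
    return "needs-cross-region"  # stated tolerance is tighter than one region allows


# Hypothetical planning inputs: a 6-hour regional outage, daily off-region backups.
outage = timedelta(hours=6)
backups = timedelta(hours=24)

services = [
    Tolerance("internal-wiki", rto=timedelta(hours=8), rpo=timedelta(hours=24)),
    Tolerance("checkout", rto=timedelta(minutes=15), rpo=timedelta(minutes=1)),
    Tolerance("reporting", rto=None, rpo=None),
]

for s in services:
    print(f"{s.service}: {review_single_region(s, outage, backups)}")
```

The useful output is the "defaulted" row. It marks a service where single-region was never actually chosen.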

The negligent version is a revenue-critical system, no replication, no rehearsed failover, and a leadership team that believes "it's in the cloud, it's fine." The fine version is the same single-region setup on a workload where a few hours of downtime genuinely does not matter, chosen on purpose, with eyes open. Same architecture. Opposite levels of responsibility.

How we build the fleet

Short version, because this is not a pitch. We treat a region as a large but fallible failure domain, not a guarantee. For anything that matters, that means cross-region replication on the data, health-based routing that can drain traffic away from a degraded region, and a failover path we have actually run, not just written into a runbook. That last part is where most teams fall down. A DR plan you have never executed is a hypothesis. Game days are how you find the hidden dependency, the one stored credential, the one hardcoded endpoint, before a storm finds it for you.
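To make "health-based routing that can drain traffic" concrete, here is a minimal sketch of the decision logic only. It is not our production code and not any provider's API. The `RegionHealth` type, the `routing_weights` function, the region names, and the 0.95 probe threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class RegionHealth:
    name: str
    success_rate: float  # fraction of recent health probes that passed, 0.0 to 1.0
    base_weight: int     # relative traffic share when the region is healthy


def routing_weights(
    regions: Sequence[RegionHealth],
    min_success: float = 0.95,  # assumed threshold, tune per service
) -> Dict[str, float]:
    """Return the traffic fraction per region, draining regions that fail probes."""
    healthy = [r for r in regions if r.success_rate >= min_success]
    # Fail open: if every region looks unhealthy, keep the normal split
    # rather than sending traffic nowhere.
    pool = healthy if healthy else list(regions)
    total = sum(r.base_weight for r in pool)
    if total <= 0:
        raise ValueError("at least one routable region needs a positive base_weight")
    names_in_pool = {r.name for r in pool}
    return {
        r.name: (r.base_weight / total if r.name in names_in_pool else 0.0)
        for r in regions
    }


# Hypothetical snapshot: the primary region is degraded, the secondary is fine.
snapshot = [
    RegionHealth("primary", success_rate=0.40, base_weight=80),
    RegionHealth("secondary", success_rate=0.99, base_weight=20),
]
print(routing_weights(snapshot))
```

The fail-open branch is a design choice worth arguing about on a game day: it avoids a total blackout when the health probes themselves are the thing that broke, at the cost of still sending traffic to regions that may be down.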

What this article is not

  • Not a knock on Azure or Microsoft. Their incident write-up was specific and honest about where the design envelope was exceeded, which is more than a lot of providers offer. The same cascade could have hit any hyperscaler, and versions of it have.
  • Not a blanket "go multi-region or you are doing it wrong." Most of what runs in the world does not need active-active, and pretending otherwise just burns money.
  • Not a stack-specific playbook. The right RTO and RPO for your systems depend on your business, not on a blog post.

One-sentence takeaway

Treat every cloud region as a failure domain that will eventually have a bad morning, and decide on purpose how long you can tolerate losing it, before the storm decides it for you.

Talk to us

If you want the concrete version: send me your stack and your honest answer to "how long can we lose this for, and how much data can we lose," and I will tell you where a West US 2 morning would actually hurt you and what the cheapest meaningful fix looks like.

Not a sales call. A real read on your blast radius.

Sources