Date: 2020-09-29

“Has anyone started having discussions with their CIO/CEO about moving back to an in-house mail server? I advocate for it”

Given the scale of its user base and with a contract worth up to $10 billion in the bag to run the back-end of a superpower’s military, Microsoft might want to start thinking about how it can establish a staging procedure for its Azure cloud that allows it to deploy changes and reliably roll back those changes when things break.

(We know, it is easy to say so from a safe distance…)

Redmond was at it again late Monday, knocking an (apparently substantial) “subset of customers in the Azure Public and Azure Government clouds” offline for three hours. Swathes of users globally encountered errors when performing authentication operations, and multiple services were affected, including Microsoft 365.

The company blamed the issue on a “recent configuration change [that] impacted a backend storage layer, which caused latency to authentication requests.” (Read: users couldn’t log in to Teams, Azure and more for hours because of the snafu.)

A full root cause analysis is pending. (We will update this piece when we see it.) Users felt the disruption from 22:25 BST on September 28, 2020 to 01:23 BST the following morning.

The issue comes a fortnight after a protracted outage in Microsoft’s UK South region triggered by a cooling system failure in a data centre. With temperatures rising, automated systems shut down all network, compute, and storage resources “to protect data durability” as engineers rushed to take manual control.

Earlier this month, meanwhile, Gartner said it “continues to have concerns related to the overall architecture and implementation of Azure, despite resilience-focused engineering efforts and improved service availability metrics during the past year”.

Microsoft Azure CTO Mark Russinovich in July 2019 said that Azure had formed a new Quality Engineering team within his CTO office, working alongside Microsoft’s Site Reliability Engineering (SRE) team to “pioneer new approaches to deliver an even more reliable platform” following customer concern at a string of outages.

He wrote at the time: “Outages and other service incidents are a challenge for all public cloud providers, and we continue to improve our understanding of the complex ways in which factors such as operational processes, architectural designs, hardware issues, software flaws, and human factors can align to cause service incidents.”

“Has anyone started having discussions with their CIO/CEO about moving back to an in-house mail server? I advocate for it,” one frustrated user noted on a global Outages mailing list. If cloud is your compressed audio stream that you’re not sure you own, it may not be long before in-house mail servers become the vintage-quality vinyl of the IT world: old, but very much back in demand.

Stranger things have happened.