Critics question Microsoft’s resilience after cooling issue at San Antonio data centre causes global headache
Microsoft was scrambling to get a number of cloud-based services up and running this morning, after a lightning strike near one of its San Antonio data centres caused a voltage surge. The resulting cooling equipment failure forced a power-down, triggering an Azure outage locally and issues with Office 365 globally.
The outage even took the Azure service status monitor offline.
The company said in an Azure status update: “Engineers are prioritizing the restoration of storage resources in order to recover all services with dependencies on these impacted resources. As storage mitigation continues to progress, a necessary extended mitigation phase is required.”
Mitigation efforts continue. Preliminary root cause details provided. Engineers are seeing signs of recovery for some services. Please refer to your Portal – https://t.co/66mR6nPbwY Status Page – https://t.co/Dw19fIGsXf and/or Twitter for updates. pic.twitter.com/dZQp4RxnOK
— Azure Support (@AzureSupport) September 4, 2018
The company added that while power had been restored and software load balancers for Azure storage scale units had recovered, work to recover a number of services was still ongoing. The outage also appears to have affected Office 365 users globally: the enterprise software is not based on Azure, but does use Azure Active Directory authentication services.
Azure Outage: Enough Being Done on Resilience?
Pete Banham, cyber resilience expert at Mimecast, said: “Today’s incident at Azure was another clear reminder for the need for organisations to build in their own redundancy rather than rely on a single vendor. All organisations, including Microsoft, need to consider what downstream effects there may be from losing a critical service due to technical failure or human error.
“Should employees around the world using Office 365 be reliant on a single Azure DC in the US? Services will always fail and IT leaders need to ensure they have not outsourced responsibility to a lone cloud service.”
Our recovery efforts remain underway; however, most Office 365 services are now restored. We're working to increase resiliency to ensure that previously affected services remain stable. Further details can be found under MO147606 in the admin center or on https://t.co/AEUj8uAGXl.
— Office 365 Status (@Office365Status) September 4, 2018
Speaking to Computer Business Review, Banham added: “It’s an open question how one data centre outage can cause so much disruption… clearly companies like Microsoft want to benefit from economies of scale, but from a mission-critical side of things, if users can’t even check support, that’s a challenge.”
Microsoft has been contacted for comment.