Earlier this month, Microsoft suffered a significant outage at one of its data centers. A problem in the company’s South Central US center took Visual Studio Team Services, Azure Active Directory, and Azure Bot Service offline. The company gave a brief summary of the cause at the time and has now published a more detailed explanation.
Microsoft’s initial statement read:
“We continue to investigate issues in our services impacting all regions. The underlying issue seems to be a datacenter cooling outage in South Central US which is impacting all services in a specific part of the datacenter.”
In a new Azure blog post, the company explains what happened, as part of its commitment to transparency with customers. According to the post, the South Central US datacenter experienced voltage swells and sags, which caused it to switch itself over to generator power.
In other words, the datacenter took itself off the utility grid and moved to backup power. The power swells then shut down the mechanical cooling systems, as Microsoft alluded to in its initial response. With cooling offline, the datacenter began an automatic shutdown intended to keep temperatures from reaching unsafe levels.
Although the shutdown is a failsafe, it did not work as intended:
“This shutdown mechanism is intended to preserve infrastructure and data integrity, but in this instance, temperatures increased so quickly in parts of the datacenter that some hardware was damaged before it could shut down. A significant number of storage servers were damaged, as well as a small number of network devices and power units.”
Containment and Recovery
Microsoft says it focused on containing the problem to prevent it from spreading. In its first response, the company said it would keep the services down until it could carry out a proper recovery. That’s because some server components needed to be replaced and customer data needed to be migrated:
“The decision was made to work towards recovery of data and not fail over to another datacenter, since a fail over would have resulted in limited data loss due to the asynchronous nature of geo replication.”
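To illustrate why asynchronous geo replication can lose data on a failover, here is a minimal Python sketch. It is not Microsoft's implementation: the `Region` and `AsyncGeoReplicator` classes and their methods are hypothetical names invented for this example. The idea is simply that writes are acknowledged by the primary before they are copied to the secondary, so anything still waiting to be copied is left behind if the primary is abandoned.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """Hypothetical stand-in for a datacenter holding an ordered write log."""
    name: str
    log: list = field(default_factory=list)


@dataclass
class AsyncGeoReplicator:
    """Toy model of asynchronous replication (illustrative assumption only)."""
    primary: Region
    secondary: Region
    replicated_upto: int = 0  # count of primary writes already copied

    def write(self, record):
        # The write is acknowledged once the primary stores it;
        # the secondary is NOT updated at this point.
        self.primary.log.append(record)

    def replicate(self, max_records):
        # Background step: copy up to max_records pending writes.
        end = min(self.replicated_upto + max_records, len(self.primary.log))
        self.secondary.log.extend(self.primary.log[self.replicated_upto:end])
        self.replicated_upto = end

    def fail_over(self):
        # Abandoning the primary now: return the acknowledged writes
        # that never reached the secondary and would be lost.
        return self.primary.log[self.replicated_upto:]


primary = Region("primary")
secondary = Region("secondary")
replicator = AsyncGeoReplicator(primary, secondary)

for i in range(5):
    replicator.write(f"write-{i}")

replicator.replicate(max_records=3)  # replication lags behind the writes
lost = replicator.fail_over()
print(lost)  # ['write-3', 'write-4']
```

In this toy model, the two most recent writes exist only on the primary. Recovering the primary's storage preserves them, while failing over to the secondary would not, which matches the trade-off Microsoft describes.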