In a recent incident that lasted over 10 hours, Microsoft Azure DevOps experienced a significant outage in the South Brazil Region. The cause? A simple typo in the code led to the deletion of 17 production databases.
Wrong Input Causes Azure Outage
The outage was first noticed at 12:10 UTC on May 24 and was resolved by 22:31 UTC the same day. The root cause was a typo introduced into the snapshot deletion job during a code base upgrade.
Eric Mattingly, Microsoft's principal software engineering manager, explains in a post-mortem article that the typo bug resulted in the deletion of the Azure SQL Server instead of the individual Azure SQL Database. “When the job deleted the Azure SQL Server, it also deleted all seventeen production databases for the scale unit”, Mattingly confirmed.
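Microsoft has not published the code in question, but the failure mode is easy to illustrate. The following is a hypothetical sketch, not Microsoft's actual implementation: all class and function names are invented, and the toy model simply shows how a one-token mistake in a cleanup job can escalate a single database delete into a server-wide one.

```python
# Hypothetical sketch (invented names, not Microsoft's code): a toy model
# in which deleting a server takes every database hosted on it along.

class Database:
    def __init__(self, name, server):
        self.name = name
        self.server = server

class SqlServer:
    """Toy model of a server that hosts many databases."""
    def __init__(self, name):
        self.name = name
        self.databases = {}

    def create_database(self, name):
        db = Database(name, self)
        self.databases[name] = db
        return db

def delete_snapshot(db, audit_log):
    # Intended behavior: remove only the stale snapshot database.
    db.server.databases.pop(db.name, None)
    audit_log.append(f"database:{db.name}")

def delete_snapshot_buggy(db, audit_log):
    # The bug: the call targets the parent *server* rather than the
    # database, so every production database on that server goes too.
    audit_log.append(f"server:{db.server.name}")
    db.server.databases.clear()

server = SqlServer("scale-unit-1")
production = [server.create_database(f"prod-{i}") for i in range(17)]
snapshot = server.create_database("snapshot-old")

audit_log = []
delete_snapshot_buggy(snapshot, audit_log)
print(len(server.databases))  # 0 -- all 18 databases gone, not just the snapshot
```

The point of the sketch is how small the diff is: the correct and buggy functions differ only in which object the delete targets, which is exactly the kind of error a type checker or a targeted test can catch.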
Recovery Took Over 10 Hours
Despite the severity of the incident, no customer data was lost. The recovery, however, took over 10 hours, which Microsoft attributed to several factors. Firstly, customers cannot restore Azure SQL Servers themselves, so the Azure SQL team had to be engaged, which took approximately one hour. Secondly, backup redundancy complications and a “complex set of issues with [Microsoft's] web servers” added to the recovery time.
Lessons Learned and Steps Forward
In the wake of the incident, Microsoft has taken several steps to prevent such an occurrence in the future. “We have already fixed the bug in the snapshot deletion job”, Mattingly stated in the official post-mortem.
Additionally, Microsoft has created a new test for the snapshot deletion job, which fully exercises the snapshot database delete scenario against real Azure resources. The company is also adding Azure Resource Manager Locks to key resources to prevent accidental deletion and ensuring that all future snapshot databases are created on different Azure SQL Server instances from their production databases.
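The Resource Manager lock mitigation boils down to a simple idea: a delete operation must refuse to act on any resource that carries a lock, forcing an explicit, deliberate unlock step first. Here is a hedged sketch of that idea in miniature; the names are invented for illustration and this is not the Azure implementation.

```python
# Hypothetical sketch of the "CanNotDelete lock" idea: a delete helper
# that refuses to remove any resource carrying a lock. All names here
# are invented for illustration.

class LockedResourceError(Exception):
    pass

class Resource:
    def __init__(self, name, kind):
        self.name = name
        self.kind = kind          # e.g. "server" or "database"
        self.locks = set()

def add_lock(resource, lock_name="CanNotDelete"):
    resource.locks.add(lock_name)

def safe_delete(resource, inventory):
    # Refuse to delete anything that carries a lock; a human must
    # remove the lock first, which is a deliberate extra step.
    if resource.locks:
        raise LockedResourceError(
            f"{resource.kind} '{resource.name}' is locked: {sorted(resource.locks)}"
        )
    inventory.remove(resource)

inventory = [Resource("scale-unit-1", "server"),
             Resource("snapshot-old", "database")]
add_lock(inventory[0])                  # lock the production server

safe_delete(inventory[1], inventory)    # snapshot delete succeeds
try:
    safe_delete(inventory[0], inventory)
except LockedResourceError as exc:
    print(exc)                          # the locked server survives
```

In the real service, such locks live in the control plane rather than in the job's own code, so even a buggy cleanup job hits the guard; the sketch only shows the contract, not where it is enforced.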
The outage had significant consequences for Azure customers in the South Brazil region, leaving them without access to some services for several hours. This disruption not only affected their day-to-day operations but also highlighted the vulnerability of relying solely on a single service provider. Businesses were forced to halt their activities, which likely resulted in financial losses and operational delays. Furthermore, the incident may have eroded trust in Microsoft's services, as customers expect reliable, uninterrupted access to cloud services. “We understand how impactful Azure DevOps outages can be, and sincerely apologize to all the impacted customers”, Mattingly wrote in the post-mortem article.
This incident underscores the importance of contingency plans, such as backup systems or alternative service providers, that reduce reliance on a single provider for cloud storage and other off-prem infrastructure and keep operations running during outages like this one.