On November 19, thousands of Office 365 and Azure users were unable to sign in to their accounts. The issue, Microsoft quickly clarified, was with a recent update to its multi-factor authentication (MFA) service.
Now the company has given more detail about the issue and how it plans to avoid a repeat. It appears the outage resulted from three compounding issues that took the service down without triggering Microsoft's alerts.
The problems began with latency in the MFA frontend's communication with its cache servers. This occurred only during periods of high usage but often triggered the second root cause:
“The second root cause is a race condition in processing responses from the MFA backend server that led to recycles of the MFA frontend server processes which can trigger additional latency and the third root cause (below) on the MFA backend,” explained Microsoft.
Unfortunately, the second issue triggered the third root cause. A previously undiscovered issue in the MFA backend meant it was unable to process requests from the frontend. The nature of the issue meant that everything looked normal on Microsoft's monitoring.
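To see how a service can look healthy while serving nothing, consider the rough Python sketch below. It is purely illustrative and not Microsoft's actual monitoring: the `Backend` class, `liveness_check`, and `synthetic_probe` are hypothetical names invented for this example. It assumes a backend whose process is still running but which never answers requests, and compares a check that only asks whether the process is up with a probe that sends a real request end to end.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backend:
    """Hypothetical stand-in for an MFA backend (not Microsoft's real service)."""
    process_alive: bool = True
    stuck: bool = False  # assumption: alive, but unable to process requests

    def handle(self, request: str) -> Optional[str]:
        # A stuck backend accepts the request but never produces a response.
        return None if self.stuck else f"ok:{request}"


def liveness_check(backend: Backend) -> bool:
    # Only asks "is the process up?", so a stuck backend still passes.
    return backend.process_alive


def synthetic_probe(backend: Backend) -> bool:
    # Sends a test request end to end and requires an actual response.
    return backend.handle("probe") is not None


healthy = Backend()
stuck = Backend(stuck=True)

for name, backend in (("healthy", healthy), ("stuck", stuck)):
    print(name, liveness_check(backend), synthetic_probe(backend))
```

In this sketch the stuck backend passes the liveness check and fails only the end-to-end probe, which is the general kind of blind spot that can leave a dashboard looking normal during an outage.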
As you'd expect, the company has now fixed the issues, but it says it wasn't easy. Apparently, the overlap between the various issues made it very difficult to determine the root cause and also caused gaps in the team's telemetry.
To avoid such issues in the future, Microsoft has promised to update its deployment procedures and testing cycles. It will also review its monitoring services to reduce detection time, and its containment process to stop an issue from spreading to other data centers.
Hopefully, this will let the company avoid any major issues in the future. Should one happen, though, it has vowed to tell customers more quickly. Most of the above changes should be in place in December, with the containment changes going live in January 2019.