Microsoft has today explained the cause of last week's Multifactor Authentication (MFA) outage. Redmond points to “severe packet loss” on a network route between Microsoft and the Apple Push Notification Service (APNS).

On October 18, Azure and Office 365 users in North America reported problems with Multifactor Authentication and sign-ins. Users with MFA enabled could not sign in properly to those services for a three-hour period.

Microsoft confirmed the problem and said that 0.51 percent of users in the region were affected.

In an explanatory post, Microsoft says it quickly created a hotfix that bypassed the external service entirely. This restored MFA functionality while the network recovered. Once it had, Microsoft engineers rolled back the hotfix.
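Microsoft has not published how the hotfix worked. As a rough illustration only, a bypass of this kind can be thought of as an ordered fallback: try the usual delivery route first, and move to an alternative when it fails. The sketch below assumes that model; every name in it (`DeliveryError`, `send_challenge`, `push_via_external_service`, `direct_fallback`) is hypothetical and not part of any Microsoft or Apple API.

```python
from typing import Callable, Sequence


class DeliveryError(Exception):
    """Raised when a channel cannot deliver an MFA challenge."""


def send_challenge(user: str, channels: Sequence[Callable[[str], str]]) -> str:
    """Try each channel in order and return the first successful result."""
    errors = []
    for channel in channels:
        try:
            return channel(user)
        except DeliveryError as exc:
            # Remember why this channel failed, then try the next one.
            errors.append(str(exc))
    raise DeliveryError("all channels failed: " + "; ".join(errors))


def push_via_external_service(user: str) -> str:
    # Hypothetical stand-in for a push route suffering packet loss.
    raise DeliveryError("push route unavailable")


def direct_fallback(user: str) -> str:
    # Hypothetical stand-in for a route that skips the external service.
    return f"challenge sent to {user} via fallback"


result = send_challenge("alice", [push_via_external_service, direct_fallback])
```

Removing the failing channel from the list altogether would correspond to a full bypass, and restoring it would correspond to rolling the hotfix back.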

Microsoft apologized for the issue and the problems it caused customers. The company says it will make improvements to Azure to reduce the risk of similar incidents.

What’s Next

Microsoft says it is taking the following steps to prevent the problem from recurring in Azure.

“In-progress fine-grained fault domain isolation work has been accelerated. This work builds on the previous fault domain isolation work which limited this incident to North American tenants. This includes:  

– Additional physical partitioning within each Azure region.
– Logical partitioning between authentication types.
– Improved partitioning between service tiers.

Additional hardening and redundancy within each granular fault domain to make them more resilient to network connectivity loss. This includes:

– Improved resilience to request build-up.
– Optimizing network traffic to decrease load on network links.
– Improved instructions to users for self-service in case notifications are not delivered.
– Service restructuring to decrease service impact of network packet loss.

Enhanced monitoring for networking latency and various resource usage thresholds. This includes:

– Multi-region and multi-cloud targeted monitoring for the specific type of packet loss encountered.
– Improved monitors for additional types of resource usage.”
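Microsoft does not say how it will improve “resilience to request build-up”. One common approach, shown here only as a sketch, is to cap the number of pending requests and reject new ones once the cap is reached, so a stalled downstream link cannot cause an unbounded backlog. The class name `BoundedRequestQueue` and its methods are hypothetical and are not taken from Microsoft's post.

```python
from collections import deque


class BoundedRequestQueue:
    """Holds at most max_pending requests and rejects the rest."""

    def __init__(self, max_pending: int):
        if max_pending <= 0:
            raise ValueError("max_pending must be positive")
        self._pending = deque()
        self._max_pending = max_pending
        self.rejected = 0  # count of requests turned away

    def offer(self, request) -> bool:
        """Queue the request, or reject it if the queue is already full."""
        if len(self._pending) >= self._max_pending:
            self.rejected += 1
            return False
        self._pending.append(request)
        return True

    def take(self):
        """Return the oldest pending request, or None if the queue is empty."""
        return self._pending.popleft() if self._pending else None


queue = BoundedRequestQueue(max_pending=2)
accepted = [queue.offer(i) for i in range(4)]
```

Rejecting early lets callers fail fast or retry later instead of waiting on requests that may never complete.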