GitHub has confirmed that it experienced degraded performance for a second consecutive day, resulting from an update that significantly disrupted the platform's operations. The Microsoft-owned code collaboration service reported issues primarily affecting Pull Requests, with users experiencing delays of up to ten minutes. The delays made it hard for team members to promptly see commits that had been pushed to branches. GitHub first acknowledged the issue at 23:39 UTC on March 12. It then announced that it had identified a mitigation, and at 00:34 UTC the next day it declared the incident resolved, without providing a detailed explanation at that time.
Root Cause and Impact
On the previous day, March 11, GitHub suffered an outage that began at 22:45 UTC and lasted until 00:48 UTC the following day. The outage affected several services, including Secret Scanning and 2FA via GitHub Mobile, which saw error rates soar to 100 percent before stabilizing at around 30 percent. Copilot was also affected, with error rates reaching 17 percent, and API error rates rose to one percent.
We're seeing an elevated number of pull requests that are out of sync on page load. https://t.co/BreARYBWUj
— GitHub Status (@githubstatus) March 12, 2024
According to GitHub's Status History page, the issues were triggered by a deployment of network-related configuration that was mistakenly applied to the wrong environment. A rollback was attempted within four minutes of the error being detected, but it failed in one datacenter because a previous, unrelated issue had corrupted the configuration service's datastore. Correcting this required manual intervention, and the full rollback finally restored service by 00:48 UTC.
Future Preventative Measures
In response to these incidents, GitHub has committed to implementing several protective measures to avert similar disruptions. These include strengthening the safety checks on configuration changes, improving subsystem monitoring so that problems are detected sooner, and making its configuration system more resilient. GitHub also plans to prevent corrupted records and clean them up automatically, so that the system can recover from similar data issues without manual intervention.
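GitHub has not published how these safeguards will be built. As a rough illustration only, the sketch below shows the two ideas in their simplest form: refusing a configuration change aimed at an environment it was not approved for, and separating corrupted records from valid ones so they can be set aside automatically. Every name here (`ConfigChange`, `apply_change`, `split_corrupt_records`, the `approved_env` field, and the record layout) is hypothetical and is not drawn from GitHub's systems.

```python
from dataclasses import dataclass


class WrongEnvironmentError(Exception):
    """Raised when a change targets an environment it was not approved for."""


@dataclass(frozen=True)
class ConfigChange:
    name: str
    approved_env: str  # hypothetical field: environment the change was reviewed for


def apply_change(change: ConfigChange, target_env: str, store: dict) -> None:
    """Record the change in `store` only if the target matches the approved environment."""
    if change.approved_env != target_env:
        raise WrongEnvironmentError(
            f"{change.name!r} is approved for {change.approved_env!r}, "
            f"not {target_env!r}"
        )
    store.setdefault(target_env, []).append(change.name)


def is_valid_record(record: object) -> bool:
    """Assumed record shape: a dict with string 'key' and 'value' entries."""
    return (
        isinstance(record, dict)
        and isinstance(record.get("key"), str)
        and isinstance(record.get("value"), str)
    )


def split_corrupt_records(records: list) -> tuple:
    """Return (clean, quarantined) so corrupt records never block a rollback."""
    clean = [r for r in records if is_valid_record(r)]
    quarantined = [r for r in records if not is_valid_record(r)]
    return clean, quarantined


if __name__ == "__main__":
    store = {}
    change = ConfigChange("network-routing-update", approved_env="staging")
    apply_change(change, "staging", store)  # accepted
    try:
        apply_change(change, "production", store)  # rejected by the guard
    except WrongEnvironmentError as err:
        print("blocked:", err)
```

In this toy version the guard fails closed: a mismatched target raises an error before anything is written, which is the behaviour a real deployment pipeline would presumably want for a change like the one described in the incident report.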
GitHub's efforts to understand and rectify the reasons behind the deployment failures highlight the challenges in managing complex distributed systems. As developers and companies worldwide rely on GitHub for code collaboration and version control, the resolution of these issues and the prevention of future incidents remain a top priority for GitHub and Microsoft.