
GitHub Works to Restore Services After Prolonged Performance Degradation

GitHub suffered degraded performance for a second consecutive day following a faulty update. Pull Requests were delayed by up to 10 minutes. A mitigation has been deployed, but the root cause has not yet been made public.


GitHub has confirmed that it experienced degraded performance for a second consecutive day, caused by an update that significantly disrupted the platform's operations. The Microsoft-owned code collaboration service reported issues primarily affecting Pull Requests, with users experiencing delays of up to ten minutes. The delays made it difficult for team members to promptly see commits that had been pushed to branches. GitHub first acknowledged the issue at 23:39 UTC on March 12. It then announced that it had identified a mitigation, and by 00:34 UTC the next day it declared the incident resolved, without providing a detailed explanation at that time.

Root Cause and Impact

On the previous day, March 11, GitHub suffered an outage that began at 22:45 UTC and lasted until 00:48 UTC the following day. The outage affected several services, including Secret Scanning and 2FA via GitHub Mobile, which saw error rates climb to 100 percent before stabilizing at around 30 percent. Copilot was also affected, with error rates reaching 17 percent, and API error rates rose to one percent.

According to GitHub's Status History page, the issues were triggered when a deployment of network-related configuration was mistakenly applied to the wrong environment. A rollback was attempted within four minutes of detecting the error, but it failed in one datacenter because of an unrelated earlier issue that had contaminated the configuration service's datastore. Correcting this required manual intervention, and the full rollback finally restored service by 00:48 UTC.
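To make this failure mode concrete, here is a minimal Python sketch of an environment check run before a configuration change is applied. It is not GitHub's tooling, and nothing in the status report describes how its deployment system works. The names `ConfigChange`, `apply_change`, and `EnvironmentMismatchError`, along with the environment labels, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


class EnvironmentMismatchError(Exception):
    """Raised when a change is applied to an environment it does not target."""


@dataclass(frozen=True)
class ConfigChange:
    # Hypothetical fields, for illustration only.
    name: str
    target_env: str
    settings: dict


def apply_change(change: ConfigChange, current_env: str, live_config: dict) -> dict:
    """Return a new config with the change applied, refusing mismatched environments."""
    if change.target_env != current_env:
        raise EnvironmentMismatchError(
            f"{change.name} targets {change.target_env!r}, not {current_env!r}"
        )
    # Build a new dict so the previous config stays untouched for rollback.
    return {**live_config, **change.settings}
```

In this sketch the old configuration is never mutated, so a rollback amounts to switching back to the previous dictionary. A real system would also have to handle the case the report describes, where the rollback path itself depends on a datastore that may be in a bad state.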

Future Preventative Measures

In response to these incidents, GitHub has committed to several protective measures to avert similar disruptions. These include strengthening the safety protocols for configuration changes, improving subsystem monitoring so that problems are detected sooner, and making its configuration system more resilient. GitHub also plans to prevent corrupted records and clean them up automatically, so that the system can recover from similar data issues without manual intervention.
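As a rough illustration of what automatically setting aside corrupted records could look like, the following Python sketch separates readable records from unreadable ones instead of failing on the first bad entry. It assumes records are stored as JSON strings with `key` and `value` fields. That format and the function name `load_records` are assumptions made for this example, not details from GitHub's report.

```python
import json


def load_records(raw_records):
    """Split raw JSON strings into parsed records and quarantined corrupt entries."""
    valid, quarantined = [], []
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            # Unparseable data is set aside rather than aborting the whole load.
            quarantined.append(raw)
            continue
        # Hypothetical schema: a record must be an object with "key" and "value".
        if isinstance(record, dict) and {"key", "value"} <= record.keys():
            valid.append(record)
        else:
            quarantined.append(raw)
    return valid, quarantined
```

Quarantining rather than deleting keeps the bad entries available for later inspection while letting the rest of the load proceed.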

GitHub's efforts to understand and rectify the reasons behind the deployment failures highlight the challenges in managing complex distributed systems. As developers and companies worldwide rely on GitHub for code collaboration and version control, the resolution of these issues and the prevention of future incidents remain a top priority for GitHub and Microsoft.

Luke Jones
Luke has been writing about Microsoft and the wider tech industry for over 10 years. With a degree in creative and professional writing, Luke looks for the interesting spin when covering AI, Windows, Xbox, and more.
