CrowdStrike has hired two outside security firms to review its Falcon software after a global IT outage. In a root cause analysis [PDF] published this week, the company identified a small coding mistake as the culprit.
The crisis, which involved both Microsoft and CrowdStrike, began on July 19, when a faulty Falcon security update crashed what was then estimated to be 8.5 million Windows PCs. Microsoft has since addressed the problem with an automated fix, while CrowdStrike issued its own patch. An insurer estimated that Fortune 500 firms suffered collective losses of $5.4 billion. CrowdStrike’s CEO apologized for the incident.
The CrowdStrike report outlines the sequence of events. In February, a new detection feature was added to Falcon to block Windows interprocess communication (IPC) attacks. The feature went through the usual development and testing before being included in Falcon sensor version 7.11.
The Timeline of the Outage
By March, Falcon was being remotely updated to detect new threats using IPC templates, which are stored in a file known as Channel File 291. Falcon sensors fetch this file, and the sensor’s Content Interpreter processes its contents for detection purposes.
The problem was traced to the IPC Template Type, which defined 21 input fields, while the sensor’s integration code supplied only 20 values. The mismatch went undetected through several validation stages, including sensor release tests.
The July 19th Incident
On July 19, two new IPC templates were deployed, one of which required the 21st input. Because the Content Interpreter had been given only 20 values, it read past the end of its input array, and the resulting out-of-bounds memory read crashed roughly 8.5 million Windows machines.
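The failure described above can be illustrated with a short sketch. This is not CrowdStrike’s code: the function name, the wildcard marker and the sample values are all hypothetical. It also assumes that earlier templates ignored the 21st field, which would explain why the mismatch stayed dormant for months. Python raises an IndexError where a kernel driver would read invalid memory, so the sketch only mirrors the logic of the bug, not its consequences.

```python
WILDCARD = "*"  # hypothetical marker meaning "match anything, skip this field"

DECLARED_FIELDS = 21  # input fields defined by the IPC Template Type
SUPPLIED_VALUES = 20  # values the sensor's integration code actually provided


def matches(criteria, inputs):
    """Compare each non-wildcard criterion with the input at the same index."""
    for index, expected in enumerate(criteria):
        if expected == WILDCARD:
            continue  # wildcard criteria never touch the input array
        if inputs[index] != expected:  # unchecked access: no bounds test
            return False
    return True


inputs = [f"value{i}" for i in range(SUPPLIED_VALUES)]

# Assumed earlier behavior: field 21 is a wildcard, so index 20 is never read.
old_template = [WILDCARD] * DECLARED_FIELDS
old_result = matches(old_template, inputs)

# A template with a concrete criterion for field 21 forces a read at index 20.
new_template = [WILDCARD] * 20 + ["example-criterion"]
try:
    matches(new_template, inputs)
    new_result = "ok"
except IndexError:
    new_result = "out-of-bounds"
```

In this sketch the older template matches without incident, while the newer one fails as soon as the 21st field is compared.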
CrowdStrike has applied fixes to prevent input mismatches. A patch to the Sensor Content Compiler, implemented on July 27, verifies the number of inputs. Runtime checks on input arrays are now in place as well, and similar fixes are being rolled out to all Windows sensor versions 7.11 and later, with a hotfix planned for release by August 9.
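The two kinds of safeguard mentioned here, a count check before a template ships and a bounds check when it is evaluated, might look roughly like the following. This is a minimal sketch under stated assumptions, not the actual fix: `compile_template`, `safe_matches`, `TemplateValidationError` and the wildcard marker are hypothetical names chosen for illustration.

```python
WILDCARD = "*"  # hypothetical marker meaning "match anything, skip this field"


class TemplateValidationError(Exception):
    """Raised when a template does not fit the inputs the sensor supplies."""


def compile_template(criteria, supplied_values):
    """Build-time check: reject a template whose field count is wrong."""
    if len(criteria) != supplied_values:
        raise TemplateValidationError(
            f"template has {len(criteria)} fields, "
            f"but the sensor supplies {supplied_values} values"
        )
    return tuple(criteria)


def safe_matches(criteria, inputs):
    """Runtime check: never index past the end of the input array."""
    for index, expected in enumerate(criteria):
        if expected == WILDCARD:
            continue
        if index >= len(inputs):
            return False  # refuse to read out of bounds; treat as no match
        if inputs[index] != expected:
            return False
    return True
```

The first function stops a 21-field template from shipping when only 20 values exist. The second acts as a backstop, so that a bad template that slips through produces a failed match rather than a crash.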
Legal and Financial Fallout
The outage has led to lawsuits from shareholders and customers. CrowdStrike is also in a dispute with Delta Air Lines over how long the airline took to recover. Microsoft has weighed in as well, suggesting that Delta’s delays involved non-Microsoft systems.
CrowdStrike aims to enhance validation processes to avoid future issues. Steps include ensuring input fields match at compile time, adding missing runtime checks for Content Interpreter fields, and expanding the scope of Template Type testing. Future changes will involve staggering template instance deployments and relocating kernel driver functions to user space.
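Staggered deployment is commonly done by sorting machines into rings and releasing an update to one ring at a time. The sketch below shows one way this could work. CrowdStrike has not published its mechanism, so the ring names, percentages and hashing scheme here are assumptions made purely for illustration.

```python
import hashlib

RINGS = ("canary", "early", "broad")  # hypothetical ring names, in rollout order


def ring_for(host_id, canary_pct=1, early_pct=10):
    """Assign a host to a ring deterministically from a hash of its ID."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    if bucket < canary_pct:
        return "canary"
    if bucket < canary_pct + early_pct:
        return "early"
    return "broad"


def hosts_for_stage(hosts, stage):
    """Return the hosts that should have the update once `stage` is reached."""
    allowed = RINGS[: RINGS.index(stage) + 1]
    return [host for host in hosts if ring_for(host) in allowed]
```

Because the assignment depends only on the host ID, a machine stays in the same ring across updates, and a faulty template would reach a small share of machines before anyone decides whether to continue.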
Last Updated on November 7, 2024 3:23 pm CET