First, I want to clarify that I am not involved in performing QA on code.
However, I did run a division of the international Quality Assurance team at Sikorsky for the CH53K and the UH60 for almost six years. I have a Black Belt in Six Sigma and am an ACE Master Practitioner. For those unfamiliar with these terms, it means I received extensive training in improving quality, increasing efficiency, standardizing processes, and making data-driven decisions.
This experience also means I have significant expertise in performing root cause corrective actions; it was kind of "my thing." After reading the recent CrowdStrike RCCA (Root Cause Corrective Action) Report, I noticed that while it might make someone feel better about the issue because they see that a report was done, it doesn't go nearly deep enough to identify the root cause. Alternatively, they may have reached the root cause but chose not to share it in the analysis, which leads one to ask: then why share the report at all?
One of the most useful techniques I learned to ensure helicopters carrying 20+ passengers didn’t fall out of the sky was the "5 Whys." It's a straightforward method where you just keep asking "why?" until you get to the root of the problem. Five iterations are usually sufficient, but sometimes you need to dig deeper. The point is, when you can't answer "why?" anymore, you have usually found the root cause. (And no, "I don't know" isn't a reason to stop.)
Here's an example from CrowdStrike's analysis:
Template Instance Stress Testing:
Observation: Template Instances were not adequately stress-tested within the Content Interpreter.
Why 1: Why were Template Instances not stress-tested properly? The stress testing process did not include a validation step for input field mismatches.
Why 2: Why was this validation step missing? It was possibly overlooked during the test planning and design phases.
And then they just stop.....
This pattern occurs multiple times in the report. Now that you understand the "5 Whys" technique, it's pretty clear that they should have asked "why?" again. Why was this validation step missing? They didn't answer; they simply said, "Well, it was POSSIBLY overlooked."
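For readers who do write code, here is a rough sketch of that review written as a checklist. It is only an illustration: the `Why` record, the `review_chain` function, the list of hedge words, and the minimum depth of five are all my own assumptions, not anything taken from CrowdStrike's report or tooling, and a real review would rely on human judgment rather than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical hedge words that suggest an answer is a guess, not a finding.
SPECULATIVE_MARKERS = ("possibly", "maybe", "might", "perhaps", "i don't know")


@dataclass
class Why:
    question: str
    answer: str


def review_chain(chain, min_depth=5):
    """Return the reasons a why-chain should not be closed yet."""
    problems = []
    # Five is a rule of thumb, not a hard limit; some chains need more.
    if len(chain) < min_depth:
        problems.append(f"only {len(chain)} of {min_depth} whys asked")
    for i, step in enumerate(chain, start=1):
        lowered = step.answer.lower()
        if any(marker in lowered for marker in SPECULATIVE_MARKERS):
            problems.append(f"why {i} has a speculative answer")
    return problems


# The two steps quoted from the report's stress-testing example.
chain = [
    Why("Why were Template Instances not stress-tested properly?",
        "The stress testing process did not include a validation step "
        "for input field mismatches."),
    Why("Why was this validation step missing?",
        "It was possibly overlooked during the test planning and design phases."),
]

for problem in review_chain(chain):
    print(problem)
```

Run against the quoted example, the sketch flags both problems: the chain stops after two questions, and the second answer is a guess rather than a finding.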
Imagine if, when building an aircraft, I said, "Well, it's possible that the fan was machined incorrectly," and then we just let the rest of the fans (the big fan that sucks in air and makes the engine go) be machined incorrectly because no one asked, "WHY was the fan machined incorrectly?" Yes, this is an extreme example, but crashing roughly 8.5 million computers is extreme as well.
This issue recurs throughout the report, where CrowdStrike stops asking "why." Why was this overlooked during the test and design phase? The reason for repeatedly asking is that perhaps it wasn't overlooked; perhaps there was a different cause, but you'll never know unless you keep digging. This is fundamental QA root cause corrective action that applies to any QA, from jet engines to, yes, even code. This report should not inspire confidence; it seems like a rushed effort to placate stakeholders and to stem the bleeding rather than a thorough investigation.
Maybe the issue was fixed? However, they surely did not release a proper root cause corrective action report, and for a company that just experienced a significant incident, that was a crucial step they should have got right.
Now, I will say CrowdStrike did address several immediate causes of the incident and provided corrective actions to prevent those specific issues from recurring. However, they certainly did not fully apply the "5 Whys" technique (among others) to dig deeper into the underlying reasons behind these proximate causes (and that underlying reason WOULD be the ROOT CAUSE, hence the name).
What they should have dug deeper on:
Why their development and testing processes did not catch the parameter mismatch earlier.
Why critical runtime checks were omitted in the initial development phase.
Why the test planning and design phases did not cover all necessary scenarios.
Why the Content Validator was initially designed with insufficient checks.
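To make "parameter mismatch" and "runtime checks" concrete for readers who write code, here is a minimal sketch of the two kinds of checks in question. Everything in it is hypothetical: the names `validate_instance`, `read_field`, and `TemplateMismatchError`, and the 21-versus-20 field counts, are chosen for illustration only and are not CrowdStrike's actual code or design.

```python
class TemplateMismatchError(ValueError):
    """Raised when supplied inputs do not match the template definition."""


def validate_instance(defined_fields, supplied_values):
    # Pre-release check: the number of supplied values must match
    # the number of fields the template defines.
    if len(supplied_values) != len(defined_fields):
        raise TemplateMismatchError(
            f"template defines {len(defined_fields)} fields, "
            f"got {len(supplied_values)} values"
        )


def read_field(supplied_values, index):
    # Runtime check: never read past the end of the supplied inputs,
    # even if validation was skipped or was itself wrong.
    if not 0 <= index < len(supplied_values):
        raise TemplateMismatchError(f"field index {index} out of range")
    return supplied_values[index]


# Illustrative mismatch: 21 fields defined, only 20 values supplied.
fields = [f"field_{i}" for i in range(21)]
values = ["*"] * 20

try:
    validate_instance(fields, values)
except TemplateMismatchError as err:
    print(f"rejected before release: {err}")
```

Neither check is complicated, which is exactly why the interesting question is not what was missing but why it was missing.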
The crazy part is, after investigating the above issues, they may have ended up needing to go deeper; they may have found a thread to pull on and discovered that the entire QA process isn't built properly. By not diving into these deeper layers, they may have missed systemic issues in their processes that could cause other, similar problems in the future.
As a former Quality Assurance manager, my only question is, why?
Here is the link to the released RCCA Report from CrowdStrike