The world is still recovering from one of the biggest IT outages in history, which can be traced back to cybersecurity software company CrowdStrike.
But what actually caused the outage that Microsoft says affected 8.5 million Windows machines? Dave Plummer, a retired Microsoft software engineer, took to his YouTube channel to explain how it happened in a condensed, extremely informative video. Plummer explains that an operating system uses a ring system to separate code into two distinct privilege levels: Ring 0 (kernel mode) and, on x86 processors, Ring 3 (user mode). Kernel mode is for the operating system itself, and user mode is where applications run.
Kernel mode handles tasks such as communicating with hardware and devices, managing memory, scheduling processes, and other core functions. Kernel code has higher privilege than user code; user code never runs in kernel mode, and kernel code never runs in user mode. Another important distinction is that when application code crashes, only that application goes down, whereas when kernel code crashes, the entire system goes down with it.

A fault in kernel code automatically triggers a system crash, because continuing to operate with faulty privileged code could lead to a worse outcome, such as data corruption. This damage-prevention technique is not unique to Windows; macOS and Linux behave the same way.
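The user-mode half of that distinction can be observed safely from an ordinary program. The following is a minimal sketch, not something taken from Plummer's video: it starts a separate process that deliberately aborts and shows that the parent process, and the operating system, carry on. The function name `run_crashing_child` is made up for this illustration.

```python
import subprocess
import sys


def run_crashing_child() -> int:
    """Start a separate user-mode process that aborts; return its exit status."""
    # Hypothetical stand-in for a buggy application: the child kills itself.
    child_code = "import os; os.abort()"
    completed = subprocess.run(
        [sys.executable, "-c", child_code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # A non-zero status means the child ended abnormally.
    return completed.returncode


if __name__ == "__main__":
    status = run_crashing_child()
    print(f"child process ended abnormally with status {status}")
    print("parent process and operating system are still running")
```

There is no equivalent safe demonstration for kernel mode: a comparable fault in Ring 0 stops the whole machine, which is exactly the point of the distinction.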
So, what runs in kernel mode? Simply put, only the things that have to, such as device drivers, thread scheduling, and other privileged system-wide components. How does this relate to CrowdStrike's faulty driver update? CrowdStrike's Falcon software operates at the kernel level because it attempts to locate and identify new attacks by monitoring applications. To monitor applications from a reliable vantage point, CrowdStrike said, its software needs to run at the kernel level.
CrowdStrike issued an update to the kernel-level driver for Falcon, its malware-prevention software. Files within that update caused a faulty instruction to be carried out, which resulted in a kernel-level malfunction and ultimately a system crash, the dreaded blue screen of death (BSOD). Unfortunately, we don't know why or how CrowdStrike rolled out this update to its Falcon driver, as the company hasn't revealed that information yet.

What is known, though, as Plummer points out, is that the offending CrowdStrike update was not adequately checked internally for possible faults before being rolled out to the public, at least in this case. Some people have also asked online why Windows can't detect a critical driver failure and simply boot without the driver. Unfortunately, CrowdStrike marked its software as a "boot-start driver," which is a driver that is required to start the operating system. You can likely see where the dreaded blue screen boot loop comes into play here.
Because the driver is marked as boot-start, physical intervention is required to fix each machine: the system needs to be booted into Windows Safe Mode, which loads only a limited number of essential drivers and excludes any CrowdStrike drivers. A user can then navigate to the CrowdStrike directory, delete the faulty driver files, and reboot the system without any faults. It is a simple fix, but an extremely time-consuming one when a large number of machines are relied upon for daily operations.
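The boot loop and the Safe Mode fix can be pictured with a small toy model. This is a deliberately simplified sketch, not how Windows actually loads drivers: the `Driver` class, the `boot` and `remove_faulty` functions, and all driver names are hypothetical and exist only for this illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Driver:
    name: str
    essential: bool       # assumed: part of the minimal set loaded in Safe Mode
    faulty: bool = False  # assumed: loading this driver crashes the kernel


def boot(drivers: list[Driver], safe_mode: bool = False) -> bool:
    """Return True if the simulated boot succeeds, False if it crashes."""
    for driver in drivers:
        # A normal boot loads every driver; Safe Mode loads only essential ones.
        loaded = driver.essential if safe_mode else True
        if loaded and driver.faulty:
            return False  # kernel-level fault: the whole system goes down
    return True


def remove_faulty(drivers: list[Driver]) -> list[Driver]:
    """Model the manual fix: delete the faulty driver files."""
    return [d for d in drivers if not d.faulty]


if __name__ == "__main__":
    installed = [
        Driver("disk", essential=True),
        Driver("basic_display", essential=True),
        Driver("security_sensor", essential=False, faulty=True),
    ]
    # Every normal boot fails the same way, which is the boot loop.
    print("normal boots:", [boot(installed) for _ in range(3)])
    # Safe Mode skips the non-essential faulty driver, so the system starts.
    print("safe mode boot:", boot(installed, safe_mode=True))
    # After the faulty files are removed, a normal boot works again.
    print("boot after cleanup:", boot(remove_faulty(installed)))
```

The model also shows why the fix scales poorly: the cleanup step only becomes possible after someone has started each affected machine in Safe Mode.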