On Friday 19 July 2024, the UK awoke to news of a fast-spreading IT outage, seemingly global in its nature, affecting hundreds – if not thousands – of organisations.
The disruption began in the early hours of Friday morning in Australia, before spreading quickly across Asia, Europe and the Americas, with the travel industry among the most widely affected.
The outage was quickly tracked to cyber security firm CrowdStrike, which is already engaged in incident response amid the chaos. Keep on top of this developing incident over the coming days and weeks with our Essential Guide.
What does CrowdStrike do?
CrowdStrike is one of the world’s most prominent cyber security companies, with thousands of customers all over the world. Based in Texas, it employs more than 8,000 people and books about $3bn in revenues per annum. It has been around since 2011.
The organisation bills itself thus: “CrowdStrike has redefined security with the world’s most advanced cloud-native platform that protects and enables the people, processes and technologies that drive modern enterprise. CrowdStrike secures the most critical areas of risk – endpoints and cloud workloads, identity, and data – to keep customers ahead of today’s adversaries and stop breaches.”
CrowdStrike will be unfamiliar to most people not steeped in the technology industry, although Formula 1 fans will be aware of it thanks to its headline sponsorship of the Mercedes AMG Petronas team – its branding appears on the halo safety device and is clearly seen on onboard footage from Lewis Hamilton’s car.
Security practitioners will know CrowdStrike from its frequent contributions to major incident investigations, including the Sony Pictures hack, the WannaCry crisis, and the 2016 hack of the Democratic National Committee by Russia.
What happened during the CrowdStrike outage?
The disruption at first manifested in the form of the infamous blue screen of death – which signals a fatal system error – on Windows PCs.
Given the disruption appeared to be a Microsoft problem to begin with, it was Redmond that first responded, confirming just before 8am BST that it was investigating problems affecting cloud services in the US.
It quickly became apparent that the issue was not down to Microsoft itself, but rather a faulty channel file rolled out to CrowdStrike’s Falcon sensor product.
Falcon is a solution designed to prevent cyber attacks by unifying next-gen antivirus, endpoint detection and response (EDR), threat intelligence and threat hunting, and security hygiene. This is all managed and delivered through a lightweight, cloud-delivered and -managed sensor.
CrowdStrike’s preliminary investigation has now identified the source of the outage as a cloud-delivered, rapid response update to the Falcon sensor. CrowdStrike uses these updates to identify new indicators of threat actor behaviour, and improve its detection and prevention capabilities.
However, in this instance, a template containing “problematic” content data, which led to an out-of-bounds memory condition that caused Windows systems to crash, was cleared for delivery because of a bug in CrowdStrike’s automated content validator tool.
The errors cumulatively caused what is known as a boot loop. This is a situation that occurs when a Windows device restarts without warning during its startup process – meaning the machine cannot finish a complete and stable boot cycle and, therefore, won’t turn on. Such issues will in general occur either due to inadequate testing across various desktop and server environments, or due to a lack of proper sandboxing and rollback mechanisms for updates that involve a kernel-level interaction.
At the time of writing, more information about the precise nature of the incident continues to emerge, but the full facts have not yet been established and an investigation will likely take some time.
Is there a cyber security threat from the CrowdStrike outage?
Though similar in its effect and origins to a supply chain attack, it is important to note that the CrowdStrike outage is not a cyber security incident and nobody is known to be under attack as a result of it.
However, as it affects a cyber security product, threat actors will take advantage of the downtime caused and any gaps in coverage arising. This has already started to happen: within hours of the incident unfolding, CrowdStrike itself said it had identified a malicious ZIP archive in circulation that purported to contain a utility to help automate recovery, but was in fact a so-called remote access Trojan (RAT).
Multiple national cyber security agencies, including the UK’s National Cyber Security Centre (NCSC) and partners in Australia, Singapore and the US, have also issued cyber alerts and advisories in the wake of the outage.
The coming days and weeks will see threat actors exploiting the incident in phishing and social engineering attacks as they attempt to lure new victims. Potential lures could include offers of technical support or bogus CrowdStrike updates, and the consequences could include data exfiltration, ransomware deployment and extortion.
Researchers at Akamai say they have identified well over 180 malicious domains in circulation that exploit CrowdStrike’s misfortune, many of them incorporating keywords likely to be used by people searching for more information. Some of these websites are known to be linked to large-scale phishing operations and, like email phishing lures, purport to offer information on technical support, fixes and updates, and even potential class action lawsuits.
Security and IT leaders and admins would be well-advised to communicate the potential follow-on dangers to their users. The most effective thing anybody can do is to only trust information and updates that come directly from CrowdStrike via its dedicated incident hub.
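As a rough illustration of the "official sources only" advice, the sketch below checks whether a link's hostname really belongs to CrowdStrike's domain rather than merely containing its name. It assumes crowdstrike.com is the only domain to be trusted; the function name is hypothetical, and a check like this complements user awareness, it does not replace it.

```python
from urllib.parse import urlsplit

# Assumption: crowdstrike.com is the only domain treated as official here.
TRUSTED_DOMAIN = "crowdstrike.com"


def is_official_crowdstrike_url(url: str) -> bool:
    """Return True only if the URL's host is the trusted domain or a subdomain of it."""
    # urlsplit().hostname strips any port and user-info part and lowercases the host.
    host = (urlsplit(url).hostname or "").rstrip(".")
    return host == TRUSTED_DOMAIN or host.endswith("." + TRUSTED_DOMAIN)


if __name__ == "__main__":
    for link in (
        "https://www.crowdstrike.com/",
        "https://crowdstrike.com.support-fix.example/",  # lookalike: trusted name is only a prefix
        "https://crowdstrike-hotfix.example/",           # lookalike: keyword in another domain
    ):
        print(link, is_official_crowdstrike_url(link))
```

Comparing the parsed hostname, rather than searching the whole URL for the string "crowdstrike", is the point of the design: lookalike domains typically embed the brand name somewhere a naive substring match would accept.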
Who was affected by the CrowdStrike outage?
According to Microsoft, the incident affected approximately 8.5 million Windows devices worldwide, making up less than 1% of the entire estate. Redmond said that while this was a tiny number, all things considered, the widespread economic and societal impacts of the incident reflected the use of CrowdStrike by many organisations that run critical public-facing services.
The full number of organisations affected by the outage is not known for now. However, those that are known to have, or have confirmed they have, experienced some impact include:
- Airlines including American Airlines, Delta, KLM, Lufthansa, Ryanair, SAS and United;
- Airports including Gatwick, Luton, Stansted and Schiphol;
- Financial organisations including the London Stock Exchange, Lloyds Bank and Visa;
- Healthcare including most GP surgeries and many independent pharmacies;
- Media organisations including MTV, VH1, Sky and some BBC channels;
- Retailers, leisure and hospitality organisations including Gail’s Bakery, Ladbrokes, Morrisons, Tesco and Sainsbury’s;
- Sporting bodies including F1 teams Aston Martin Aramco, Mercedes AMG Petronas and Williams Racing, which were preparing to compete at the Hungarian Grand Prix at the time, and the Paris 2024 Organising Committee for the Olympic and Paralympic Games, which begin in a few days;
- Train operating companies (TOCs) such as Avanti West Coast, Merseyrail, Southern and Transport for Wales.
What is CrowdStrike saying about the outage?
In an initial statement, CrowdStrike CEO George Kurtz said: “CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyber attack.
“The issue has been identified, isolated and a fix has been deployed. We refer customers to the support portal for the latest updates and will continue to provide complete and continuous updates on our website.
“We further recommend organisations ensure they’re communicating with CrowdStrike representatives through official channels. Our team is fully mobilised to ensure the security and stability of CrowdStrike customers.”
In a breakfast TV interview with NBC in the US on 19 July, Kurtz added: “We’re deeply sorry for the impact that we’ve caused to customers, to travellers, to anyone affected by this, including our company.”
He later added: “Nothing is more important to me than the trust and confidence that our customers and partners have put into CrowdStrike. As we resolve this incident, you have my commitment to provide full transparency on how this occurred and steps we’re taking to prevent anything like this from happening again.”
What is CrowdStrike doing about it?
Beyond rolling back the tainted update, which was done within the space of just over an hour and a quarter, CrowdStrike has since set out an extensive preliminary plan designed to keep such an incident from occurring again.
This includes improving the resiliency of rapid response updates through enhanced developer testing, update and rollback testing, stress testing, fuzzing and fault injection, stability testing, and content interface testing. It will also add better validation checks to its content validator system, and enhance some other components of its setup with improved error handling capabilities.
Going forward, updates made under the rapid response programme will be staggered, rolling out bit by bit across the installed base of Falcon sensors, beginning with what is known as a canary deployment. During this process, enhanced monitoring will be conducted on sensor and system performance and, as a last resort, customers will be given the ability to control the delivery of such updates, which will also be clearly set out to them in release notes.
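To make the idea of a staggered, canary-first rollout concrete, here is a minimal sketch. It is not CrowdStrike's implementation: the function name, the stage fractions and the `deploy` and `is_healthy` callbacks are all assumptions chosen for illustration. The update goes to a tiny slice of hosts first, and each wider stage proceeds only if every host in the previous batch still reports healthy.

```python
from typing import Callable, Sequence


def staged_rollout(
    hosts: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Sequence[float] = (0.01, 0.10, 0.50, 1.0),  # hypothetical stages
) -> list[str]:
    """Deploy to progressively larger slices of hosts, halting on any failure.

    Returns the hosts that received the update before the rollout
    completed or was halted.
    """
    updated: list[str] = []
    for fraction in stage_fractions:
        # Cumulative target for this stage; never fewer than one host (the canary).
        target = max(1, int(len(hosts) * fraction))
        batch = hosts[len(updated):target]
        for host in batch:
            deploy(host)
            updated.append(host)
        # Monitoring gate: stop here rather than widen the impact of a bad update.
        if not all(is_healthy(host) for host in batch):
            break
    return updated


if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(100)]
    # Simulate an update that crashes every host it touches.
    reached = staged_rollout(fleet, deploy=lambda h: None, is_healthy=lambda h: False)
    print(f"Rollout halted after {len(reached)} of {len(fleet)} hosts")
```

In this simulation a faulty update reaches one canary host out of 100 instead of the whole fleet. A production system would also need rollback for the hosts already updated, which is left out here for brevity.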
As of the weekend of 27 – 28 July, CrowdStrike leadership was reporting that 97% of the affected Falcon sensors had been successfully recovered. The company also told TechTarget Editorial that it was working towards rolling out a backend fix for the logic error that caused its automated validator to miss the dodgy code. The fix is expected in the coming days.
What has been Microsoft’s response?
Microsoft has been working extensively alongside CrowdStrike to automate work on developing and pushing a fix, and in the wake of the outages hundreds of its engineers and software experts were deployed to work directly with customers on service restoration.
Microsoft has also been collaborating with cloud providers such as Google Cloud and Amazon Web Services (AWS) to share awareness on the impacts seen, and better inform ongoing dialogue with customers and CrowdStrike itself.
Subsequently, Redmond has fallen back on an EU anti-competition ruling from 2009 as a line of defence. This ruling holds that Microsoft must ensure the interoperability of third-party products with relevant software products on an equal basis. Ultimately, this means Microsoft appears to believe it was forced to allow CrowdStrike too deep into its core operating system.
Can I fix the CrowdStrike problem myself?
CrowdStrike has rolled back the changes to the affected product automatically, but hosts may continue to crash or be unable to stay online to receive the remedial update.
The short answer to the question is yes, but unfortunately, such issues can be daunting to fix, requiring IT teams to put in a lot of work. It may be days, or even longer, before all the affected devices can be reached.
System administrators are advised to take the following steps:
- Boot Windows into safe mode, or the Windows Recovery Environment;
- Navigate to the C:\Windows\System32\drivers\CrowdStrike directory;
- Locate the file matching “C-00000291*.sys”. Delete this file;
- Boot normally.
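The file-deletion step above can be sketched in a few lines of Python. This is purely illustrative: safe mode and the Windows Recovery Environment will not normally have Python available, so in practice admins would use the built-in command line or their own tooling. The function name and the `dry_run` flag are hypothetical; the directory and filename pattern are those given in the steps.

```python
from pathlib import Path

# Directory and filename pattern from the remediation steps.
DEFAULT_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")
PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(
    directory: Path = DEFAULT_DIR, dry_run: bool = True
) -> list[Path]:
    """Return the files matching PATTERN; delete them unless dry_run is True."""
    if not directory.is_dir():
        return []
    matches = sorted(directory.glob(PATTERN))
    for path in matches:
        if not dry_run:
            path.unlink()
    return matches


if __name__ == "__main__":
    # Default to a dry run so nothing is removed until the matches are reviewed.
    for found in remove_faulty_channel_files(dry_run=True):
        print(f"Would delete: {found}")
```

Defaulting to a dry run is a deliberate choice: the pattern should match only the faulty channel file, and listing the matches first lets an admin confirm that before anything is deleted from a drivers directory.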
CrowdStrike customers can access more information by logging into its support portal.
How can I avoid similar problems in the future?
Security firms such as CrowdStrike are under a great deal of pressure when it comes to product development and updates, which must be done frequently as they strive to keep their customers protected from new zero-days, ransomware and the like.
This pressure also trickles down to customers themselves, who will understandably often want to take advantage of settings to allow their security tools to update automatically.
To avoid falling victim to this kind of problem going forward, IT teams should consider taking a phased approach to software updates – particularly if they pertain to security solutions – and test them in a sandbox environment, or on a limited set of devices, prior to full deployment.
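One way to pick the "limited set of devices" for a phased approach is to assign every device to an update ring deterministically, so the same machines always receive updates first. The sketch below is an assumption-laden illustration: the ring names, the percentage thresholds and the use of a hash of the device ID are hypothetical choices, not a feature of any particular product.

```python
import hashlib

# Hypothetical rings with cumulative percentage thresholds:
# roughly 5% of devices in "test", the next 20% in "early", the rest in "broad".
RINGS = (("test", 5), ("early", 25), ("broad", 100))


def assign_ring(device_id: str) -> str:
    """Map a device ID to an update ring, stably across runs and machines."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # value in 0..99
    for name, upper_bound in RINGS:
        if bucket < upper_bound:
            return name
    return RINGS[-1][0]


if __name__ == "__main__":
    for device in ("LAPTOP-0001", "POS-TILL-17", "SRV-DB-02"):
        print(device, "->", assign_ring(device))
```

Hashing rather than random sampling keeps the assignment stable without storing any state. In practice, teams would likely also want to place a representative mix of hardware and critical systems in the early rings by hand.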
It is also wise to have some level of system redundancy built in to properly isolate and manage fault domains, particularly when running critical infrastructure. IT teams should also attend to IT asset management and software asset management, and establish strong disaster recovery and business continuity planning as a priority, and these should of course be tested regularly.
Those organisations that were unaffected should view the situation as a wake-up call, rather than a lucky escape.