What can we learn from the CrowdStrike IT outage?

BlackHatMEA

The CrowdStrike global IT outage caused widespread disruption, with critical industries put at risk as some of their services stalled. Now, we’re seeing large corporations (like Delta Air Lines) bringing lawsuits against CrowdStrike, as industries work to overcome the ongoing impact of IT failures.

We asked Yassir Abousselham (Founder and CEO at Silicon Valley Cyber) to share his perspective on the media response and the lessons we can learn from the CrowdStrike outage.

Here’s what he told us.

What was your impression of the response to the CrowdStrike outage – both the response we saw in the mainstream media, and the response you observed among the cybersecurity community?

“Both the cybersecurity community and the media reacted with shock at the impact that a single vendor can have across industries. This reaction is typical for large-scale events that serve as a sobering reminder of the fragility of our technology-dependent economy.”

What can organisations (and particularly those in critical industries) do to increase their resilience against third party outages like this?

“To improve resilience against similar outages, organisations, especially those in critical industries, should assess the potential impact of third-party software on their service availability and update their business continuity procedures accordingly.

“Open source and commercial software with elevated system privileges, particularly software that receives updates directly from the vendor or developer, is a prime target for such assessments.

“Organisations should adopt deployment strategies that allow them to catch issues before the update is promoted to production systems. Concurrently, other mechanisms should be considered to mitigate or transfer the risk of similar events. These mechanisms include:

Developing appropriate business continuity and incident response procedures for third-party-caused outages.
Implementing centralised asset and configuration management.
Testing system recovery scenarios.
Providing backup access to users through Virtual Desktop Infrastructure or Secure Browsers on personal devices.
Ensuring vendor agreements contain appropriate indemnifications for compromises and availability-impacting events.
Confirming cyber insurance coverage for third-party-caused outages.”
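
One way to picture the "catch issues before the update is promoted to production" advice is a ring-based rollout with a health gate between rings. The Python sketch below is purely illustrative and is not something described in the interview: `Ring`, `RolloutResult`, `staged_rollout`, the host names, and the `max_failure_rate` threshold are all hypothetical, and `apply_update` stands in for whatever mechanism an organisation actually uses to push an update and check that the host is still healthy afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Ring:
    """A named group of hosts that receives an update together (hypothetical)."""
    name: str
    hosts: List[str]


@dataclass
class RolloutResult:
    completed: bool
    halted_at: Optional[str] = None
    attempted: List[str] = field(default_factory=list)
    failed: List[str] = field(default_factory=list)


def staged_rollout(
    rings: List[Ring],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> RolloutResult:
    """Update rings in order; stop before the next ring if too many hosts fail.

    apply_update(host) must return True if the host is healthy after the update.
    """
    result = RolloutResult(completed=True)
    for ring in rings:
        ring_failures = []
        for host in ring.hosts:
            result.attempted.append(host)
            if not apply_update(host):
                ring_failures.append(host)
        result.failed.extend(ring_failures)
        rate = len(ring_failures) / len(ring.hosts) if ring.hosts else 0.0
        if rate > max_failure_rate:
            # Health gate tripped: later rings are never touched.
            result.completed = False
            result.halted_at = ring.name
            break
    return result


# Assumed example fleet: lab machines first, production systems last.
RINGS = [
    Ring("canary", ["lab-01", "lab-02"]),
    Ring("pilot", ["office-01", "office-02", "office-03"]),
    Ring("production", ["checkin-01", "checkin-02", "checkin-03", "checkin-04"]),
]


def faulty_update(host: str) -> bool:
    """Simulates an update that leaves every host unhealthy."""
    return False


outcome = staged_rollout(RINGS, faulty_update)
print(outcome.halted_at, outcome.attempted)
```

With the simulated faulty update, the rollout halts at the `canary` ring after touching only `lab-01` and `lab-02`, so the pilot and production hosts never receive the update. A gate like this only helps for updates the organisation is able to stage itself.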

“OS vendors such as Microsoft also have the responsibility to ensure that third-party code with elevated system access (e.g., Ring Zero/Kernel/System/Root) is subjected to appropriate testing. While CrowdStrike certifies new sensor releases through Microsoft’s Windows Hardware Quality Labs (WHQL) program, channel updates bypass these tests.

“Considering the large number of vendors with kernel access, the more reliable approach would be for Operating System vendors, including Microsoft, to improve the resilience of their products by implementing further guardrails to prevent similar incidents from occurring.”

How concerned should we be about the possibility of an even more serious outage incident affecting critical industries?

“Given the increasingly interconnected technology ecosystem, there is no guarantee that similar events will not occur in the future.

“Specifically, most Endpoint Detection & Response (EDR) vendors, along with those in other product categories, have kernel access in Windows. A faulty update from any of these vendors could cause a similar incident.

“Our responsibility as technology customers is to learn from the CrowdStrike outage and deploy mitigations for similar scenarios. Additionally, we should demand better guardrails from OS vendors and transparency on testing, service availability, incident response, and product security from vendors with elevated system access and impact on critical services.”

CrowdStrike made it clear that the outage was ‘not a security incident or cyber attack.’ Do you think it’s reasonable to define an outage like this as ‘not a security incident’?

“By definition, security incidents or cyber attacks involve a threat actor and unauthorised access. As a security solutions vendor, it was important for CrowdStrike to provide accurate details about the incident to inform their customers’ response.

“Although availability is one of the pillars of the information security triad (Confidentiality, Integrity, and Availability), labelling this outage as a security incident would have triggered irrelevant incident response procedures, wasted valuable incident response resources, and eroded CrowdStrike’s brand image as a leading cybersecurity vendor.”

Finally, how do you think cybersecurity events like Black Hat MEA could facilitate the development of greater resilience among cybersecurity vendors and organisations?

“Black Hat MEA plays a crucial role in facilitating the exchange of information on organisational resilience. As the largest gathering of cybersecurity professionals in the region, the event serves as a forum where organisations affected by outages can share lessons learned.

“Additionally, cybersecurity vendors have the opportunity to provide transparency regarding how they ensure their customers’ business continuity, alongside their threat mitigation capabilities. Information related to product security, testing, rollback practices, and incident handling procedures should be made available to support informed purchasing and risk management decisions.”

P.S. Mark your calendars for the return of Black Hat MEA in November 2024.

AlexMillerJr

The CrowdStrike IT outage highlights critical lessons for organizations, especially in vulnerable industries. Here are the key takeaways based on Yassir Abousselham’s insights:

Recognize Dependency Risks
The widespread disruption from a single vendor underscores the fragility of our interconnected technology ecosystem. Organizations must assess how third-party software affects their service availability.
Enhance Business Continuity Plans
Companies should develop robust business continuity and incident response procedures specifically for outages caused by third parties. This includes regular testing of recovery scenarios and implementing centralized asset management.
Adopt Better Deployment Strategies
Implementing strategies that allow for testing updates before they go live can help catch issues early. This could involve using staging environments to identify problems before affecting production systems.
Vendor Management and Agreements
Organizations should ensure that vendor agreements include indemnifications for potential outages. Cyber insurance should also cover third-party-caused incidents to mitigate financial impacts.
Demand Better Standards from OS Vendors
Operating System vendors like Microsoft need to strengthen their testing and security protocols for third-party software that has elevated system access. This can prevent future incidents and enhance overall system resilience.
Be Prepared for Future Outages
Given the interconnectivity of technology, organizations should not assume similar incidents won’t happen again. Developing proactive measures and demanding transparency from vendors can help manage risks.
Clarify Security Definitions
It’s essential to differentiate between security incidents and outages. CrowdStrike’s statement that the outage was not a security incident helped customers avoid misallocating incident response resources and protected the company’s credibility.
Utilize Industry Events for Knowledge Sharing
Events like Black Hat MEA foster collaboration and information exchange among cybersecurity professionals, allowing organizations to learn from each other’s experiences and improve resilience strategies.

In conclusion, the CrowdStrike outage serves as a critical reminder for organizations to proactively address their dependencies on third-party vendors and enhance their preparedness for similar disruptions in the future.