This is an unpopular opinion, and I get why – people crave a scapegoat. CrowdStrike undeniably pushed a faulty update demanding a low-level fix (booting into recovery). However, this incident lays bare the fragility of corporate IT, particularly for companies entrusted with vast amounts of sensitive personal information.

Robust disaster recovery plans, including automated processes to remotely reboot and remediate thousands of machines, aren’t revolutionary. They’re basic hygiene, especially when considering the potential consequences of a breach. Yet, this incident highlights a systemic failure across many organizations. While CrowdStrike erred, the real culprit is a culture of shortcuts and misplaced priorities within corporate IT.

Too often, companies throw millions at vendor contracts, lured by flashy promises and neglecting the due diligence necessary to ensure those solutions truly fit their needs. This is exacerbated by a corporate culture where CEOs, vice presidents, and managers are often more easily swayed by vendor kickbacks, gifts, and lavish trips than by investing in innovative ideas with measurable outcomes.

This misguided approach not only results in bloated IT budgets but also leaves companies vulnerable to precisely the kind of disruptions caused by the CrowdStrike incident. When decision-makers prioritize personal gain over the long-term health and security of their IT infrastructure, it’s ultimately the customers and their data that suffer.

  • r00ty@kbin.life
    link
    fedilink
    arrow-up
    18
    arrow-down
    1
    ·
    3 months ago

    I think it’s most likely a little of both. It seems like the fact most systems failed at around the same time suggests that this was the default automatic upgrade /deployment option.

    So, for sure the default option should have had upgrades staggered within an organisation. But at the same time organisations should have been ensuring they aren’t upgrading everything at once.

    As it is, the way the upgrade was deployed made the software a single point of failure that completely negated redundancies and in many cases hobbled disaster recovery plans.

    • DesertCreosote@lemm.ee
      link
      fedilink
      English
      arrow-up
      24
      ·
      3 months ago

      Speaking as someone who manages CrowdStrike in my company, we do stagger updates and turn off all the automatic things we can.

      This channel file update wasn’t something we can turn off or control. It’s handled by CrowdStrike themselves, and we confirmed that in discussions with our TAM and account manager at CrowdStrike while we were working on remediation.

      • r00ty@kbin.life
        link
        fedilink
        arrow-up
        6
        ·
        3 months ago

        That’s interesting. We use crowdstrike, but I’m not in IT so don’t know about the configuration. Is a channel file, somehow similar to AV definitions? That would make sense, and I guess means this was a bug in the crowdstrike code in parsing the file somehow?

        • DesertCreosote@lemm.ee
          link
          fedilink
          English
          arrow-up
          8
          ·
          3 months ago

          Yes, CrowdStrike says they don’t need to do conventional AV definitions updates, but the channel file updates sure seem similar to me.

          The file they pushed out consisted of all zeroes, which somehow corrupted their agent and caused the BSOD. I wasn’t on the meeting where they explained how this happened to my company; I was one of the people woken up to deal with the initial issue, and they explained this later to the rest of my team and our leadership while I was catching up on missed sleep.

          I would have expected their agent to ignore invalid updates, which would have prevented this whole thing, but this isn’t the first time I’ve seen examples of bad QA and/or their engineering making assumptions about how things will work. For the amount of money they charge, their product is frustratingly incomplete. And asking them to fix things results in them asking you to submit your request to their Ideas Portal, so the entire world can vote on whether it’s a good idea, and if enough people vote for it they will “consider” doing it. My company spends a fortune on their tool every year, and we haven’t been able to even get them to allow non-case-sensitive searching, or searching for a list of hosts instead of individuals.

          • r00ty@kbin.life
            link
            fedilink
            arrow-up
            3
            ·
            3 months ago

            Thanks. That explains a lot of what I didn’t think was right regarding the almost simultaneous failures.

            I don’t write kernel code at all for a living. But, I do understand the rationale behind it, and it seems to me this doesn’t fit that expectation. Now, it’s a lot of hypothetical. But if I were writing this software, any processing of these files would happen in userspace. This would mean that any rejection of bad/badly formatted data, or indeed if it managed to crash the processor it would just be an app crash.

            The general rule I’ve always heard is that you want to keep the minimum required work in the kernel code. So I think processing/rejection should have been happening in userspace (and perhaps even using code written in a higher level language with better memory protections etc) and then a parsed and validated set of data would be passed to the kernel code for actioning.

            But, I admit I’m observing from the outside, and it could be nothing like this. But, on the face of it, it does seem to me like they were processing too much in the kernel code.

      • daddy32@lemmy.world
        link
        fedilink
        English
        arrow-up
        4
        ·
        3 months ago

        There was a “hack” mentioned in another thread - you can block it via firewall and then selectively open it.