• ⸻ Ban DHMO 🇦🇺 ⸻@aussie.zone
    6 months ago

    It looks like ABC has changed the internal layout of their pages for whatever reason. As a result, the bot seems to be selecting just the first block quote as the entire article.

    On The Register, for example, it selects the div with the id #body. For ABC it seems to look for the class Article_Body, which I can’t find on that article. I might take a closer look later if I’ve got some time and try to get a PR in if it doesn’t get fixed.

    • Rikudou_Sage@lemmings.world
      6 months ago

      That’s the case: they removed one level of nesting from the HTML. Anyway, it doesn’t look for the exact class Article_body, but for any class that starts with Article_body. They use randomized class names where only the prefix is constant, which is why I have to do it that way. I’ve updated it to this horrible-looking selector: div[class*="Article_body"] > div > p, div[class*="Article_body"] > div > ul:not([class*="ShareUtility"]) > li
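
      For anyone who wants to see what the first half of that selector does, here is a rough standard-library Python sketch. It is not the bot's code: the class suffix "x7Gh2", the sample HTML, and the names ArticleBodyExtractor and extract_paragraphs are all made up for illustration, and the ul/li half of the selector is left out. Note that [class*="..."] is a substring match on the whole class attribute, which is what the `in` check below mimics.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ArticleBodyExtractor(HTMLParser):
    """Roughly emulates: div[class*="Article_body"] > div > p"""

    def __init__(self, needle="Article_body"):
        super().__init__()
        self.needle = needle
        self.stack = []            # (tag, class attribute) per open element
        self.capture_depth = None  # stack depth at which the captured <p> opened
        self.buffer = []
        self.paragraphs = []

    def _parent_chain_matches(self):
        # Parent must be a div; grandparent must be a div whose class
        # attribute contains the needle (case-sensitive substring match).
        if len(self.stack) < 2:
            return False
        (gp_tag, gp_cls), (parent_tag, _) = self.stack[-2], self.stack[-1]
        return gp_tag == "div" and parent_tag == "div" and self.needle in gp_cls

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        cls = dict(attrs).get("class") or ""
        if self.capture_depth is None and tag == "p" and self._parent_chain_matches():
            self.capture_depth = len(self.stack)
            self.buffer = []
        self.stack.append((tag, cls))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Pop back to the nearest matching open tag (tolerates sloppy HTML).
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        if self.capture_depth is not None and len(self.stack) <= self.capture_depth:
            self.paragraphs.append("".join(self.buffer).strip())
            self.capture_depth = None

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.buffer.append(data)


def extract_paragraphs(html, needle="Article_body"):
    parser = ArticleBodyExtractor(needle)
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical markup; the randomized suffix is invented for this example.
SAMPLE = """
<blockquote>Pull quote that a naive fallback might grab.</blockquote>
<div class="Article_body__x7Gh2">
  <div>
    <p>First paragraph.</p>
    <p>Second <a href="#">linked</a> paragraph.</p>
  </div>
  <p>Direct child of the outer div, skipped by the selector.</p>
</div>
<div class="Sidebar_body__q1"><div><p>Unrelated.</p></div></div>
"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE):
        print(paragraph)
```

      Because every step uses the child combinator (>), a paragraph only matches at exactly that depth below the Article_body div, which is why removing one level of nesting on the site was enough to break the old selector.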

      • ⸻ Ban DHMO 🇦🇺 ⸻@aussie.zone
        6 months ago

        Thanks! I thought it might’ve been a wildcard thing but wasn’t sure. They really don’t want their articles summarised, do they? (Or they’re probably trying to discourage AI scrapers.)