AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt

pelespirit@sh.itjust.works · 2 days ago

LovableSidekick@lemmy.world · edit-2 2 days ago

You can detect paths that come up repeatedly and avoid pursuing them further. Technically that isn’t called “infinite loop” detection, but I don’t know the correct name. The point is that the software isn’t a Star Trek robot that starts smoking and bricks itself when it hears something illogical.

Crassus@feddit.nl · 2 days ago

It can detect cycles. From a quick look at the demo of this tool, it (slowly) generates some garbage text and then places 10 random links. Each of these links leads to a newly generated page. So although generating the same link twice will surely happen, the chance that all 10 of the links have already been generated before is small.
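To put a rough number on "small", here is a back-of-the-envelope sketch. It assumes each link ID is drawn independently and uniformly from a fixed pool; that is my assumption for illustration, not something confirmed about how the tool generates links. The names `prob_all_links_seen`, `seen`, and `id_space` are hypothetical.

```python
def prob_all_links_seen(seen: int, id_space: int, links_per_page: int = 10) -> float:
    """Chance that every link on a fresh page points at an already-visited ID.

    Assumption: IDs are independent, uniform draws from `id_space` possible
    values, and the crawler has already visited `seen` distinct IDs.
    """
    if id_space <= 0 or not 0 <= seen <= id_space:
        raise ValueError("need 0 <= seen <= id_space and id_space > 0")
    # Each link hits a visited ID with probability seen / id_space,
    # and the links are assumed independent, so the probabilities multiply.
    return (seen / id_space) ** links_per_page
```

Under that assumption, even a crawler that has already visited half of the ID pool would only hit a page with no new links about once in a thousand pages (0.5 to the 10th power).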

LovableSidekick@lemmy.world · edit-2 2 days ago

I would simply add links to a list when visited and never revisit any. And that’s just simple web crawler logic, not even AI. Web crawlers that avoid problems like that are beginner/intermediate computer science homework.
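A minimal sketch of that "never revisit" logic, for reference. It assumes a caller-supplied `get_links(url)` function (hypothetical, standing in for fetching and parsing a page) and uses a set rather than a list so membership checks stay cheap. It is an illustration, not how any particular scraper works.

```python
from collections import deque

def crawl(start, get_links, max_pages=1000):
    """Breadth-first crawl that never visits the same URL twice.

    `get_links` is a hypothetical callable returning the URLs found on a page.
    `max_pages` is an assumed safety cap so the crawl always terminates.
    """
    visited = set()          # every URL we have already fetched
    queue = deque([start])   # URLs waiting to be fetched
    order = []               # fetch order, returned for inspection
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue         # already fetched: skip, which breaks cycles
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order
```

This stops ordinary cycles (A links to B, B links back to A). It only helps when URLs actually repeat.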

vrighter@discuss.tchncs.de · 1 day ago

sure, if you have enough memory to store a list of all guids.

LovableSidekick@lemmy.world · 16 hours ago

It doesn’t have to memorize all possible guids. It just has to limit the number of visits per base url.
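A sketch of what a per-base-URL limit could look like, assuming "base url" means the host part of the address and assuming an arbitrary budget of pages per host. `get_links` and `max_per_host` are hypothetical names for illustration.

```python
from collections import Counter, deque
from urllib.parse import urlsplit

def crawl_with_host_budget(start, get_links, max_per_host=3):
    """Crawl, but fetch at most `max_per_host` pages from any single host.

    `get_links` is a hypothetical callable returning the URLs found on a page.
    """
    visits = Counter()       # pages fetched so far, keyed by host
    seen = set()             # URLs already fetched
    queue = deque([start])
    fetched = []
    while queue:
        url = queue.popleft()
        host = urlsplit(url).netloc
        if url in seen or visits[host] >= max_per_host:
            continue         # repeat URL, or this host has used up its budget
        seen.add(url)
        visits[host] += 1
        fetched.append(url)
        queue.extend(get_links(url))
    return fetched
```

The per-host counter is one integer per site, so it stays small even when every generated URL is unique. The trade-off is that a legitimately large site gets cut off at the same budget.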

vrighter@discuss.tchncs.de · 6 hours ago

what part of “they do not repeat” do you still not get? You can put them in a list, but you won’t ever get a hit, so it’d just be wasting memory.