Exclude the Internet Archive (Archive.org) crawler from bot protection?

Hello all,

It looks like pages protected by “Cloudflare Bot Protection” cannot be archived by Archive.org. For example, see the 403 here: https://web.archive.org/save/https://spectrumcomputing.co.uk/entry/8249/ZX-Spectrum/HiSoft_BASIC (the original URL is https://spectrumcomputing.co.uk/entry/8249/ZX-Spectrum/HiSoft_BASIC - protected by Cloudflare).
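
For anyone who wants to repeat this check on another page, here is a minimal Python sketch. It only builds the same kind of save URL as in the example above and reports the HTTP status. The helper names (`save_url`, `describe`, `try_save`), the User-Agent string, and the reading of a 403 as "possibly blocked by bot protection" are my own assumptions, not anything documented by Archive.org or Cloudflare.

```python
import urllib.error
import urllib.request

# Prefix taken from the example URL in this post.
SAVE_PREFIX = "https://web.archive.org/save/"


def save_url(original_url: str) -> str:
    """Build the save URL for a page by prepending the prefix."""
    return SAVE_PREFIX + original_url


def describe(status: int) -> str:
    """Rough, assumed interpretation of the status of a save attempt."""
    if 200 <= status < 400:
        return "archived or redirected"
    if status == 403:
        # Assumption: a 403 may mean the origin's bot protection refused the crawler.
        return "forbidden (possibly blocked by the origin's bot protection)"
    return f"other failure ({status})"


def try_save(original_url: str, timeout: float = 30.0) -> int:
    """Request the save URL and return the HTTP status code."""
    request = urllib.request.Request(
        save_url(original_url),
        headers={"User-Agent": "save-check-sketch"},  # hypothetical client name
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still available.
        return err.code


if __name__ == "__main__":
    page = "https://spectrumcomputing.co.uk/entry/8249/ZX-Spectrum/HiSoft_BASIC"
    status = try_save(page)
    print(save_url(page), "->", status, describe(status))
```

Note that running it actually asks Archive.org to capture the page, so it is best used sparingly.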

I believe that 403 is because the site is protected by Cloudflare. Given the valuable service Archive.org provides in preserving our cultural heritage, could Cloudflare avoid blocking it by default? What would be the best place to raise such a request?

Attila

It’s on the verified/known bots list.

Which specific Bot Protection are you using? Is it this one, or one of the Super Bot Fight modes?

Sorry, I wasn’t clear. This is not my site; it is run by somebody else. I have already reached out to them, but I also wanted to raise the issue here in case there is something more generic Cloudflare could do to make more sites accessible to Archive.org.

Actually, now that I’m looking at the description of “Bot Fight Mode”, I’m more confused: “challenge requests that match patterns of known bots” - why? I would expect these known bots to be more or less well behaved (or that they can be made so with robots.txt, for example). I would expect such a feature to mainly target “unknown” bots…

Anyway, is there a way to exclude a specific bot from the list?

It depends on which “list”. For good bots (the FAQ list), the bot owner can make a request:

But to be clear, Cloudflare’s default configuration won’t block many bots, so Archive would be able to get through. Beyond that, it’s up to the site owner to properly configure their firewall to block and allow appropriately.
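
To illustrate what "block and allow appropriately" can mean, here is a small Python sketch of the decision order a site owner might aim for: let a verified, explicitly allow-listed crawler through first, and only then challenge other bot-like traffic. This is a toy model, not Cloudflare's implementation or rule syntax. The `Request` fields, the `decide` function, and the `archive.org_bot` user-agent token are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed user-agent token for the Internet Archive crawler; check it before relying on it.
ALLOWED_BOT_TOKENS = ("archive.org_bot",)


@dataclass(frozen=True)
class Request:
    user_agent: str
    verified_bot: bool  # hypothetical flag: the bot's identity was verified upstream
    bot_like: bool      # hypothetical flag: the traffic matches bot patterns


def decide(request: Request) -> str:
    """Return "allow" or "challenge" for a request in this toy policy."""
    ua = request.user_agent.lower()
    # The allow rule comes first, so a verified allow-listed crawler is never challenged.
    if request.verified_bot and any(token in ua for token in ALLOWED_BOT_TOKENS):
        return "allow"
    # A user agent alone is easy to spoof, so unverified bot-like traffic is still challenged.
    if request.bot_like:
        return "challenge"
    return "allow"
```

The point of the ordering is that the allow check must run before the generic bot check; whether a given bot protection feature can actually be skipped this way depends on the product and plan, so the site owner would need to confirm that in their own settings.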

Thank you for that link, I’ve reached out to Archive.org and asked them to fill out the form.

All the best.
