How to block Archive.org


#1

I tried to block Archive.org and Archive.is from accessing my website using 3 methods:

  1. robots.txt
    User-agent: ia_archiver
    Disallow: /

    User-agent: archive.org_bot
    Disallow: /

    User-agent: ia_archiver-web.archive.org
    Disallow: /

  2. .htaccess
    SetEnvIfNoCase User-Agent "^archive.org_bot" bad_bot
    SetEnvIfNoCase User-Agent "^ia_archiver" bad_bot
    SetEnvIfNoCase User-Agent "^ia_archiver-web.archive.org" bad_bot

    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
  3. Cloudflare firewall / User Agent Blocking
    archive.org_bot

But they keep on archiving my site.
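
One possible reason the .htaccess rules never fire: the patterns start with `^`, so they only match when the User-Agent header begins with the bot name. Many crawlers send a header that begins with `Mozilla/5.0 (compatible; ...` and put the bot name later in the string. The Python sketch below mimics that matching behaviour. The sample header, the relaxed pattern, and the helper `is_flagged` are assumptions made up for illustration, not values taken from real server logs.

```python
import re

# Pattern as written in the .htaccess rules above. SetEnvIfNoCase matches
# case-insensitively and searches anywhere unless the pattern is anchored.
ANCHORED = re.compile(r"^archive.org_bot", re.IGNORECASE)

# Hypothetical relaxed variant: no "^" anchor, literal dot escaped.
UNANCHORED = re.compile(r"archive\.org_bot", re.IGNORECASE)

# Assumed example header for illustration only (not copied from a log).
SAMPLE_UA = "Mozilla/5.0 (compatible; archive.org_bot +http://example.invalid/bot)"


def is_flagged(pattern, user_agent):
    """Return True if the pattern is found anywhere in the User-Agent string."""
    return pattern.search(user_agent) is not None


if __name__ == "__main__":
    print("anchored pattern matches:  ", is_flagged(ANCHORED, SAMPLE_UA))
    print("unanchored pattern matches:", is_flagged(UNANCHORED, SAMPLE_UA))
```

If the server logs show the bot name in the middle of the header, dropping the `^` from the SetEnvIfNoCase patterns would be the corresponding change. This only helps against requests that identify themselves with that name.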

Why do I want to do that?

  1. A violation of the privacy of my website visitors
  2. Traffic

#2

Ask them to exclude you. Having done so myself, I can tell you that they’re responsive to such requests, provided you can prove ownership of the site (they normally ask you to add something to the HTML, etc., to prove you’re the webmaster).


#3

You can disallow that in robots.txt

User-agent: ia_archiver
Disallow: /
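
To double-check that a rule like this says what you intend, you can parse it with Python's standard `urllib.robotparser`. This is a minimal sketch: the helper `is_allowed` and the `example.com` URL are placeholders invented for the example. It only shows how a rule-following parser reads the file; it says nothing about whether a given crawler actually obeys it.

```python
from urllib.robotparser import RobotFileParser

# The rule from the post above, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if a robots.txt-compliant crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/some/page"  # placeholder URL
    print("ia_archiver allowed:", is_allowed(ROBOTS_TXT, "ia_archiver", url))
    print("other bot allowed:  ", is_allowed(ROBOTS_TXT, "SomeOtherBot", url))
```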

#4

Regarding point 1 in your ‘why’ section: How does the archive violate the privacy of your website visitors?


#5

As mentioned in my post I already did that.


#6

They also archive users’ names, photos and comments. The comments are not a privacy violation, but the names and photos are.
Today every user has to decide whether or not to agree to the site’s privacy policy. But users are not aware that their names and/or photos are being scraped and stored by third-party sites like Archive.org.


#7

Yesterday I emailed the European Commission with a request to block the site in Europe, because it does not comply with the General Data Protection Regulation.


#8

They don’t reply. Thanks, though.


#9

If it’s a public website, anyone can do whatever they want with what you post publicly. It’s not against ‘the rules’. Archive will most likely honor your request. Are you going to contact all the search engines too, image catalogers, etc?
The only way to protect your people is through a login system. If you choose to make some information public, it should be by the user’s consent. Think LinkedIn, Facebook, Instagram… users can choose to have their profile (or some content to) be private or public.


#10

“If it’s a public website, anyone can do whatever they want with what you post publicly”
That’s a false statement and not relevant to privacy violation of internet users.

“It’s not against ‘the rules’”
It’s a violation of privacy law. Internet privacy involves the right or mandate of personal privacy concerning the storing, repurposing, provision to third parties, and displaying of information pertaining to oneself via the Internet. Internet privacy is a subset of data privacy.

“Archive will most likely honor your request”
Time will tell.

“Are you going to contact all the search engines too, image catalogers, etc?”
Your arguments are absurd and superficial. The website owner chooses which parties (the search engines) to register the website with, and thereby agrees to their terms of service. Those terms are binding and must be mentioned in the privacy policy, so the website visitor is clearly informed about how information is used. The website owner is also free to block bots he doesn’t want to allow, and the fact is that Archive.org does not respect that. In any case, the website owner has the option to hide the comment section from search engines while keeping it visible on the website.

“The only way to protect your people is through a login system.”
You move the responsibility to the website owner because Archive.org does not respect privacy. Archive.org is not a search engine.

“If you choose to make some information public, it should be by the user’s consent. Think LinkedIn, Facebook, Instagram… users can choose to have their profile (or some content to) be private or public.”
You move the responsibility to the website owner and website visitor because Archive.org does not respect the rules. Archive.org simply must not scrape names, messages, and photos without permission. Simple.

My question is “How to block Archive.org”. I’m not here to debate with someone who denies the privacy law.


#11

I’m being realistic rather than idealistic. I care about your situation and am trying to give you advice.
It’s like letting a teenage daughter out of the house dressed provocatively. Sure, idealistically no one ‘should’ bother her… realistically, trouble will happen.

Your site will be cached and scraped by a lot of sources you don’t even know about. Following robots.txt is up to those services. And a lot of them don’t play by the rules. The big ones will, like Archive.

As a website owner you don’t feel responsible for the privacy of your users? If your users have issues with their information being at Archive or anywhere else, you’re going to be their first call. You might get into legal trouble.

CYA is all it comes down to. Don’t count on others to do the right thing. Protect yourself.


#12

The only thing we can do is read their FAQ:

https://archive.org/about/faqs.php#2

How can I have my site’s pages excluded from the Wayback Machine?
You can send an email request for us to review to [email protected] with the URL (web address) in the text of your message.

http://archive.is/faq

Why does archive.is not obey robots.txt?
Because it is not a free-walking crawler, it saves only one page acting as a direct agent of the human user. Such services don’t obey robots.txt (e.g. Google Feedfetcher, screenshot- or pdf-making services, isup.me, …)

You would also need to blacklist Googlebot and Bingbot from your webpages, since they both cache webpages.

And these are just the good actors - you can prevent the Google web cache and the Bing cache with the noarchive meta tag, but there are many closed-door scrapers saving content regardless of whether or not the site owner wants it saved.
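
In HTML the tag is `<meta name="robots" content="noarchive">`. The same directive can also be sent as an `X-Robots-Tag` HTTP response header, which is a swapped-in equivalent and not something mentioned in the FAQs quoted above. The sketch below assumes the site is served by a Python WSGI application. `add_noarchive_header` and `demo_app` are hypothetical names for illustration, and only crawlers that choose to honor the directive will respect it.

```python
def add_noarchive_header(app):
    """Wrap a WSGI app so every response carries X-Robots-Tag: noarchive."""

    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Drop any existing X-Robots-Tag so the header is not duplicated.
            headers = [h for h in headers if h[0].lower() != "x-robots-tag"]
            headers.append(("X-Robots-Tag", "noarchive"))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real site, used only for this example."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<p>hello</p>"]


# The wrapped app can be handed to any WSGI server in place of demo_app.
application = add_noarchive_header(demo_app)
```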

Basically, if you don’t want content being accessed, you should implement privacy settings or restrict content via a password (as noted above). Restricting archive.is/archive.org/Google/Bing won’t protect your content from scraping, or from users simply using right click -> save as to save the entire page to their computer.

