Hotlink prevention that still allows Google full access


#1

Hi, I am trying to prevent hotlinking of images from my various websites, but I do not want to block, hinder or in any way affect what Google does.
I initially set up an .htaccess file which allows Google and other search engines to do what they want but blocks others from hotlinking. This should have been ideal except that it does not work with Cloudflare.

The generic block-everything hotlink protection offered by Cloudflare is no good because it would interfere with Google, and it is absolutely critical that Google is unhindered. I would rather give up on the idea than mess with Google.

Anyway, I have read everything I can find on the Cloudflare community and it seems my best hope lies with something called “workers”. There is a sample of hotlink protection code at https://developers.cloudflare.com/workers/recipes/hotlink-protection/ but unfortunately this is of the block-everything variety. Is there a template in existence which blocks hotlinking but allows Google to hotlink?

Unfortunately I am not a programmer and there is little chance of me figuring out the necessary lines of code. I can manage copy and paste and I can duplicate lines to add different search engines, which is what I did with the .htaccess. Anything more complicated is beyond me.

Any help appreciated.


#2

You can experiment with Firewall Rules. Something along the lines of:

URI Path contains jpg
AND Referrer is NOT your domain
AND User Agent is NOT Googlebot

Block

Hopefully someone double-checks my logic.
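
For anyone wanting to double-check the logic, the three conditions can be written out as a small function. This is only a sketch of how the rule reads, not Cloudflare's implementation: `ImageRequest`, `OWN_DOMAIN` and `shouldBlockSimple` are made-up names, and `example.com` stands in for your own domain.

```typescript
// Sketch of the three-condition rule above. All names are hypothetical;
// "example.com" stands in for your own domain.
interface ImageRequest {
  path: string;      // request path, e.g. "/photos/sunset.jpg"
  referer: string;   // Referer header value, "" when the header is absent
  userAgent: string; // User-Agent header value
}

const OWN_DOMAIN = "example.com";

function shouldBlockSimple(req: ImageRequest): boolean {
  const isImage = req.path.includes("jpg");                    // URI Path contains jpg
  const fromOwnSite = req.referer.includes(OWN_DOMAIN);        // Referrer is your domain
  const claimsGooglebot = req.userAgent.includes("Googlebot"); // User Agent is Googlebot
  // Block only when all three conditions of the rule hold.
  return isImage && !fromOwnSite && !claimsGooglebot;
}
```

Read this way, a request with no referrer at all from an ordinary browser is also blocked, because an empty referrer does not contain your domain.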


#3

Thanks, something like that might work. I am, however, very nervous of experimenting with my main website as that’s where 100% of my income comes from. I’m a photographer and so Google image search is critical. If my images are on the first few lines I sell them; if not, I don’t, and go hungry, so I need to be very careful that I don’t cause a drop in rankings.


#4

:) That should work, just two comments: a) technically it wouldn’t let through only Google but everyone who claims to be Google (though that would be a crawler issue rather than a hotlinking one) and b) I wouldn’t expect Google to send a referrer (particularly a third-party one) anyhow. Are there occasions when it sends one?


#5

I’d probably go for a Firewall Rule set to BLOCK based on:

(http.request.method eq "GET" and http.request.uri.path contains "jpg" and not (http.referer contains "yourdomain.com" or cf.client.bot))

This leverages the cf.client.bot parameter to ‘whitelist’ all the bots Cloudflare is aware of. After all, I presume you’d want most indexers to see your images and this puts the onus onto CF to keep the matching updated as bots change their identifying fingerprint.

(Obviously replace yourdomain.com with your domain name (excluding any subdomain) and alter the ‘jpg’ pattern to whatever your image format is, or maybe use ‘path_to_your_images’ instead there.)
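
One thing to keep in mind with a plain `contains` test is that it matches your domain anywhere in the referrer string, so a referrer such as `https://yourdomain.com.some-other-site.invalid/` would also pass. If that ever mattered (for example inside a Worker), a stricter comparison on the host part could look roughly like the sketch below. `refererHost` and `isOwnReferer` are hypothetical helper names, not Cloudflare functions; in a Worker the input would typically come from `request.headers.get("Referer")`.

```typescript
// Extract the host from a Referer value using a simple pattern.
// Returns "" when the value is empty or not an http(s) URL.
function refererHost(referer: string): string {
  const match = /^https?:\/\/([^\/:?#]+)/i.exec(referer);
  return match ? match[1].toLowerCase() : "";
}

// True when the referrer's host is the given domain or one of its subdomains.
function isOwnReferer(referer: string, ownDomain: string): boolean {
  const host = refererHost(referer);
  return host === ownDomain || host.endsWith("." + ownDomain);
}
```

The trade-off is that this is more code to maintain than a one-line rule, so it is only worth it if look-alike referrers actually show up in your logs.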


#6

Thanks for the info so far.

Just a thought, rather than try to block everything except search engines, would it be easier if I just blocked specific scraper sites?

I’m not overly concerned about every single hotlink but I have more than 25000 images online and there are a few scraper sites which seem to hotlink all of them. It’s these I am keen to block. I could list them out one at a time in an .htaccess or similar.


#7

So far you were talking about hotlinking. Scraping is a different issue and requires a completely different approach.

What is it you want to eventually address?


#8

By scraping, I mean third-party operators who have a bot which goes to my website, gathers all my content and then regurgitates all the images as hotlinks on their own website, typically so-called “wallpaper” websites.

I’m not so concerned about the bot crawling the site. I’m more interested in stopping the hotlinking.


#9

That is not so much scraping but really just hotlinking, in which case the approach outlined by @sdayman should work. Keep in mind you should also accept empty referrers. Basically every non-empty referrer not containing your domain should be blocked.

There are some ways for a site to stop the referrer from being sent, so that approach might still allow some hotlinking, but there are also browsers simply not sending a referrer. It really comes down to your level of “concern”. If you want to rule out that backdoor as well you’d really need to reject empty referrers too and require your domain in it.
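
The two options (accept empty referrers, or reject them as well) differ in just one branch. A minimal sketch, with `EmptyRefererPolicy` and `isBlockedHotlink` as made-up names and the same simple substring match the rules in this thread use:

```typescript
// "allow": requests without a referrer get the image (lenient option).
// "block": requests without a referrer are rejected too (strict option).
type EmptyRefererPolicy = "allow" | "block";

function isBlockedHotlink(
  referer: string,        // Referer header value, "" when absent
  ownDomain: string,      // e.g. "example.com" (placeholder)
  emptyPolicy: EmptyRefererPolicy
): boolean {
  if (referer === "") {
    // No referrer sent: the outcome depends purely on the chosen policy.
    return emptyPolicy === "block";
  }
  // Any non-empty referrer that does not mention your domain is a hotlink.
  return !referer.includes(ownDomain);
}
```

With `"allow"`, a visitor whose browser strips the referrer still sees the image, but so does a hotlinking site that suppresses it. With `"block"`, both are rejected.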


#10

Mate, you’ll be at it all day if you go down the ‘allow all, block a blacklist’ route. As soon as a scraper sees they’re banned they’ll change user-agent. To be honest, there’s no approach that’ll stop scraping without impacting users too, unless you wanted to implement, say, rate-limiting (which is available via Cloudflare). If an image is online and someone can see it, then someone can scrape it if they care enough. Hot-linking, however, can be managed via the rule I quoted above.

As @sandro has pointed out you can get empty referers. The bulk of those I’ve seen are HTTPS sites linking to HTTP resources so if you run your site on HTTPS you shouldn’t have that much traffic. I’m not sure if catering for those would be good or bad so bow to his advice if he has experience with it. My gut says to not allow them which leaves my suggested rule as initially quoted previously.

With that rule in place you should be getting close to what you want hotlink-wise. You can extend the model by adding additional Firewall Rules to now take care of the scrapers as well, eg:

  • Maybe you don’t deal with certain countries - block countries hosting scrapers by using a block rule based on ip.geoip.country in { "country1" "country2"...}
  • Maybe these scraper sites have IP addresses you can ban with ip.src eq 1.2.3.4, or whole ASNs (ip.geoip.asnum) if the IPs change often.
  • Maybe add a rule to block the more common scraper tool user-agents such as those containing ‘curl’, ‘wget’, ‘HTTrack’, ‘Offline Explorer’, ‘Scrapy’, ‘SiteSnagger’, ‘TeleportPro’, ‘WebCopier’, ‘WebReaper’, ‘WebStripper’, ‘WebZIP’, ‘zgrab’ etc. etc.
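
As a rough illustration of the user-agent idea in the last bullet, the check boils down to a substring match against a list. In this sketch `SCRAPER_AGENT_MARKERS` and `looksLikeScraperTool` are hypothetical names, and matching case-insensitively is my own assumption rather than something the rule above does.

```typescript
// Substrings taken from the list above; extend or trim as needed.
const SCRAPER_AGENT_MARKERS: string[] = [
  "curl", "wget", "HTTrack", "Offline Explorer", "Scrapy", "SiteSnagger",
  "TeleportPro", "WebCopier", "WebReaper", "WebStripper", "WebZIP", "zgrab",
];

// True when the User-Agent contains any known scraper-tool marker.
// Comparison is case-insensitive (an assumption made for this sketch).
function looksLikeScraperTool(userAgent: string): boolean {
  const ua = userAgent.toLowerCase();
  return SCRAPER_AGENT_MARKERS.some((marker) => ua.includes(marker.toLowerCase()));
}
```

As noted above, anyone determined can simply change their user-agent, so this only catches tools running with their default settings.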

#11

Thanks, blocking by country would do it. Most of the bigger and more annoying hotlinking sites come out of countries such as China or Russia and I sell very little outside the EU and English-speaking world. Thanks :o)


#12

Just so you are aware of it, that will only block bots from those countries, which crawl your site and gather your links (not your content). This type of block will not be effective if they manage to get hold of your URLs in any other way.


#13

For the avoidance of doubt in case @sandro is referring to my first bullet-point above, a Firewall Rule using ip.geoip.country in { "country1" "country2"...} will block all access from those countries, not just bots. You’d have to combine that in a clause with cf.client.bot to just block bots from a specific country.
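
To make the difference concrete, here is a small sketch of the two variants. The names are made up, the `knownBot` flag is just a stand-in for whatever `cf.client.bot` evaluates to for a request, and the country codes are only illustrative (taken from the countries mentioned earlier in the thread).

```typescript
// Illustrative country codes; replace with whatever you actually want to block.
const BLOCKED_COUNTRIES = new Set<string>(["CN", "RU"]);

// Variant 1: country-only rule. Blocks every visitor from the listed countries.
function blockAllFromCountry(country: string): boolean {
  return BLOCKED_COUNTRIES.has(country);
}

// Variant 2: country combined with the bot flag. Blocks only requests that
// are both from a listed country and flagged as a bot.
function blockBotsFromCountry(country: string, knownBot: boolean): boolean {
  return BLOCKED_COUNTRIES.has(country) && knownBot;
}
```

So a human visitor from a listed country is blocked by the first variant but not by the second.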


#14

I was referring to blocking countries when the presumed goal is to block hotlinking.