Prevent a website from "reading" my content

Hi there,

I’m trying to set up a firewall rule that will prevent any traffic from the website www.justtherecipe.app from accessing my website.

I run a food blog, and basically this website strips out everything on my pages except the recipe itself, which causes me to lose revenue.

For example: Just the Recipe

I’ve tried blocking the IP address of the site in the firewall rules but have had no luck.

How can I prevent this website from fetching the content of my site?

Oh…them. I think they were not well-received on Twitter.

You’d have to look at the logs on your server to see what their requests look like. They might have a specific user-agent string, or a consistent IP address their scraper is using.
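If it helps, here is a rough sketch of how you could sift an access log for that kind of pattern. It assumes your server writes the usual "combined" log format (IP, timestamp, request, status, size, referrer, user-agent); the function names and the sample line are made up for illustration, so adjust the pattern if your log layout differs.

```python
import re
from collections import Counter

# Combined Log Format (assumed):
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines, referer_substring):
    """Count client IPs and user agents for requests whose referrer contains the substring."""
    ips, agents = Counter(), Counter()
    for line in lines:
        entry = parse_line(line)
        if entry and referer_substring in entry["referer"]:
            ips[entry["ip"]] += 1
            agents[entry["agent"]] += 1
    return ips, agents


if __name__ == "__main__":
    # Hypothetical sample line, not taken from a real log.
    sample = [
        '203.0.113.7 - - [01/Jan/2021:00:00:00 +0000] "GET /some-recipe/ HTTP/1.1" '
        '200 1234 "https://scraper.example/" "ExampleAgent/1.0"',
    ]
    ips, agents = summarize(sample, "scraper.example")
    print(ips.most_common(5))
    print(agents.most_common(5))
```

If one IP or one user-agent string dominates the counts, that is a candidate to block; if the counts are spread across many ordinary browser strings, the requests are probably coming from regular visitors' browsers.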

Yes…them :slight_smile:

So I took a look at the logs, and it seems that the IPs are coming from Cloudflare:

162.158.62.102 - - [07/May/2021:10:01:09 -0400] "GET /broccoli-potato-soup/ HTTP/1.1" 200 61663 "https://www.justtherecipe.app/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36 Edg/90.0.818.51"

108.162.219.97 - - [07/May/2021:10:35:17 -0400] "GET /wp-content/uploads/2019/02/Instant-Pot-Vegetable-Quinoa-Soup-11.jpg HTTP/1.1" 304 0 "https://www.justtherecipe.app/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36 Edg/90.0.818.51"

162.158.62.178 - - [07/May/2021:11:53:21 -0400] "GET /wp-content/uploads/2021/03/Cabbage-Quinoa-Salad-Feat-Image-Square-1200x1200-1.jpg HTTP/1.1" 304 0 "https://www.justtherecipe.app/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36 Edg/90.0.818.51"

These IPs seem to belong to Cloudflare.

I’m guessing blocking these would be a bad idea?

Yes. A good idea would be to configure your server to restore the actual Visitor IP addresses.
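Normally you would restore the visitor IP in the web server itself, but the idea can be sketched in a few lines. This assumes Cloudflare passes the visitor's address in the `CF-Connecting-IP` request header, and it only trusts that header when the connection really comes from a Cloudflare address. The two ranges below are just an illustrative subset that covers the addresses in the log above; the full list published by Cloudflare should be used in practice. The function names and the plain-dict `headers` argument are hypothetical.

```python
import ipaddress

# Illustrative subset of Cloudflare's published IPv4 ranges (assumption:
# check the current published list before relying on this).
CLOUDFLARE_RANGES = [
    ipaddress.ip_network("162.158.0.0/15"),
    ipaddress.ip_network("108.162.192.0/18"),
]


def is_cloudflare(ip):
    """Return True if the address falls inside one of the listed ranges."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on malformed input
    return any(addr in net for net in CLOUDFLARE_RANGES)


def visitor_ip(remote_addr, headers):
    """Return the visitor's IP, trusting CF-Connecting-IP only behind Cloudflare.

    `headers` is assumed to be a plain dict keyed by the exact header name.
    """
    forwarded = headers.get("CF-Connecting-IP")
    if forwarded and is_cloudflare(remote_addr):
        return forwarded
    return remote_addr


if __name__ == "__main__":
    # 203.0.113.7 is a documentation address standing in for a real visitor.
    print(visitor_ip("162.158.62.102", {"CF-Connecting-IP": "203.0.113.7"}))
```

Checking the connecting address first matters: anyone who reaches the origin directly could otherwise send a forged header and pretend to be a different visitor.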

Another good idea, and this should fix it, is to block the referrer domain with a Firewall Rule:
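The rule itself is set up in the Cloudflare dashboard, and I believe the expression would be something along the lines of `http.referer contains "justtherecipe.app"` with the action set to Block. To show what that check amounts to, here is a small sketch of the same logic as it might look if you did it on your own server. The set of blocked domains and the function name are assumptions for illustration, and this version matches on the referrer's host name rather than on any substring of the header.

```python
from urllib.parse import urlsplit

# Domains whose pages should not be allowed to pull content (assumed list).
BLOCKED_REFERRER_DOMAINS = {"justtherecipe.app"}


def is_blocked_referrer(referer):
    """Return True if the Referer header points at a blocked domain or a subdomain of it."""
    if not referer:
        return False
    # hostname is None when the value is not a full URL with a scheme.
    host = (urlsplit(referer).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in BLOCKED_REFERRER_DOMAINS
    )


if __name__ == "__main__":
    print(is_blocked_referrer("https://www.justtherecipe.app/"))
    print(is_blocked_referrer("https://www.google.com/"))
```

Keep in mind that the referrer is supplied by the client, so a scraper that omits or changes the header will get past this kind of rule.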

Thank you! I will try this out and see if it works.

Interesting. I now see the real IPs, and when I test it myself, the request shows as coming from my own IP.
It looks like just the images are being blocked.

That is interesting, but not surprising. If a request carries the recipe site as its referrer, that means the visitor is on the recipe site, but the recipe content itself is being requested from your server by the visitor’s own browser. I’m not sure exactly how that site works, but the referrer block should stop that type of traffic.

It should block the text as well. The first request in your screenshot is for the broccoli potato soup page, and it has the referrer that should get blocked.

It could be that it scrapes the text once, then saves it. But they don’t want to rack up bandwidth charges for the images, so they continue to load those directly from your site.

Just keep an eye on your logs for more clues. I’d love to see how this works out.

The block is working. As @sdayman pointed out, they already have many of your recipes cached, but I found at least one that they couldn’t fetch.

Interesting. I’ll need to wait and see whether this is working. Thanks for all your help, @floripare and @sdayman.

Funnily enough, you could probably use Workers to detect their scraper and scramble the data it scrapes. I know that multiple sites do this with bots: instead of blocking them, they show them different price tags :slight_smile:. You would likely get blacklisted for “trolling”, and their scraper wouldn’t come back.

That sounds like what Troy Hunt did with Coinhive.
