Stop reverse proxy scraping

Hi,

Someone is scraping my site using a reverse proxy. Their copy seems to update almost on the fly whenever I update mine. I have managed to make the site return a 403 by blocking their IP in .htaccess, and I have also blocked the IP in Cloudflare’s firewall settings. However, the site still seems to slip through somehow: if I sit and refresh their URL, it eventually loads, although it seems unable to load CSS and images. I can still see that it has been updated with my newest posts, though.

Does anyone have any more suggestions on what I can do to combat this? I worry Google will continue to index the site, which it has done in the past; it is even ranking, sometimes quite well, which is just insane :frowning:

I’m not confident that I understood your problem; however, if you want to fight people stealing content from your website, you need to be aware of the following:

  1. It’s almost a losing battle; somebody can always navigate through your site and dump your content manually.
  2. If the people attacking your site are using automated tools, you can certainly add some layers to slow them down or even stop them completely.

Cloudflare has some built-in tools for this:

  • JS Challenge or CAPTCHA (to stop the most basic scrapers).
  • Super Bot Fight Mode.
  • Rate limiting. Scrapers are typically very aggressive, so you can tune some rules to stop them.

As mentioned, they use a reverse proxy. So when someone visits one of my URLs, the matching URL on the mirroring site is updated. That’s also how I can find their IP: I just add some bogus path after my URL, like www.example.com/hackershavesmallpenises, request it through the mirror, and then check my logs and block the IP of the reverse proxy.

I was wondering if there is some automated way of dealing with this, so that I don’t have to wait until the next similar clone appears on Google, at which point it’s too late because the clone has already been indexed and is ranking.
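The bogus-URL trick described above could be automated along these lines. This is only a minimal sketch: the function name `findCanaryHits` and the canary path are made up for illustration, and it assumes the access log uses the Common/Combined Log Format and already records the real client IP (behind Cloudflare that usually means restoring it from the `CF-Connecting-IP` header first).

```typescript
// Hypothetical helper: list the client IPs that requested a canary path.
// Assumes log lines in Common/Combined Log Format, for example:
// 203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /some-path HTTP/1.1" 404 153
function findCanaryHits(logLines: string[], canaryPath: string): string[] {
  // Group 1: client IP (first field). Group 2: request target inside the quoted request line.
  const linePattern = /^(\S+) \S+ \S+ \[[^\]]*\] "[A-Z]+ (\S+) [^"]*"/;
  const hits: string[] = [];
  for (const line of logLines) {
    const match = linePattern.exec(line);
    if (match === null) continue; // skip lines that do not look like log entries
    const ip = match[1] ?? "";
    const target = match[2] ?? "";
    // Compare only the path, ignoring any query string.
    const pathOnly = target.split("?")[0] ?? "";
    if (pathOnly === canaryPath && hits.indexOf(ip) === -1) {
      hits.push(ip);
    }
  }
  return hits;
}
```

The idea is to request a path that no real visitor would ever ask for through the mirror, then feed the log to the function; whatever IPs come back are candidates for blocking.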

Would local analytics show you the IP address with a ridiculous number of requests?

This is the curious part. But if the IP addresses are all in the same ASN, that might be a way to implement a broader block. Maybe also drop the Challenge Passage setting to something super low, like 5 or 15 minutes.


There is not much you can do on the server side. However, you can add some client-side security that would force the attackers to take countermeasures against your scripts.
For example, make your code depend on a JS file that checks that the document location is your domain and that the referrer and other parameters match. The attackers can easily tamper with this check, so obfuscating the script can be handy.
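A very small sketch of such a domain check might look like the following. The names `isOfficialHost` and `enforceDomainLock` are hypothetical, and `example.com` stands in for the real domain. On its own it is easy to strip or rewrite, which is why obfuscation comes up in this context.

```typescript
// Hypothetical domain-lock check; "example.com" is a placeholder for the real domain.
function isOfficialHost(hostname: string, officialDomain: string): boolean {
  const host = hostname.toLowerCase().replace(/\.$/, ""); // drop a trailing dot, if any
  const domain = officialDomain.toLowerCase();
  if (host === domain) return true;
  // Accept true subdomains only, so "notexample.com" and
  // "example.com.evil.net" do not pass.
  const suffix = "." + domain;
  return host.length > suffix.length && host.slice(-suffix.length) === suffix;
}

// Runs the supplied reaction when the page is served from an unexpected host.
function enforceDomainLock(
  currentHostname: string,
  officialDomain: string,
  onMismatch: () => void
): void {
  if (!isOfficialHost(currentHostname, officialDomain)) {
    onMismatch();
  }
}

// In a browser this might be wired up as (assumption, not from the thread):
// enforceDomainLock(window.location.hostname, "example.com", () => {
//   window.location.replace("https://example.com" + window.location.pathname);
// });
```

Redirecting to the real site is just one possible reaction; blanking the page or reporting the mirror's hostname back to the origin would work the same way.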

If this is truly a concern for your business/website, you can work around the concept of integrity checking and build a more complex solution.
Ultimately, since the proxy is not executing any of the code, you could detect an unwanted MITM by playing with asymmetric encryption and performing a cookie exchange.
The client can write to /challenge/ its encrypted IP; if you decrypt the value at your server and find that the IPs mismatch (incoming connection != solved challenge), the client is most likely proxied.
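To make the IP-mismatch idea a bit more concrete, here is a simplified server-side sketch. Note that it swaps the asymmetric encryption mentioned above for an HMAC-signed token, which is a different but simpler technique: the server binds the IP it saw on the page request into a token, the page's script sends that token straight to /challenge/ on the real domain, and the server compares the two IPs. All function names, the token layout, and the five-minute lifetime are assumptions for illustration only.

```typescript
import { createHmac, timingSafeEqual } from "crypto";

type Verdict = "direct" | "likely-proxied" | "invalid";

// Signs a payload with a server-side secret (HMAC-SHA256, hex encoded).
function sign(payload: string, secret: string): string {
  return createHmac("sha256", secret).update(payload).digest("hex");
}

// Called while rendering a page: binds the IP seen on the *page* request.
// Hypothetical token layout: "<ip>|<issuedAtMs>|<hex mac>".
function issueToken(pageRequestIp: string, issuedAtMs: number, secret: string): string {
  const payload = `${pageRequestIp}|${issuedAtMs}`;
  return `${payload}|${sign(payload, secret)}`;
}

// Called by the /challenge/ handler with the IP seen on the *challenge* request.
function checkToken(
  token: string,
  challengeRequestIp: string,
  nowMs: number,
  secret: string,
  maxAgeMs: number = 5 * 60 * 1000
): Verdict {
  const parts = token.split("|");
  if (parts.length !== 3) return "invalid";
  const [ip = "", issuedAt = "", mac = ""] = parts;

  // Reject tokens whose signature does not verify (tampered or forged).
  const given = Buffer.from(mac, "hex");
  const expected = Buffer.from(sign(`${ip}|${issuedAt}`, secret), "hex");
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) {
    return "invalid";
  }

  // Reject tokens that are too old or dated in the future.
  const issued = Number(issuedAt);
  if (!isFinite(issued) || issued > nowMs || nowMs - issued > maxAgeMs) {
    return "invalid";
  }

  // Page fetched by one IP but challenge solved from another:
  // the page was most likely delivered through a proxy.
  return ip === challengeRequestIp ? "direct" : "likely-proxied";
}
```

Behind Cloudflare, the IPs passed in would presumably come from the `CF-Connecting-IP` header rather than the socket. A mismatch is only a hint, since a legitimate visitor's address can also change between two requests, and a proxy that rewrites the challenge URL to pass through itself would defeat this simple version.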

Finally, there are several commercial products on the market that address this; however, their pricing is considerably high (15k per year and up).


I only see one IP in the logs. The user agent is that of whoever visited the mirror, but the IP is the hacker’s. I have blocked that IP both in cPanel and in the firewall settings in CF. But if I refresh the mirroring site over and over, it eventually loads, and I can see that it has indeed been updated with my latest posts, so it’s still slipping through somehow. I don’t see the blocked IP in my logs anymore, since it has been blocked. I don’t know whether they use more IP addresses that I haven’t found yet. However, their CSS and images are no longer loading, so the block does have some effect.

Interesting, although I’m not anywhere near the level of being technical enough to even understand how this works, let alone implement it. Might use the information if I approach a programmer though, so thank you.

You say there are services, albeit expensive, that take care of this. Do you have any examples?


I typically do not recommend any service in this forum. However, since there is a lot of snake oil in the cyber security industry (especially regarding DRM/obfuscation products), I’ll allow myself to make an exception :stuck_out_tongue: .

Jscrambler has two features that I believe would be useful for your case:

  1. Domain Locks.
  2. Runtime defenses.

Note that they have a startup package that is considerably cheaper; however, such packages typically reduce the protection strength significantly.

Here is a quote from their site regarding the protection that you are interested in:

Preventing Scam Copycat Pages

Considering all the files of the scam page that we analyzed, we conclude that the JavaScript source code of the scam page is actually quite distinct from the source code of the official Celsius page. It appears that they did indeed copy/mimicked the HTML and CSS source code but did not copy any of the JavaScript logic of Celsius’ platform.

However, this is not always the case. If attackers for some reason choose to copy the entire source code, there are some preventive measures that can raise the cost of creating the copycat page. For one, protecting the website source code with runtime defenses will make it much more difficult for attackers to retrieve the source code. These runtime defenses not only prevent the usage of debuggers but also derail execution if the protected code is changed (which would be the case when attackers want to create a new scam landing page). Additionally, it’s possible to protect this same source code with domain locks that, when coupled with real-time notifications, trigger a critical security warning when someone attempts to run the code outside of the official company domain.

These, of course, are not foolproof measures, as it’s technically impossible to ensure that a website’s source code can’t be retrieved or reused by attackers. But the key takeaway here is that these security controls can vastly increase the cost of the attack to the point where scammers are more likely to move on to other targets.


Thank you, I will look it up :slight_smile:

What I find odd is this…

I have blocked their IP, so when I visit the mirroring reverse proxy site it’s not accessible. But if I sit and refresh their domain, it will eventually load. I have added their IP to the CF firewall, so how is this possible? When it loads, I can also see that their website has updated automatically with fresh content from my website. This should not be possible either, since I have blocked their IP in .htaccess and in Cloudflare. They don’t use another IP.

I even blocked their Cloudflare IP addresses in the firewall. Cloudflare refuses to remove this site from their services.
