Broken-link crawler overloaded the website's server CPU

Hello,

In order to find broken links on our website, we used the drlinkcheck.com service.

When it scanned the website (a news site with over 7,000 posts) for broken links, it pushed the server CPU to 100% and even caused a few minutes of downtime (until we figured out it was the crawler).

Checking the logs, we see requests coming from their crawler, averaging between 20 and 60 per minute.

Questions:

1. Weren't Cloudflare's WAF and Security supposed to block it for flooding? Otherwise, anyone could use such services to target other websites, create an overload, and cause damage, so how come Cloudflare didn't respond?

2. What's the best way to prevent such an overload in the future? Of course, we need to be careful with legitimate bots like Google.

We have the Pro plan and are also using Rate Limiting (for wp-login and xmlrpc).

Thanks!!

Anyone, please?

You could also set rate limiting on your news posts if they really are that performance-demanding. It also sounds like you should set up some server-side caching if you're able to; getting your server overloaded by only 1-2 requests per second suggests there's room for quite a lot of improvement on your origin server.
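If you do go the rate-limiting route, here's a minimal sketch of creating a URL-scoped rule through Cloudflare's (legacy) Rate Limiting API instead of the dashboard. The zone ID, API token, hostname, threshold, and period below are all placeholders you'd need to adjust for your own zone:

```python
import requests

# Placeholders -- substitute your own zone ID, API token, hostname,
# and thresholds before running this.
ZONE_ID = "your_zone_id"
API_TOKEN = "your_api_token"

rule = {
    "description": "Rate limit article pages",
    "match": {
        "request": {
            "methods": ["GET"],
            "schemes": ["_ALL_"],
            # Matches every path on the site, since posts live at /post-title
            "url": "*example.com/*",
        }
    },
    # Challenge any client that makes more than 60 requests in 60 seconds
    "threshold": 60,
    "period": 60,
    "action": {"mode": "challenge"},
}

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/rate_limits",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=rule,
)
resp.raise_for_status()
print(resp.json())
```

The challenge action only kicks in for clients that exceed the threshold, so normal visitors shouldn't notice it.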


Thanks for the reply. We're using Cloudways on Google Cloud; I'll check.
How can you rate limit the news posts only? The URL structure is domain.com/post-title for the whole website.
Rate Limiting sounds like a proper solution; however, I'm afraid of blocking Google bots and/or other legitimate bots by mistake.

Hello? Can anyone help with this?
Maybe help set up a Rate Limiting rule that won't block any legitimate bot (such as Google)?

You're talking about a hit rate of one request every 1-3 seconds. That's truly a crawl; a regular visitor will hit a page and initiate upwards of 100 requests in a second or two for all the CSS, JS, and images. That's for a single visitor.

@arunesh90 already outlined the core issue. Your server can’t keep up with a below-average load.

To answer your first question, there is nothing in that crawl that would trigger WAF or Security. It was the equivalent of someone with a short attention span looking at a bunch of your site’s pages.


Thanks for the answer, but looking at the logs, there were 30 requests (most of them 404s) within 5 seconds.
That is definitely unusual behaviour.
I believe running a few bots like this could damage even a strong server (like ours),
and it's definitely something we should block. Can you please suggest the best way, considering there are “good” bots? (I don't see Googlebot sending 30 requests in 5 seconds, by the way.)

Thanks again!

Not really.

So we know Rate Limiting won't work, because the crawler is not exceeding normal traffic rates. I just tried a Rate Limit of 100 requests per minute and got locked out of my test site on the third page I visited. That third page didn't finish loading because it crossed the 100-request threshold.

You can try Bot Fight Mode in Firewall -> Settings; that might slow down unwanted bots. You can also try Firewall Rules with a Challenge action on a low Threat Score threshold. You'll have to determine that threshold through trial and error.
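If it helps, here's a rough sketch of creating such a Challenge rule through the (legacy) Firewall Rules API rather than the dashboard. The zone ID, API token, and the threat-score threshold of 14 are placeholders to tune, and the `cf.client.bot` field is used so verified known bots such as Googlebot skip the challenge:

```python
import requests

# Placeholders -- replace with your own zone ID and API token.
ZONE_ID = "your_zone_id"
API_TOKEN = "your_api_token"

# Challenge requests whose threat score exceeds the chosen threshold,
# but skip verified known bots (Googlebot etc.) so legitimate crawlers
# are never challenged.
rules = [
    {
        "description": "Challenge suspicious visitors",
        "action": "challenge",
        "filter": {
            "expression": "(cf.threat_score gt 14) and not cf.client.bot"
        },
    }
]

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/firewall/rules",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=rules,
)
resp.raise_for_status()
print(resp.json())
```

The same rule can be built in the dashboard under Firewall Rules; start with a fairly low threshold and loosen it if real visitors start getting challenged.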

But as you stated, you initiated a valid series of requests to your own site and overloaded your own server.

Thanks for your answer.
I just turned on Bot Fight Mode; I didn't know about it.

About the Firewall Rules, what number would be considered a “low threshold” score? Also, will it challenge only bots, or humans as well?