Googlebot DDoS?

So I’ve had a site that’s been under a pretty heavy load lately. I have a dedicated server with VERY respectable specs, with only two sites on it. Both of these sites are protected via Cloudflare.

For the last few weeks, my dedicated server, which usually runs at about a 0.9 one-minute load average, has been running constantly at anywhere between 30 and 70. Yesterday, I finally had some time to dig into it. I found out that my TCP stack was overloaded and that none of my outgoing traffic (reCAPTCHA, emails, etc.) was working.

I’ll leave the multitude of steps I went through out of this explanation, as it’s pretty lengthy. One of the things I finally did (after I implemented iptables rules to verify my traffic was ONLY coming through Cloudflare) was to install mod_cloudflare to get the original IPs from the incoming traffic. I hadn’t bothered doing this at first, because I believed that if I was being hit from a specific IP or IP range, Cloudflare’s DDoS protection would have kicked in.
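For reference, here’s a rough sketch of the kind of iptables rules I mean. It pulls Cloudflare’s published IPv4 ranges from https://www.cloudflare.com/ips-v4 and only lets those reach ports 80/443 (the ports and the bare INPUT-chain layout are assumptions; adapt it to your own ruleset):

# allow web traffic only from Cloudflare's published IPv4 ranges
for range in $(curl -s https://www.cloudflare.com/ips-v4); do
    iptables -A INPUT -p tcp -m multiport --dports 80,443 -s "$range" -j ACCEPT
done
# drop anything else that tries to hit the web ports directly
iptables -A INPUT -p tcp -m multiport --dports 80,443 -j DROP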

Yeah, I was wrong.

So, I found out that more than 95% of my connections were coming from IPs ranging from 66.249.64.xx to 66.249.79.xx. And each one of those was hitting a dynamic page, causing both PHP and MySQL load to climb through the roof.

I went into Cloudflare, edited the firewall rules of my free account, and blocked 66.249.0.0/16 (yeah, I know I could have been more exact, but at this point it was 4 AM and I wanted to nail this thing).

Boom. My traffic dropped and my server load came back down to less than 1.

OK. So that’s where I am right now. The traffic is blocked and my sites are fine.

BUT, I am now blocking Googlebot, which means my sites can no longer be indexed by Google.

So, my questions:

  1. Was this actually Google? Has someone figured out how to weaponize Googlebot (like I said, this has been happening constantly for several weeks at a minimum)? Or can people spoof traffic and make it look like Google?

  2. Someone said that if I upgrade to Pro, the WAF has something that would automatically block this. Is that true?

  3. Someone ELSE said that this was definitely a DDoS and Cloudflare’s mitigation should have stopped it. This makes me wonder, though, whether they have special logic that allows “known bots” unmetered access?

  4. Before someone asks: yeah, I turned on the “I’m Under Attack” mode, but that didn’t seem to mitigate the traffic at all. It even went up higher at the same time I activated it (coincidence, I’m sure).

So… what do I need to do here? I can upgrade to Pro if that will help, but I need some direction. My sites survive because of Google traffic, and I can’t leave that range blocked forever.

Any assistance would be appreciated.

Rather unlikely. These requests came from Google’s infrastructure but most likely from machines rented by third parties.

What you could do is keep the block in place but exclude Google itself from it:

(ip.src in {66.249.0.0/16} and not cf.client.bot)

You’d have a better chance there, but no guarantee either. Cloudflare’s mitigation still often requires manual intervention, unless the IP address is already on some blacklist.

That would mean they either evaluate and execute JavaScript (hence solving the challenge) or they circumvented Cloudflare altogether.

As for the latter, you said these IP addresses only showed up after you installed mod_cloudflare. Is that right? That would suggest these requests actually went through Cloudflare. If not, they might have connected directly and Cloudflare wouldn’t be able to block them in the first place.

  1. Make sure connections can only go through Cloudflare. That is something you need to do on your system’s firewall level.
  2. Extend your firewall rule on Cloudflare to what I posted earlier to block Google’s datacentres, but still allow Google’s web crawler.

Hi Sandro,

Thank you for the detailed response.

Before I installed mod_cloudflare, the logs were showing Cloudflare IP addresses for all incoming connections. After installing mod_cloudflare and restarting Apache, I was able to see the real originating IPs, which is what led me to Google.

Firstly, know that I did enable local iptables rules that validate all incoming HTTP/HTTPS traffic is coming from Cloudflare IP ranges. Plus, the fact that I was able to go into the Cloudflare firewall (in the Tools tab), block 66.249.0.0/16, and have the issue stop shows the traffic is, indeed, flowing through Cloudflare.

So, I went and created the rule in Firewall Rules, EXACTLY as you have it listed. It took me a minute to figure out, since I hadn’t used that interface before. I created the entry, activated it, and then went into Tools and deleted the IP address ban there.

Load on my server IMMEDIATELY went through the roof.

I had to go back to Tools and put the IP range back in, in order to once again mitigate the traffic. Here’s a cut-and-paste of the rule:

(ip.src in {66.249.0.0/16} and not cf.client.bot)

The action is set to BLOCK.

Here’s a section from my Apache logs. It looks, to me, like these all have Googlebot in the user agent (compatible; Googlebot/2.1; +http://www.google.com/bot.html):

66.249.73.187 - - [04/Jul/2020:03:06:15 -0400] "GET /find-new/2057791/posts HTTP/1.1" 200 15708 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ""

66.249.73.172 - - [04/Jul/2020:03:06:15 -0400] "GET /find-new/2057833/posts HTTP/1.1" 200 15704 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ""

66.249.73.166 - - [04/Jul/2020:03:06:15 -0400] "GET /find-new/2057870/posts HTTP/1.1" 200 15703 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ""

66.249.73.186 - - [04/Jul/2020:03:06:15 -0400] "GET /find-new/2057885/posts HTTP/1.1" 200 15703 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ""

66.249.73.160 - - [04/Jul/2020:03:06:15 -0400] "GET /find-new/2057834/posts HTTP/1.1" 200 15704 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ""

66.249.73.184 - - [04/Jul/2020:03:06:15 -0400] "GET /find-new/2057839/posts HTTP/1.1" 200 15704 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ""

66.249.73.191 - - [04/Jul/2020:03:06:15 -0400] "GET /find-new/2057705/posts HTTP/1.1" 200 15706 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" ""

Anyway, I still don’t know how to proceed. Any and all assistance appreciated. :)

EDIT: I removed the website name from the end of each log line. Not sure why I thought I should, but I did. Shouldn’t matter for the discussion.

That address resolves to “crawl-66-249-73-187.googlebot.com”, and coupled with your observation that the rule did not block these requests, it would now actually appear as if these were requests from Google’s web crawler. That is a bit surprising.

Assuming this really is Google, you could try to rate limit them via https://support.google.com/webmasters/answer/48620
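You can double-check that yourself with Google’s recommended two-step verification: a reverse DNS lookup on the IP, then a forward lookup on the returned hostname to confirm it points back to the same address. Something along these lines (using the first IP from your log):

host 66.249.73.187
# should return a name under googlebot.com, e.g. crawl-66-249-73-187.googlebot.com
host crawl-66-249-73-187.googlebot.com
# should resolve back to 66.249.73.187; if the two lookups don't match, the crawler is fake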


Also, unrelated, check out mod_remoteip. mod_cloudflare works great but Cloudflare does not support it any more and mod_remoteip is a standard module as of Apache 2.4.
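A minimal sketch of the corresponding Apache 2.4 config for Cloudflare (the module path and the two ranges shown are placeholders; use the full list from https://www.cloudflare.com/ips/):

# take the real client IP from the header Cloudflare adds to each request
LoadModule remoteip_module modules/mod_remoteip.so
RemoteIPHeader CF-Connecting-IP
# trust only Cloudflare's published ranges to supply that header
RemoteIPTrustedProxy 173.245.48.0/20
RemoteIPTrustedProxy 103.21.244.0/22
# ...one RemoteIPTrustedProxy line per published Cloudflare range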

Hi Sandro,

Once again, thank you for your response.

I went and read the link you sent, but when I click the link in there, it takes me to a page that tells me my site is not recognized. It looks like it’s trying to take me to the old Webmaster Tools, not the new Search Console. And I can’t, for the life of me, find anything in there that lets me limit the crawl speed.

Is there a support email for Search Console or for the crawlers? Guess that’s what I’m going to look for next.

Hmm. When I search for “remoteip” with yum, it shows me mod_cloudflare. Guess I’m gonna need to do it manually.

With some crawlers, one can specify a rate limit in robots.txt. You probably want to take the question to a more Google-specific forum now, however.
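For what it’s worth, that robots.txt rate limit looks like the snippet below. Note that Googlebot is documented to ignore Crawl-delay, so this mainly helps with other crawlers (the 10-second value is just an example):

User-agent: *
Crawl-delay: 10
# delay in seconds between requests; honoured by e.g. Bing, ignored by Googlebot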

https://httpd.apache.org/docs/2.4/mod/mod_remoteip.html

Also, just FYI: I picked a random time and found out Google hit me 106 times in that one second. I’m tempted to write a script to figure out the average hit rate.

Don’t know how that many hits is even okay for any site. 100 hits per second???
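If I do end up writing that script, a quick-and-dirty sketch against the log format above would be something like this (the log path is just a guess; adjust it to yours):

# count Googlebot requests per second, busiest seconds first
grep 'Googlebot' /var/log/httpd/access_log | awk -F'[][]' '{print $2}' | cut -d' ' -f1 | sort | uniq -c | sort -rn | head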

A hundred is not necessarily that bad, but it certainly also depends on what your site executes upon each request. At this point I’d really look into how to rate limit Google and that should be it. Still interesting that they seemingly send that number of requests.

I ran the same check for the minute, and it was over 7,000 hits in that one minute. If you notice, these are all “new posts” queries, which are dynamic searches… so CPU and DB expensive. I wonder if other people can request URLs for Google to visit on my site. I guess they could if they link to them and then let Google follow the links?

If these links are not valid, someone might have pointed Google towards them.

I believe that was me. Cloudflare Pro has WAF support, and some rules are enabled by default while others are disabled by default; you may need to enable additional WAF rules to tackle fake search-engine crawler bots.

But it seems your crawlers may not be fake.

In that case, inspect the crawl rate/performance stats in Google Search Console to see if there’s been an increase, and see https://developers.google.com/search/docs/guides/reduce-crawl-rate and https://support.google.com/webmasters/answer/48620

as well as the coverage report: https://support.google.com/webmasters/answer/7440203

It looks like Google in max crawl rate mode will consider the highest crawl rate to be 2 req/s = 120 requests per minute.

Hi Eva2000, thanks for jumping in. It looks like I was not set up in Webmaster Tools before they switched to the new Search Console, so I am unable to change the crawl rate now. I AM able to see the above page where I can change the crawl rate for some OTHER domains, but not for the one in question. Like so:

[screenshot of the Search Console crawl rate settings]

But when I choose ‘Add property now’, it shows me my domain already on the list. Like a catch-22, it won’t let me update it.

And I WISH it was only hitting me at 2 req/second. I did some spot checking and I’ve got some seconds with over 150 requests in that one second. I believe I posted the info above. As Sandro pointed out, I probably need to reach out to a more Google-centric place for assistance. I just haven’t figured out who/where that is yet.

I believe Google has a few Google Groups forums dedicated to that topic; otherwise there should be plenty of web forums covering it. Reddit might also be an option.

If it’s the legit Google bots, then you can just tell the crawler via robots.txt not to crawl /find-new/ and see if it obeys:

User-agent: *
Disallow: /find-new/

Other than the /find-new/ path for what’s new/recent posts, XenForo forums have other paths for crawlers to index your forum threads anyway.

Cool. I’ll look for those.

Cool again. How often do they read the robots.txt? I mean, 10 seconds with the firewall dropped and my machine’s load spikes and begins its climb of death. Or is that a spiral?

I believe Google fetches robots.txt at least once a day. However, are these not legitimate URLs?

While the URLs are legit URLs, they are meant for live traffic. Nothing I’d want archived or indexed. The sitemap I generate and submit has the proper URLs.

My issue with the robots.txt approach is that I’d need to re-open the firewall to allow Google to eventually (and hopefully) read the robots.txt. I still don’t know why I’m getting 100+ calls per second, so I’m not putting much faith in anything.

In that case you could simply instruct Google not to index them by changing robots.txt. You wouldn’t need to completely open the firewall; the following should be enough:

(ip.src in {66.249.0.0/16} and http.request.uri.path ne "/robots.txt")

Though I’d still try to rate limit Google and find out what exactly they are doing.