Allow Googlebot with rules

What WAF settings do I need to allow Google to crawl my website while maintaining rules to prevent scrapers?

My current settings are causing 403 errors in Search Console.

What I have isn’t working:

Field: user agent
Operator: equals
Value: Googlebot

Does this seem correct?

I thought there was an ‘allow’ option, but now I see JS Challenge, Skip, and a bunch of other options.

I’m a little confused and getting deindexed pages :frowning:

By default, you should not need to allow Googlebot, as Cloudflare features are normally set to bypass the so-called Known Bots, a list that includes search engines like Google, Bing, etc.

The first thing you need to do is find out which Cloudflare service is blocking Googlebot. Visit your Cloudflare Dashboard > Security > Events and filter events using Googlebot’s user agent (or part of it), IP addresses, etc. Once you find events where Googlebot is being blocked, you can then modify the relevant settings to prevent this from happening.

And yes, skip is the new allow.


Thanks!

I’m a newbie so thanks for helping.

I checked the events and searched for ‘Google’ and found a bunch of their IPs that were blocked by a challenge. I can’t see how my rules are blocking them. Instead, I allowlisted their ASN, and that appears to have worked. Now, with the firewall rules on, Google is able to crawl pages in Search Console instead of getting 403 errors.

Is it correct that allowlisting like this trumps my rule with the JS Challenge?

Thanks
Marcus

I meant to say allowlisted, not whitelisted. I added this rule.
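
In the expression editor it looks roughly like this (a sketch from memory; ip.geoip.asnum is the field for the source ASN, and 15169 is Google’s ASN):

ip.geoip.asnum eq 15169

with the action set to Skip.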

Are you sure all requests coming from AS15169 are Googlebot’s? Allowing it would open your sites to requests that may include other Google services, among them services open to use by the public.

What you should do instead is create a WAF Exception for that specific rule which you identified as blocking Googlebot.

That looks like an Enterprise feature - I’m on the Business plan only.

I see what you mean, but I’m not sure how else I can manage the rule to allow the appropriate bots. Should I add specific IP addresses instead, as a more secure method?

My mistake - I found the Exceptions option under Managed Rules. Unfortunately, it doesn’t look like those settings are configured or active for my site.

The rule causing the 403 issue with Google is a custom rule I created myself.

I am more confused than ever :frowning:

Would you be available to hire for some time to set up appropriate rules within Cloudflare?

It is confusing. The old “WAF (previous version)” is being replaced by the new WAF, but in the process the old rules were first converted into rules of the new kind.

Anyway, since what is blocking Googlebot is a Custom Rule, you should just edit that rule and make sure it has, as one of its conditions, that Known Bots are OFF.

For instance, if your rule says that requests for URI Path equals /some-path should be JS-Challenged, you add a Known Bots OFF condition to it with an AND logical operator.
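
In the Expression Editor, that would look something like this (a sketch using the placeholder path from above; the Known Bots toggle corresponds to the cf.client.bot field, so “Known Bots OFF” is the negated check):

http.request.uri.path eq "/some-path" and not cf.client.bot

That way, the JS Challenge only applies to requests for /some-path that do not come from a verified known bot such as Googlebot.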

It is very good of you to offer all of this support. I am very grateful.

OK, so I have set it up as you directed. I still get a 403 error in Search Console when the rule is active. I have added a screenshot below - does this seem correct?

As you can see in your screenshot, there is a little more spacing between conditions linked with OR than between those linked with AND. That’s because AND binds more tightly than OR: each condition linked with OR before the very last one is evaluated without the condition linked with AND.

If
condition 1 OR
condition 2 OR
condition 3 AND
condition 4
then JS Challenge

Only condition 3 is tied to condition 4. You have two ways to make sure condition 4 is also applied to the previous conditions:

  1. In the Expression Builder, repeating condition 4 for each prior condition linked with OR:

condition 1 AND
condition 4 OR
condition 2 AND
condition 4 OR
condition 3 AND
condition 4

  2. In the Expression Editor, grouping with parentheses:

(condition 1 or condition 2 or condition 3) and (condition 4)
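
With real field names, that grouped version would look something like this (a sketch; the paths are placeholders and the last condition is the Known Bots check from earlier):

(http.request.uri.path eq "/path-one" or http.request.uri.path eq "/path-two" or http.request.uri.path eq "/path-three") and not cf.client.bot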

(BTW, your last condition will never match, because URI Full can never be equal to just free-car-check; it needs to be a full URL to match this field.)

You should familiarize yourself with the expressions, fields, and operators used when creating a Firewall Rule (aka Custom Rule). Those won’t change with the new WAF.


Now this has a much better chance of performing as expected. I hope Googlebot stops complaining, and if it doesn’t, you’d need to check the Security Events again to see if any other service might be inadvertently blocking it.

Yes thank you.

BTW, if I am using URI Full, do I need to include the full root URL string, e.g. mysitedotcom/freecarcheck?

I have read your comments several times and it appears to be so.

You’re welcome.

Using URI Full with equals operator is rarely the best option.

When you use a URI field in the expression builder, it shows an example of what that field matches against.

For the URI Full field, the example is a complete URL, including the scheme, hostname, path, and query string.

That means a URI Full match (using URI Full with the equals operator) will include the query string, if present. So if you match against the full URI with the equals operator, the rule would be easily bypassable by adding a random query string. Instead, you can use the contains operator.

If using contains is not a good option, such as when you want to match a string that is part of your domain name (contains "example" for the domain example.com would match all requests!), you can use other URI fields, such as URI (the same as URI Full but without the “http(s)://” and the hostname), URI Path, or URI Query.

Depending on what you want to accomplish, you may or may not want to have the full content of the field in your rule. A combination of the operator used (equals, not equals, contains, is in, etc.) with the content you provide will determine the match.
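
As a rough illustration of how those fields differ (a sketch; the URL is made up, and the names in parentheses are what the Expression Editor calls the fields):

For a request to https://example.com/free-car-check?utm=1
URI Full (http.request.full_uri) is https://example.com/free-car-check?utm=1
URI (http.request.uri) is /free-car-check?utm=1
URI Path (http.request.uri.path) is /free-car-check
URI Query (http.request.uri.query) is utm=1

So URI Path equals /free-car-check keeps matching even if a random query string is appended, while URI Full equals https://example.com/free-car-check would not.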


I’ve learned a huge amount from your contributions. I am very grateful as I have been struggling so much. Thank you for all of your help.

Here is my final expression, which Google seems to have accepted. The rules are active, and Google can now crawl the site’s pages.

