Block if URL contains "x" word

Hi. I want to block spiders/crawlers, etc., from visiting specific pages of the site which contain a specific word. How can I achieve this with a URL string? For example:
www.mysite.com (document)

In other words, block access to the urls which contain “document” in the URL name. For example:
www.mysite.com/s/abc-document
www.mysite.com/s/12345document
www.mysite.com/s/345-document

To achieve this, you would need to use Cloudflare Firewall Rules.

I recommend reading the articles below to better understand Firewall Rules and how to configure them:

There are many ways to identify bots, such as by User-Agent or other criteria.
Usually a crawler identifies itself with something like “python-requests”, or it is an SEO spider such as Semrush.

Moreover, at first sight, it would be something in combination with:

(http.request.uri.path contains "-document" and http.user_agent contains "python") or
(http.request.uri.path contains "-document" and http.user_agent contains "MJ12bot") or
(http.request.uri.path contains "-document" and http.user_agent contains "Java") or
(http.request.uri.path contains "-document" and http.user_agent contains "wget") or
(http.request.uri.path contains "-document" and http.user_agent contains "curl") or
(http.request.uri.path contains "-document" and http.user_agent contains "[email protected]") or
(http.request.uri.path contains "-document" and http.user_agent contains "dotbot") or
(http.request.uri.path contains "-document" and http.user_agent contains "rogerbot") or
(http.request.uri.path contains "-document" and http.user_agent contains "SemrushBot") or
(http.request.uri.path contains "-document" and http.user_agent contains "Ahrefs") or
(http.request.uri.path contains "-document" and http.user_agent contains "RavenCrawler") or
(http.request.uri.path contains "-document" and http.user_agent contains "Screaming") or
(http.request.uri.path contains "-document" and http.user_agent contains "crawl") or
(http.request.uri.path contains "-document" and http.user_agent contains "spider") or
(http.request.uri.path contains "-document" and http.user_agent contains "backlink") or
(http.request.uri.path contains "-document" and http.user_agent contains "moz.com") or
(http.request.uri.path contains "-document" and http.user_agent contains "CriteoBot") or
(http.request.uri.path contains "-document" and http.user_agent contains "SEO") or
(http.request.uri.path contains "-document" and http.user_agent contains "go-http-client" and not cf.client.bot)

Just need to re-check …

But this would still allow them to access other parts of your Website.

Thank you for this thorough post. A couple of follow-up questions: do I need to update the first part of the string with the actual site URL? In other words, do I need to update this part?
“http.request.uri.path”

Second question, I see you wrote “-document” with a hyphen before the word. Can you tell me why this is necessary?

Thanks again for your help. I have been dealing with this issue for weeks.

The Firewall Rule looks for requests where the URL a bot tries to access contains “-document”, just as you wrote in your first post.
There is no need to add the URL of your site. The full URL is not needed, only the part of it that contains “-document”; if a bot tries to access such a URL, it will be blocked.
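To illustrate why the site URL is not needed, here is a small Python sketch of the idea, not Cloudflare's actual implementation. It assumes that http.request.uri.path corresponds to the path component of the URL; the helper name path_contains is made up for this example.

```python
from urllib.parse import urlsplit


def path_contains(url: str, word: str) -> bool:
    """Return True if the path component of `url` contains `word`.

    Hypothetical helper that mimics `http.request.uri.path contains "word"`.
    """
    # Add a scheme if it is missing so urlsplit separates host from path.
    if "://" not in url:
        url = "https://" + url
    return word in urlsplit(url).path


urls = [
    "www.mysite.com/s/abc-document",
    "www.mysite.com/s/12345document",
    "www.mysite.com/s/345-document",
]

for u in urls:
    # Compare matching on "-document" against matching on plain "document".
    print(u, path_contains(u, "-document"), path_contains(u, "document"))
```

In this sketch the second URL only matches the plain word “document”, because it has no hyphen before it. The host name is never part of the comparison.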

You had given the examples as:

So I assumed your documents would have a “-document” suffix at the end, and that way you could identify them and block the bots from accessing them.
Is the “-document” suffix used for every document file, or how does your app work?

Any real life example? URL, or domain name to test it out?

  1. Got it. Thanks!
  2. It’s not always -document, sometimes it may be just “document” so I understand what I need to do here (just include “document”).

I have a follow-up question for you. In reality it is not just the word “document” but maybe 10 different words (document, pdf, resource, file, etc.). Is it possible to include all of these in the same line so that I don’t need to create potentially 100 different rules (one for each word)? For example, can these words be separated by a comma, or what is the proper format to separate the words in the same line?

Thanks again in advance. This has been a real headache for some time and I have a good feeling that your solution will take care of this problem. Once I understand if the words can be separated I will test out the solution.

Just what I thought, regarding the document (file) extension.
Yes, it is possible.

In that case, it would then be like:

(http.request.uri.path contains ".pdf") or (http.request.uri.path contains ".xlsx") or (http.request.uri.path contains ".docx") ...

In combination with a bot name.

Thank you for this. It seems I would need to repeat a new line for each document type though, correct? If so, I would have 10 http.request.uri.path contains “x” for each document type and bot type. I guess what I was trying to ask is if there is a way to have all the document types within one single line so that I don’t need a separate http.request.uri.path for each document type and bot type. Otherwise, assuming there are 10 different document types and 10 bot types, I would end up with 100 lines/variables. Whereas if I could put the different document types or matching “words” within a single line, I would only need 10 lines/variables total.

I am afraid you would need to have multiple lines.

Yes, it is possible using regular expressions in Firewall Rules, with the matches operator instead of contains and a value like "(Ahrefs|CriteoBot|SEO)" for bots or "(pdf|docx|xlsx)" for files/documents (or something similar to account for the dot before the file extension; that would need to be re-checked with an online regex tool to be sure it catches exactly what you want). I believe it works, but it requires a paid Business plan.
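As a rough way to re-check such alternation patterns before putting them in a rule, here is a Python sketch using the standard re module. Cloudflare uses its own regex engine, so treat this only as an approximation. The patterns, the would_match helper, and the sample user agent string are assumptions for illustration. Note that a leading ^ anchors the match to the start of the value, which is usually not wanted for a user agent or a file extension.

```python
import re

# Hypothetical patterns: no leading ^, so they can match anywhere in the value.
BOT_PATTERN = re.compile(r"(Ahrefs|CriteoBot|SEO)")
# Escaped dot plus $ so only a real file extension at the end of the path matches.
FILE_PATTERN = re.compile(r"\.(pdf|docx|xlsx)$")


def would_match(path: str, user_agent: str) -> bool:
    """Return True if both the path and the user agent match their pattern."""
    return bool(FILE_PATTERN.search(path) and BOT_PATTERN.search(user_agent))


# Illustrative user agent string, not taken from real traffic logs.
sample_ua = "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"

# With a leading ^ the bot name is not found, because the value starts with "Mozilla".
print(re.search(r"^(Ahrefs|CriteoBot|SEO)", sample_ua))

print(would_match("/s/report.pdf", sample_ua))
print(would_match("/s/pdf-guide", sample_ua))
```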

The Firewall Rules language supports parentheses ( ( , ) ) as grouping symbols. Grouping symbols allow you to organize expressions, enforce precedence, and nest expressions.
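Grouping is what keeps the number of terms down: one group for the words in the path, one group for the bot names, joined with and. Below is a minimal Python sketch that generates such an expression from two lists. The helper names any_contains and build_rule and the shortened lists are assumptions for illustration, and the generated expression has not been verified against a live zone, so re-check it in the dashboard before deploying.

```python
def any_contains(field: str, values: list[str]) -> str:
    """Join `field contains "value"` terms with `or`, wrapped in parentheses."""
    terms = [f'{field} contains "{v}"' for v in values]
    return "(" + " or ".join(terms) + ")"


def build_rule(words: list[str], bots: list[str]) -> str:
    """Build one expression: (any word in the path) and (any bot in the user agent)."""
    return (
        any_contains("http.request.uri.path", words)
        + " and "
        + any_contains("http.user_agent", bots)
    )


# Example lists; replace them with your own words and bot names.
words = ["document", "pdf", "resource", "file"]
bots = ["python", "MJ12bot", "SemrushBot", "Ahrefs"]

print(build_rule(words, bots))
```

With 10 words and 10 bot names this produces 20 contains terms in a single expression, instead of 100 word-and-bot pairs.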

Wow, thanks for all of this information. It is extremely helpful. I will digest it and test out some possible solutions but, one way or another, either one of the methods will help keep these darn bots away or reduce their presence.

Thanks again and have a great rest of the week.

I am happy to assist you :wink:

Hi…again. Having an issue. I am using Selenium Automation with Chrome to run a process on my own site but Cloudflare is blocking access once Selenium tries to log in (it says “checking your browser”…). I added the IP from which Selenium is accessing the site and the exact URL also and selected “allow” but Cloudflare keeps running its check because it sees that the browser is being run with automation. Any ideas how I can manage to stop Cloudflare from blocking my automation program?
