BUG: AI Audit robots.txt rules incorrectly say a bot is allowed


This robots.txt blocks GPTBot:

User-agent: *
Disallow: /api/cwv.php
Disallow: /api/dummy2.php
Disallow: /wv/
Disallow: /foo/bar
# Disallow: /squashed-egg.txt
Disallow: /*ls/cwv*o*/a


User-agent: AhrefsBot
User-agent: BLEXBot
User-agent: barkrowler
User-agent: Zoominfobot
User-agent: GPTBot
USer-agent: Applebot-Extended
User-agent: AwarioRssBot
User-agent: AwarioSmartBot
Disallow: /

However, the AI Audit panel says GPTBot is allowed (and the same for Applebot-Extended).

Perhaps it is a caching thing, but that would have to be a very, very long cache; otherwise the parser used for robots.txt is not quite right.

Ah, it appears to be a case of the parser not following the RFC 9309 spec. Rewriting the file as follows:

User-agent: *
Disallow: /api/cwv.php
Disallow: /api/dummy2.php
Disallow: /wv/
Disallow: /foo/bar
# Disallow: /squashed-egg.txt
Disallow: /*ls/cwv*o*/a


User-agent: AhrefsBot
User-agent: BLEXBot
User-agent: barkrowler
User-agent: Zoominfobot
User-agent: Applebot-Extended
User-agent: AwarioRssBot
User-agent: AwarioSmartBot
Disallow: /

User-agent: GPTBot
Disallow: /

does lead to the tool detecting GPTBot as blocked. It would be great if it could follow the standard and correctly handle multiple User-agent lines in a single group.
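
For reference, RFC 9309 treats consecutive User-agent lines as one group, and every rule in that group applies to each listed agent. The TypeScript sketch below is only meant to illustrate that grouping; the parseGroups/isDisallowed helpers are hypothetical names, and it deliberately skips Allow, wildcards, and longest-match precedence. Run against the first file above, it reports both GPTBot and Applebot-Extended as disallowed for "/".

type Group = { agents: string[]; disallow: string[] };

function parseGroups(robotsTxt: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;

  for (const raw of robotsTxt.split(/\r?\n/)) {
    const line = raw.replace(/#.*$/, "").trim(); // strip comments
    if (!line) continue;
    const idx = line.indexOf(":");
    if (idx < 0) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();

    if (field === "user-agent") {
      if (current === null || !lastWasAgent) {
        // A User-agent line that does not follow another one starts a new group.
        current = { agents: [], disallow: [] };
        groups.push(current);
      }
      // Consecutive User-agent lines all share the same group.
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (field === "disallow" && current !== null) {
      current.disallow.push(value);
      lastWasAgent = false;
    } else {
      lastWasAgent = false;
    }
  }
  return groups;
}

function isDisallowed(robotsTxt: string, agent: string, path: string): boolean {
  const groups = parseGroups(robotsTxt);
  const wanted = agent.toLowerCase();
  // Prefer a group that names the agent explicitly, otherwise fall back to '*'.
  const specific = groups.find(g => g.agents.includes(wanted));
  const fallback = groups.find(g => g.agents.includes("*"));
  const group = specific ?? fallback;
  if (!group) return false;
  // Simple prefix match only; real parsers also handle Allow and wildcards.
  return group.disallow.some(rule => rule !== "" && path.startsWith(rule));
}

// With the first file above (loaded into a hypothetical firstRobotsTxt string):
//   isDisallowed(firstRobotsTxt, "GPTBot", "/")            -> true (blocked)
//   isDisallowed(firstRobotsTxt, "Applebot-Extended", "/") -> true (blocked)

In other words, a grouping-aware parser reaches the opposite conclusion from what the AI Audit panel currently shows.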

My “AI Audit” is generally empty, although AI bots definitely visit my site. I am on a free plan.
The contents of my robots.txt:

# AI
User-agent: AI2Bot
User-agent: Ai2Bot-Dolma
User-agent: Amazonbot
User-agent: anthropic-ai
User-agent: Applebot
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: Claude-Web
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: cohere-training-data-crawler
User-agent: Crawlspace
User-agent: Diffbot
User-agent: DuckAssistBot
User-agent: FacebookBot
User-agent: FriendlyCrawler
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: GoogleOther-Image
User-agent: GoogleOther-Video
User-agent: GPTBot
User-agent: iaskspider/2.0
User-agent: ICC-Crawler
User-agent: ImagesiftBot
User-agent: img2dataset
User-agent: ISSCyberRiskCrawler
User-agent: Kangaroo Bot
User-agent: Meta-ExternalAgent
User-agent: Meta-ExternalFetcher
User-agent: OAI-SearchBot
User-agent: omgili
User-agent: omgilibot
User-agent: PanguBot
User-agent: PerplexityBot
User-agent: PetalBot
User-agent: Scrapy
User-agent: SemrushBot-OCOB
User-agent: SemrushBot-SWA
User-agent: Sidetrade indexer bot
User-agent: Timpibot
User-agent: VelenPublicWebCrawler
User-agent: Webzio-Extended
User-agent: YouBot

# SEO
User-agent: SemrushBot
User-agent: dotbot
User-agent: MJ12bot
User-agent: ZoomBot

# Other
User-agent: Scrapy
User-agent: Mail.RU_Bot

Disallow: /

Sitemap: https://my-site.com/sitemap.xml

Can you share any logs showing which AI bot(s) are still visiting?

I now use a blocking rule in the WAF, (cf.verified_bot_category in {"AI Crawler" "AI Assistant" "AI Search"}), but even without this rule “AI Audit” was always empty.


Thanks @ivan.chupin.1973

I can confirm this is expected behavior, AFAIK, because AI Audit should only show “passed” or successful requests from AI bots.

Hey Brian_M,

The issue I reported here is that AI Audit wouldn’t detect that Ivan had blocked the AI bots in robots.txt, because it doesn’t handle User-agent grouping. Both Ivan and I had grouped many bots under a single rule, namely Disallow: /.

So you’d have to list each and every bot separately, with its own rule, for the AI Audit interface to detect that you’d blocked it, which goes against how the RFC 9309 spec is defined.

I can kind of see the argument that perhaps these bots don’t interpret robots.txt correctly either. I have no idea whether they all do (or should) respect User-agent grouping, but the idea of the tool seems to be to offer oversight and control over bots that don’t respect correctly formatted robots.txt files. So, in my opinion, it would be a valuable addition to the otherwise handy AI Audit tool if it respected the spec itself and let you monitor and/or block crawlers that don’t.

Hi,
Can you reply to my issue?

Thanks for clarifying. I was not aware of this limitation previously.

There are generally two trains of thought:

  1. Best to cater to the lowest sophistication of crawler
  2. Support what major crawlers do (which generally does include support for “groups” of user-agents).

If you want compatibility today, you can use the more primitive format with a separate directive for each user-agent. Various tools can help generate a robots.txt and/or keep it updated; many are platform-dependent, or you could serve the file dynamically using Workers.
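
For example, a minimal Worker along these lines (module syntax; the BLOCKED_BOTS list, buildRobotsTxt helper, and sitemap URL are illustrative, not a finished implementation) could serve an expanded, per-bot robots.txt:

// Serve /robots.txt with one fully spelled-out group per bot, so even
// parsers that ignore User-agent grouping still see each block.
const BLOCKED_BOTS: string[] = [
  "GPTBot",
  "Applebot-Extended",
  "CCBot",
  "ClaudeBot",
  // ...extend with the rest of your list
];

function buildRobotsTxt(): string {
  const lines: string[] = [];
  for (const bot of BLOCKED_BOTS) {
    lines.push(`User-agent: ${bot}`, "Disallow: /", "");
  }
  lines.push("Sitemap: https://my-site.com/sitemap.xml");
  return lines.join("\n") + "\n";
}

export default {
  async fetch(request: Request): Promise<Response> {
    const url = new URL(request.url);
    if (url.pathname === "/robots.txt") {
      return new Response(buildRobotsTxt(), {
        headers: { "content-type": "text/plain; charset=utf-8" },
      });
    }
    // In practice you would route only /robots.txt to this Worker.
    return new Response("Not found", { status: 404 });
  },
};

Because each bot gets its own group with its own Disallow line, even a parser that only looks at the User-agent line immediately above a rule will still register the block.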

I am not sure about our long-term plans for the AI Audit parsing implementation. I will follow up internally and circle back when/if I get more clarity.
