Verified Bots question


What is the name of the domain?

feeds.deans.lol

What is the error number?

403

What is the issue you’re encountering?

Help setting up verified bots — keep getting error

What steps have you taken to resolve the issue?

I set up my own FreshRSS instance. In doing so, I discovered the great aggravation of Cloudflare verified bots and such. Sigh. I miss the old days, before all this nonsense was invented to aggravate people…

Anyway, all I want to do is grab the feeds I already read in Feedly. Those succeed because Feedly is a verified bot. Most sites’ feeds work; those that use CF do not. In the process I discovered it’s the whole “verify you’re human” bot-check nonsense that no one, no one, likes. RSS is meant to be grabbed. Why is a site’s RSS feed locked away? But I digress.

Apparently I should apply as a verified bot. It seems mostly straightforward but I keep getting an error when I submit the form:
match pattern not matching user-agents

I’m not giving the form what it wants. I can’t quite figure out what that is. So I was looking for any guidance on what I’m supposed to do. I tried a few things.

The example is just Googlebot, so I tried my variation of that with no luck, and a few combinations around that idea. FreshRSS looks to send:
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0)

but the form doesn’t like that either. I always get the match pattern error, so I’m doing it wrong. Any help on giving the form what it wants would be greatly appreciated.
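For what it’s worth, one way I could double-check what my instance actually sends is a throwaway echo server; a minimal sketch using Python’s standard library (the port and class name are arbitrary, nothing the form requires):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    class EchoUserAgent(BaseHTTPRequestHandler):
        def do_GET(self):
            # Print whatever User-Agent header the client sent.
            print("User-Agent:", self.headers.get("User-Agent"))
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"ok\n")

    # Point the feed reader at http://127.0.0.1:8080/ and watch the console.
    HTTPServer(("127.0.0.1", 8080), EchoUserAgent).serve_forever()

Pointing FreshRSS at that address and triggering a refresh should print the exact string it sends.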

&&

That most likely won’t work for you if you’re just fetching a few RSS feeds.

See: Verified bots policy · Cloudflare bot solutions docs

NOTE: It will be completely up to the individual website owner whether they will allow verified bots to pass through, … or not.

In other words, being a Verified Bot does NOT guarantee you access.

When that happens, it happens at the website owner’s request, and isn’t something that Cloudflare randomly adds here and there.

One reason could be that the website owner had a large amount of spam bots registering new user accounts on their website. To combat that, they added some extra-tight security settings (e.g. WAF rules), but in the kind of “rush” they were in to fix the problem at hand, they failed to realise that the rules they added would also hit the RSS feeds negatively.

And since the website owner sees that the new WAF rules helped combat the spam bots that were registering new user accounts, the website owner is now happy, as they solved the problem they had.
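If you want to see whether a given feed is being caught by something like that, checking the status code it returns is a quick test; a minimal sketch, with a placeholder URL and user-agent:

    import urllib.error
    import urllib.request

    req = urllib.request.Request(
        "https://example.com/feed.xml",              # placeholder feed URL
        headers={"User-Agent": "MyFeedReader/1.0"},  # placeholder user-agent
    )
    try:
        with urllib.request.urlopen(req) as resp:
            print("allowed, status:", resp.status)
    except urllib.error.HTTPError as err:
        print("blocked, status:", err.code)  # e.g. 403 when a rule fires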

If you want further details than that, or to complain about being blocked on a specific website, you will need to talk to the website owner.


Well, so I’m coming from Feedly. If I look in Cloudflare’s list, they are a verified bot, and from my research that’s how they, among many others, get through. So if Feedly works on the sites where I’m getting Cloudflare blocks, then once I’m verified, I should work too.

The issue is that I can try to get this fixed in one spot - here at Cloudflare, the ultimate source of the problem. This would also future-proof me, in the sense that if I add other sites that also use CF, I need not do anything else. Verified Bots would, I think, be basically a one-stop solution overall.

Or, I can contact many different websites, and out of, say, 50, maybe 1 will reply. The others will ignore me or reply “lol, you’re not Google, so we don’t give a flip, goodbye.” So really, being just me, for myself, I will not ever end up in a fixed state. I will get maybe one or two sites to fix their stuff, but no others.

Yes, website maintainers are the ones doing the configs, sure, but CF ultimately lets the misconfiguration happen. They could put pressure on the site owners to configure things correctly, or give them the info as a gentle poke, or code their stuff to ignore RSS feeds. If website owners are en masse doing the wrong thing, CF could ultimately fix that by not letting it happen.

They are, for their infrastructure.

When you are setting up a self-hosted variant of their product, as you mentioned, that is not a part of their infrastructure.

There’s a possibility, depending on the website owner’s configuration, but as said above, there is no guarantee.

Again, Cloudflare isn’t the source of the problem. The website owner is.

What if you’re running a piece of RSS-fetching software (it doesn’t have to be the self-hosted Feedly you mentioned; it could be whatever) that is spamming my Cloudflare zone with requests and ends up overloading my server behind Cloudflare, because the software you ran wasn’t developed well enough?

That kind of thing has been seen many times over the years, with many different kinds of applications, ranging everywhere from developers who aren’t always doing the sane thing, to strange software glitches, and so forth.

So imagine that my RSS feed (or cache policies, whatever) told you to only retry fetching after an hour, but your instance ignored that and was sending me 1500 queries per second.

Now multiply those 1500 queries per second by the thousands of different instances across the world that other people have set up…

Shouldn’t I, as the website owner (and RSS feed owner), be able to defend against something like that?
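For what it’s worth, a well-behaved fetcher would honor those hints; a minimal sketch, with a placeholder URL and deliberately simplistic header parsing:

    import re
    import time
    import urllib.request

    FEED_URL = "https://example.com/feed.xml"  # placeholder feed URL

    def next_poll_delay(headers, default=3600):
        # Prefer an explicit Retry-After (seconds form only, in this sketch),
        # then a Cache-Control max-age, then fall back to once an hour.
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            return int(retry_after)
        match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
        if match:
            return int(match.group(1))
        return default

    while True:
        with urllib.request.urlopen(FEED_URL) as resp:
            body = resp.read()
            delay = next_poll_delay(resp.headers)
        print(f"fetched {len(body)} bytes, next poll in {delay}s")
        time.sleep(delay)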

We now have two examples:

  1. Administrative negligence from the website owner.

  2. The website owner purposefully trying to get rid of queries that match the pattern of your queries.

How can Cloudflare know for sure whether it is #1 or #2 that applies in a given situation, when the website owner has set up a WAF rule?

Knowing for sure which one of them applies would be mandatory for the statement “CF could fix that” to be valid.

Well, I mentioned Feedly because A) that’s what I’m moving off of, and B) people will know that name and what its purpose is better than what I’m using, which is FreshRSS. I am not setting up a self-hosted version of Feedly itself, as I don’t believe they share their code or anything of that nature.

I mentioned Feedly is on the verified bots list https://radar.cloudflare.com/traffic/verified-bots since they are, and my research into this indicated that for CF WAF sites, being a verified bot basically lets their traffic through. Which is what I had read in a few different spots. The gist was “You can contact every site and hope they fix it, or try to get on CF’s verified bots program.” The latter is much better for me because I don’t have to contact multiple sites.

An example: I tried to talk to a site’s admins today. Let me tell you, the confusion and lack of understanding of what RSS is and why grabbing feed data shouldn’t be blocked was terrible. I don’t believe the guys I talked to understood at all. That site is going nowhere, I already know it. It simply won’t ever work for me. 98% of the sites I would want to reach out to will be the same. I will basically never get most CF-blocked sites to work going that route.

So if CF can design their WAF so big companies’ bots can work, then it would stretch the imagination to say they cannot allow little guys like me to work. I mean, they already have a system set up for that: be a verified bot. That’s all I wanted to get help with.

I don’t care who is at fault. I say both CF and the website guys; saying one or the other has no responsibility at all is disingenuous. CF can let me work, with no extra work on their part, since they already have a system that I can use - verified bots. This was obviously built for the likes of Google or Apple or whatnot, who have enough money that CF wants their stuff to work well. The work is done. It exists. No dev time needs to be used here. It’s in place.

Website owners probably do also configure stuff wrong. I doubt they’re getting taught fully how to set things up. Maybe that’s on partner resellers? Maybe they work with CF directly on installation and setup? I don’t know. Doesn’t really matter. I’m sure they are doing something wrong too and yep, that’s on them partially.

But is it more logical to do a one-stop thing (verified bots), be done, and future-proof any sites that I may want to add to my feeds? Or is it better to reach out to many, many sites and go through the same routine each time, when mostly none of them will ever do anything?

And true, any software at all, including from bigger companies like Google, can and DOES have glitches. FreshRSS has been around long enough, and is developed enough, that they pretty well know what they’re doing for the most part, and issues will get fixed in a decent timeframe. Sure, any and all software can and will have bugs. It’s a 100% guarantee from every single company or project out there.

All I wanted in this was a little help with a question. I was trying to fill out the form, and I could not find what it wanted for the user-agent without it throwing a complaint and refusing to submit the form. So, the original question: what does it want there?

All I’m asking for is some info so I can use a tool that already exists, go off on my happy little way, and live life with my self-hosted RSS feed reader. Who is or isn’t at fault doesn’t matter. There’s a tool that is already developed and in production. It’s ready to be used. It is being used. I just need to submit a form - a form that hates whatever I was putting in - so knowing what it wants is all I need.

Correction: because I may not have to, or because I may eventually have to contact fewer sites.

I completely understand your goal regarding the Verified Bots part.

Note: I don’t think it’s more logical to do what appears to be completely impossible, before you even try it.

When I personally look at the policy for becoming a Verified Bot, I don’t see that you will ever become a verified bot, the way you’re explaining your operations.

That policy literally denies the possibility of the “one-stop thing” that you’re actually looking for.

Browsers will identify themselves when they request a website (or its assets).

A browser like Firefox may, for example, send a User-Agent string like this:

  • Mozilla/5.0 (X11; Linux i686; rv:136.0) Gecko/20100101 Firefox/136.0

Whereas Chrome may, for example, send a User-Agent string like this:

  • Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36

The form is literally asking you: what User-Agent string(s) are you sending, to identify your bot(s) uniquely from others?

If your application is sending “User-Agent: RSSBot/jonnyabu-v1.2.3 (contact [email protected] in case of trouble)” to Cloudflare (or to RSS feeds in general, while fetching them), then the form wants you to let them know that this is the string you’re sending.
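Assuming the form’s match pattern is a regular expression, which is how the error reads, one quick sanity check is to confirm the pattern actually matches every user-agent string you list; a minimal sketch, with made-up values:

    import re

    # Made-up values; substitute whatever you intend to put on the form.
    user_agent = "RSSBot/jonnyabu-v1.2.3 (contact bot@example.com in case of trouble)"
    pattern = r"^RSSBot/jonnyabu"  # must match every user-agent you list

    if re.search(pattern, user_agent):
        print("pattern matches this user-agent")
    else:
        print("no match - the form will likely reject this combination")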

I do hope this helps you regarding the User-Agent.

&&

You ALSO need to get accepted, which is where my scepticism is.

Awesome. Yeah, I had put in the user-agent string my computer sends, but it seemed not to like that. FreshRSS has the capability to send whatever I want as the user-agent, so it can be anything whatsoever. I will play around with variations of your suggestion, as that is helpful. I appreciate that.
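For anyone following along, the general idea (independent of FreshRSS’s own settings, which I won’t reproduce here) is to keep the string you send and the string you declare on the form in one place, so they can never drift apart; a rough sketch with placeholder values:

    import urllib.request

    # Placeholder: one constant for both the header you send and the
    # string you declare on the verified-bots form.
    DECLARED_UA = "MyRSSBot/1.0 (+https://example.com/bot-info)"

    req = urllib.request.Request(
        "https://example.com/feed.xml",  # placeholder feed URL
        headers={"User-Agent": DECLARED_UA},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.headers.get("Content-Type"))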

I certainly shall hope they approve it. It would be a terrible thing, and a terrible look, if they’re basically gonna say, “Nope, only those with enough money to grease our wheels get to get past our gate.” That, I guess I’d say, would be unethical. I’m not saying CF is; I don’t know. Just saying, I shall hope not!