Running a "good" crawler

Hello,
I’d like to generate link previews in my app for user-submitted links, the same way Twitter and other social websites do.

As soon as I request any page on a website that uses Cloudflare, however, I am blocked and served a JavaScript challenge.

Of course, I could bypass that easily with one of the many libraries available for this very task, but I would like to play by the rules and write a proper crawler.

I tried changing the IP I’m requesting from (my 4G hotspot, my home connection and a couple of different servers), but every one of them leads me to the same challenge page.

I set my User-Agent correctly, “mypreview-bot/1.2 (+https://mywebsite/bot.html)”, and I’ve even tried to state my intentions by adding “(like TwitterBot)” [the same thing Telegram bots do]. I always read and obey the robots.txt file and make no more than one request per minute to any given website.
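For anyone reading along, here is a rough sketch of the two rules described above (obey robots.txt, at most one request per minute per website). It is only an illustration under assumptions: `PoliteGate`, `load_robots` and `may_fetch` are hypothetical names, the robots.txt body is passed in as text rather than downloaded, and nothing here affects whether Cloudflare serves a challenge.

```python
import time
from urllib import robotparser
from urllib.parse import urlsplit

# User-Agent string quoted in the post above.
USER_AGENT = "mypreview-bot/1.2 (+https://mywebsite/bot.html)"
MIN_DELAY_SECONDS = 60.0  # "no more than one request per minute"


class PoliteGate:
    """Hypothetical helper: decides whether a URL may be fetched right now."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.rules = {}         # host -> parsed robots.txt rules
        self.last_request = {}  # host -> time of the last allowed request

    def load_robots(self, host, robots_txt):
        """Parse the text of a host's robots.txt (fetched elsewhere)."""
        parser = robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def may_fetch(self, url):
        host = urlsplit(url).netloc
        parser = self.rules.get(host)
        # Assumption: refuse if robots.txt has not been loaded for this host.
        if parser is None or not parser.can_fetch(USER_AGENT, url):
            return False
        now = self.clock()
        last = self.last_request.get(host)
        if last is not None and now - last < self.min_delay:
            return False  # too soon since the previous request to this host
        self.last_request[host] = now
        return True
```

The caller would check `gate.may_fetch(url)` before each request and skip or postpone the URL when it returns `False`.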

So, what am I doing wrong? The websites don’t even have “I’m Under Attack” mode enabled, so I guess Cloudflare just hates me.
The request is very simple:

**Host:** thewebsite
**Connection:** close
**Accept-Encoding:** gzip
**User-Agent:** mypreview-bot/1.2 (+https://website/bot.html)
**Accept:** */*

Am I doing something wrong with the headers? Do I have to apply somewhere for approval? Or should I just resign myself to running cfscrape like all the malicious bots do?

Thanks


Thank you very much!

I’d like to verify my crawlers using a list of IPs (“downloader”).
Is there a format specification to follow, or would a text file with one IP range in CIDR notation (like 192.168.32.0/24) on every line suffice?

I think this is sufficient.
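In case it helps, a quick way to sanity-check such a file before publishing it is to parse it with Python’s standard `ipaddress` module. This is only a sketch: the one-range-per-line layout is the one proposed above, `parse_ip_list` and `is_listed` are hypothetical names, and tolerating blank lines and `#` comments is my own assumption, not a documented format.

```python
import ipaddress


def parse_ip_list(text):
    """Parse one address or CIDR range per line; raises ValueError on a bad line."""
    networks = []
    for raw_line in text.splitlines():
        # Assumption: anything after '#' is a comment; blank lines are skipped.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        # strict=False also accepts entries with host bits set, e.g. 192.168.32.5/24.
        networks.append(ipaddress.ip_network(line, strict=False))
    return networks


def is_listed(address, networks):
    """Return True if the address falls inside any of the listed ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in networks)
```

For example, `is_listed("192.168.32.17", parse_ip_list("192.168.32.0/24\n"))` checks whether a crawler address is covered by the published ranges.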


Thank you, everybody. I’ve submitted the application for my bot.

Will I get an email whether the request is accepted or rejected? How long does it usually take?
