Top 50 user agents to block

Has anyone developed a list of the top 50 user agents to block, and would you be willing to share it?

Here’s a list from the perishablepress.com 7G .htaccess firewall:

(360Spider|acapbot|acoonbot|ahrefs|alexibot|asterias|attackbot|backdorbot|becomebot|binlar|blackwidow|blekkobot|blexbot|blowfish|bullseye|bunnys|butterfly|careerbot|casper|checkpriv|cheesebot|cherrypick|chinaclaw|choppy|clshttp|cmsworld|copernic|copyrightcheck|cosmos|crescent|cy_cho|datacha|demon|diavol|discobot|dittospyder|dotbot|dotnetdotcom|dumbot|emailcollector|emailsiphon|emailwolf|exabot|extract|eyenetie|feedfinder|flaming|flashget|flicky|foobot|g00g1e|getright|gigabot|go-ahead-got|gozilla|grabnet|grafula|harvest|heritrix|httrack|icarus6j|jetbot|jetcar|jikespider|kmccrew|leechftp|libweb|linkextractor|linkscan|linkwalker|loader|masscan|miner|majestic|mechanize|mj12bot|morfeus|moveoverbot|netmechanic|netspider|nicerspro|nikto|ninja|nutch|octopus|pagegrabber|planetwork|postrank|proximic|purebot|pycurl|python|queryn|queryseeker|radian6|radiation|realdownload|rogerbot|scooter|seekerspider|semalt|siclab|sindice|sistrix|sitebot|siteexplorer|sitesnagger|skygrid|smartdownload|snoopy|sosospider|spankbot|spbot|sqlmap|stackrambler|stripper|sucker|surftbot|sux0r|suzukacz|suzuran|takeout|teleport|telesoft|true_robots|turingos|turnit|vampire|vikspider|voideye|webleacher|webreaper|webstripper|webvac|webviewer|webwhacker|winhttp|wwwoffle|woxbot|xaldon|xxxyy|yamanalab|yioopbot|youda|zeus|zmeu|zune|zyborg)

Good list, thanks. I have deployed it but removed "python" and "demon" (those entries seem to block some RSS feed readers, YMMV).
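To illustrate why a short token like "python" over-blocks, here is a minimal Python sketch. It assumes the list is applied as a case-insensitive substring match, uses only a handful of entries from the list above, and the feed reader user agent is made up for the example:

```python
import re

# Abbreviated subset of the list above; the full alternation behaves the same way.
# Assumption: the firewall applies it as a case-insensitive substring match.
PATTERN_FULL = re.compile(r"(ahrefs|demon|httrack|mj12bot|python|sqlmap)", re.IGNORECASE)
PATTERN_TRIMMED = re.compile(r"(ahrefs|httrack|mj12bot|sqlmap)", re.IGNORECASE)


def is_blocked(user_agent: str, pattern: re.Pattern) -> bool:
    """Return True if any listed token appears anywhere in the user agent."""
    return pattern.search(user_agent) is not None


# Hypothetical feed reader built on a Python HTTP library.
feed_reader = "ExampleFeedReader/1.0 python-requests/2.31.0"
# A crawler-style user agent containing one of the listed tokens.
crawler = "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"

for name, ua in [("feed reader", feed_reader), ("crawler", crawler)]:
    print(name, "full:", is_blocked(ua, PATTERN_FULL),
          "trimmed:", is_blocked(ua, PATTERN_TRIMMED))
```

With the full list the feed reader is caught by "python"; with the trimmed list it passes while the crawler is still blocked.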

What I also have in place is this:

(http.user_agent contains "SemrushBot") or (http.user_agent contains "AhrefsBot") or (http.user_agent contains "DotBot") or (http.user_agent contains "WhatCMS") or (http.user_agent contains "Rogerbot") or (http.user_agent contains "trendictionbot") or (http.user_agent contains "BLEXBot") or (http.user_agent contains "linkfluence") or (http.user_agent contains "magpie-crawler") or (http.user_agent contains "MJ12bot") or (http.user_agent contains "Mediatoolkitbot") or (http.user_agent contains "AspiegelBot") or (http.user_agent contains "DomainStatsBot") or (http.user_agent contains "Cincraw") or (http.user_agent contains "Nimbostratus") or (http.user_agent contains "HTTrack") or (http.user_agent contains "serpstatbot") or (http.user_agent contains "omgili") or (http.user_agent contains "GrapeshotCrawler") or (http.user_agent contains "MegaIndex") or (http.user_agent contains "PetalBot") or (http.user_agent contains "Semanticbot") or (http.user_agent contains "Cocolyzebot") or (http.user_agent contains "DomCopBot") or (http.user_agent contains "Traackr") or (http.user_agent contains "BomboraBot") or (http.user_agent contains "Linguee") or (http.user_agent contains "webtechbot") or (http.user_agent contains "Clickagy") or (http.user_agent contains "sqlmap") or (http.user_agent contains "Internet-structure-research-project-bot") or (http.user_agent contains "Seekport") or (http.user_agent contains "AwarioSmartBot") or (http.user_agent contains "OnalyticaBot") or (http.user_agent contains "Buck") or (http.user_agent contains "Riddler") or (http.user_agent contains "SBL-BOT") or (http.user_agent contains "DF Bot 1.0") or (http.user_agent contains "PubMatic Crawler Bot") or (http.user_agent contains "BVBot") or (http.user_agent contains "Sogou") or (http.user_agent contains "Barkrowler")

This list blocks about 20k requests daily (up to 50k on some days). Note that some entries on this list block SEO services (no big deal if you are not the one requesting scans) and some social media monitoring services.

A company that keeps a presence on our forums mentioned that their CRM-based monitoring tool had stopped providing reports, and the company providing the service wouldn’t disclose the bot name because it is a “trade secret”… Well, if their “trade secret” costs me money in terms of server and network resources, I don’t see why I should let them make money by selling data harvested from my services without a good justification.

Thanks, I had seen his list of 1200+, which was a bit overwhelming. This seems more manageable.

Sorry for the double posting - just realised it would be a lot safer to use lowercase in all tests so my rule (with a couple of new bots) would be:

(lower(http.user_agent) contains "appinsights") or (lower(http.user_agent) contains "semrushbot") or (lower(http.user_agent) contains "ahrefsbot") or (lower(http.user_agent) contains "dotbot") or (lower(http.user_agent) contains "whatcms") or (lower(http.user_agent) contains "rogerbot") or (lower(http.user_agent) contains "trendictionbot") or (lower(http.user_agent) contains "blexbot") or (lower(http.user_agent) contains "linkfluence") or (lower(http.user_agent) contains "magpie-crawler") or (lower(http.user_agent) contains "mj12bot") or (lower(http.user_agent) contains "mediatoolkitbot") or (lower(http.user_agent) contains "aspiegelbot") or (lower(http.user_agent) contains "domainstatsbot") or (lower(http.user_agent) contains "cincraw") or (lower(http.user_agent) contains "nimbostratus") or (lower(http.user_agent) contains "httrack") or (lower(http.user_agent) contains "serpstatbot") or (lower(http.user_agent) contains "omgili") or (lower(http.user_agent) contains "grapeshotcrawler") or (lower(http.user_agent) contains "megaindex") or (lower(http.user_agent) contains "petalbot") or (lower(http.user_agent) contains "semanticbot") or (lower(http.user_agent) contains "cocolyzebot") or (lower(http.user_agent) contains "domcopbot") or (lower(http.user_agent) contains "traackr") or (lower(http.user_agent) contains "bomborabot") or (lower(http.user_agent) contains "linguee") or (lower(http.user_agent) contains "webtechbot") or (lower(http.user_agent) contains "clickagy") or (lower(http.user_agent) contains "sqlmap") or (lower(http.user_agent) contains "internet-structure-research-project-bot") or (lower(http.user_agent) contains "seekport") or (lower(http.user_agent) contains "awariosmartbot") or (lower(http.user_agent) contains "onalyticabot") or (lower(http.user_agent) contains "buck") or (lower(http.user_agent) contains "riddler") or (lower(http.user_agent) contains "sbl-bot") or (lower(http.user_agent) contains "df bot 1.0") or (lower(http.user_agent) contains "pubmatic crawler bot") or (lower(http.user_agent) contains "bvbot") or (lower(http.user_agent) contains "sogou") or (lower(http.user_agent) contains "barkrowler") or (lower(http.user_agent) contains "admantx") or (lower(http.user_agent) contains "adbeat") or (lower(http.user_agent) contains "embed.ly") or (lower(http.user_agent) contains "semantic-visions") or (lower(http.user_agent) contains "voluumdsp") or (lower(http.user_agent) contains "wc-test-dev-bot") or (lower(http.user_agent) contains "gulperbot")
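A rule this long is tedious to edit by hand, so one option is to generate it from a plain list of bot names. The Python sketch below is only an illustration: the helper name build_expression and the short sample list are made up, and it assumes no bot name contains a double quote or backslash. It lowercases each name, drops duplicate entries, and emits straight quotes so the result can be pasted into the expression editor:

```python
def build_expression(names):
    """Build a lowercase 'contains' expression from a list of bot names.

    Assumption: names contain no double quotes or backslashes, so no
    escaping is attempted.
    """
    seen = set()
    clauses = []
    for name in names:
        needle = name.strip().lower()
        if not needle or needle in seen:
            continue  # skip blanks and repeated entries
        seen.add(needle)
        clauses.append(f'(lower(http.user_agent) contains "{needle}")')
    return " or ".join(clauses)


# Short sample list; the real list would hold every name from the rule above.
bots = ["SemrushBot", "AhrefsBot", "DomainStatsBot", "domainstatsbot", "DF Bot 1.0"]
expression = build_expression(bots)
print(expression)
```

Keeping the names in a list also makes it easy to remove an entry again if it turns out to block something legitimate.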

I’ve not tried the “lower” syntax. Does it work in free plans as well?

Yes, it does. I have it on my Pro plan but also copied the same rule to my personal site on the Free plan.
