Twitterbot sends a flood of traffic

I’ve been following this topic for a while and noticed that it is a common issue, with no input from the Twitter team so far. So I thought I’d share it again and get some feedback and ideas from the community here.

Twitterbot/1.0 keeps sending a flood of traffic to certain pages, causing an extreme spike in requests. Since I use Twitter Cards, blocking Twitterbot/1.0 is not an ideal solution.


Related thread on Cloudflare community

Other threads on Twitter community

https://twittercommunity.com/t/web-scraping-crawling-in-my-site-from-twitterbot-1-0/150273

https://twittercommunity.com/t/high-cpu-when-twitter-bot-visits-my-site/146637

https://twittercommunity.com/t/ddos-levels-of-website-traffic-generated-bytwitterbot/56831

https://twittercommunity.com/t/twitterbot-1-0-spam-ddos/16550

https://twittercommunity.com/t/twitter-bots-overwhelming-website/29273


Summary

Consequently, the only solution I ended up with is a Rate Limiting rule that blocks all requests exceeding a threshold of X requests in Y seconds. This is a temporary solution: any time Twitterbot/1.0 floods my origin with requests, I block it for a few minutes. It works, but Twitter Cards stop working for as long as the requests are rate-limited.
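For readers who want to see the logic of such a rule outside of Cloudflare, here is a minimal sketch of "block a client for a few minutes once it exceeds X requests in Y seconds". It is not how Cloudflare implements Rate Limiting; the class name `SlidingWindowLimiter` and its parameters are hypothetical and only illustrate the idea.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Hypothetical helper: block a client for `block_seconds` once it
    sends more than `threshold` requests within `window_seconds`."""

    def __init__(self, threshold, window_seconds, block_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.block_seconds = block_seconds
        self._hits = defaultdict(deque)  # client key -> timestamps of recent requests
        self._blocked_until = {}         # client key -> time at which the block expires

    def allow(self, key, now=None):
        """Return True if the request may pass, False if it is rate-limited."""
        now = time.monotonic() if now is None else now

        # Still inside a block period: reject without counting the request.
        if self._blocked_until.get(key, 0.0) > now:
            return False

        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()

        hits.append(now)
        if len(hits) > self.threshold:
            # Threshold exceeded: start a block period and reset the counter.
            self._blocked_until[key] = now + self.block_seconds
            hits.clear()
            return False
        return True
```

The key could be the user agent, the client IP, or both. As described above, the drawback stays the same whatever the key: while the block is active, legitimate Twitter Card fetches are rejected too.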


Hint

I’ve seen the below suggestion on one of the forums and I tried doing the same but I can still see blocked requests in my rate-limiting graph.

For anyone wondering, I found the solution. Via Cloudflare we use javascript challenges, and Twitterbot was not getting through the JS challenge and for some reason kept hammering the same pages over and over. Once Twitterbot was set to be allowed and not challenged, it stopped hammering our service.

According to Cloudflare’s traffic sequence, after applying the above solution the requests bypass “Firewall Rules” but are then rate-limited at the “Rate Limiting” stage.


[update]

Below is another new example of how Twitterbot/1.0 hammers the origin with requests. 12.21K concurrent requests in less than 5 minutes!


Any suggestions or feedback? I’m starting to believe that this is an issue with Twitter’s crawler and must be fixed on their end.

Other than caching more content on Cloudflare’s edge or improving your server’s ability to handle requests, you can try telling it to behave in your robots.txt.

Yeah you can look at improving Cloudflare CDN cache/cache hit rate to offload some of the extra Twitterbot requests as @cscharff mentioned.

@rami.zebian how many requests are you getting, and what’s the breakdown between static file requests (CSS/JS/images) and HTML page requests? Do these requests have query strings?

@cscharff I already tried Crawl-delay in robots.txt, as shown in the links above, but it looks like Twitterbot/1.0 does not respect it. Caching helps a lot but is only a temporary solution.

@eva2000 I’ve got around 90% of the content cached, but the URLs requested are not static. Caching helps as a temporary solution, but the bot does not request the same URL more than once. It sends a flood of concurrent requests to multiple URLs (1 hit for each URL). I shared some analytics in this previous thread.
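For reference, this is roughly what a Crawl-delay entry for Twitterbot looks like, checked with Python’s standard `urllib.robotparser`. The robots.txt content here is an assumed example, not the file from this thread. Note that Crawl-delay is a non-standard directive, so a crawler is free to ignore it, which matches what was observed above.

```python
from urllib.robotparser import RobotFileParser

# Assumed example robots.txt; the 10 second delay and paths are illustrative only.
ROBOTS_TXT = """\
User-agent: Twitterbot
Crawl-delay: 10
Allow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# What a well-behaved crawler identifying as Twitterbot/1.0 would read from the file.
delay = parser.crawl_delay("Twitterbot/1.0")
can_fetch = parser.can_fetch("Twitterbot/1.0", "/some-article")

print(f"crawl delay: {delay}s, may fetch /some-article: {can_fetch}")
```

This only confirms that the file is syntactically what you intended. It says nothing about whether the bot honors it.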

Interesting. Is this coming from many IP addresses? If so, rate limiting might not help. If you’re on a higher CF plan, you could probably use CF Waiting Room to cap the load to a level that is manageable by your origin server (https://developers.cloudflare.com/waiting-room/), and/or combine that with micro caching of dynamic pages on the origin side, i.e. if using Nginx + PHP-FPM, you could set up fastcgi_cache micro caching for heavily targeted URLs.

I assume you already filtered/verified by Twitter ASN so you’d be sure they’re legit requests? Seems like there’s not much info on why Twitterbot would act this way.

You can also extend origin-side request logging to inspect the request headers and see if there are any patterns in them that you can use to come up with CF Firewall rules/Transform rules. On the Nginx origin side, you can use the Lua Nginx or njs Nginx modules to log request headers.

@eva2000 It comes from one IP address, but the address keeps changing. Rate limiting works for now, but when the requests are blocked for X hours, Twitter Cards will no longer preview the page details when the page is shared on Twitter.

You could probably use CF Waiting Room to cap the load to a level that is manageable by your origin server

But this approach will affect legitimate users too.

I assume you already filtered/verified by Twitter ASN so you’d be sure they’re legit requests? Seems like there’s not much info on why Twitterbot would act this way.

I verified that the IPs are Twitter IPs.
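A check like this can be scripted with Python’s standard `ipaddress` module. The sketch below is only an illustration: the CIDR ranges are placeholders taken from the RFC 5737 documentation blocks, not real Twitter ranges, and `is_from_assumed_ranges` is a hypothetical helper name. Replace the list with the ranges you get from an authoritative ASN lookup.

```python
import ipaddress

# PLACEHOLDER ranges (RFC 5737 documentation blocks), NOT real Twitter ranges.
# Substitute the prefixes announced by the ASN you want to verify against.
ASSUMED_CRAWLER_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]


def is_from_assumed_ranges(ip, networks=ASSUMED_CRAWLER_RANGES):
    """Return True if `ip` parses as an address inside any of `networks`."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Malformed value in the log line: treat it as not verified.
        return False
    return any(addr in net for net in networks)


# Example: classify client IPs pulled from an access log.
for client_ip in ("192.0.2.10", "203.0.113.5"):
    print(client_ip, is_from_assumed_ranges(client_ip))
```

A user agent string alone is easy to spoof, so matching the source IP against known ranges is what separates the real crawler from impostors.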

You can also extend origin-side request logging to inspect the request headers and see if there are any patterns in them that you can use to come up with CF Firewall rules/Transform rules. On the Nginx origin side, you can use the Lua Nginx or njs Nginx modules to log request headers.

I suspect that Bot Fight Mode or some sort of JavaScript challenge is not letting Twitterbot/1.0 get through, and for some reason it keeps hammering the same pages over and over again. I created a Firewall Rule to allow Twitterbot/1.0, but I can still see rate-limited requests.

Here’s an example:

Knowing that the traffic sequence is as follows:

Bot Fight Mode is severely limited, as it doesn’t work with Cloudflare Firewall rules and would ignore your allow Firewall rule. In its current state, I usually disable Bot Fight Mode and use Firewall rules where possible.
