I’ve been following this topic for a while and noticed that it is a common issue that has received no input from the Twitter team. So I thought I’d raise it again and get some feedback and ideas from the community here.
Twitterbot/1.0 keeps on sending a flood of traffic to certain pages, causing an extreme spike in requests. Since I use Twitter Cards, blocking Twitterbot/1.0 is not the ideal solution.
Related thread on Cloudflare community
Other threads on Twitter community
https://twittercommunity.com/t/web-scraping-crawling-in-my-site-from-twitterbot-1-0/150273
https://twittercommunity.com/t/high-cpu-when-twitter-bot-visits-my-site/146637
https://twittercommunity.com/t/ddos-levels-of-website-traffic-generated-bytwitterbot/56831
https://twittercommunity.com/t/twitterbot-1-0-spam-ddos/16550
https://twittercommunity.com/t/twitter-bots-overwhelming-website/29273
Summary
Consequently, the only solution I ended up with is a Rate Limiting rule that blocks all requests exceeding threshold X in X seconds. This is a temporary workaround: any time Twitterbot/1.0 floods my origin with requests, I block it for a few minutes. It works, but Twitter Cards stop working for as long as the requests are rate-limited.
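For anyone who wants to reproduce the idea outside Cloudflare, here is a rough sketch of the same "threshold in a window, then block for a while" logic in Python. It is not how Cloudflare implements Rate Limiting; the class name FixedWindowLimiter, its parameters, and the choice of a fixed window are all my own assumptions for illustration.

```python
import time


class FixedWindowLimiter:
    """Hypothetical limiter: once a key exceeds `threshold` requests within
    `window_seconds`, block that key for `block_seconds`."""

    def __init__(self, threshold, window_seconds, block_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.block_seconds = block_seconds
        self._window_start = {}   # key -> start time of the current window
        self._count = {}          # key -> requests seen in the current window
        self._blocked_until = {}  # key -> time at which the block expires

    def allow(self, key, now=None):
        """Return True if the request may pass, False if it is blocked."""
        now = time.monotonic() if now is None else now

        # Still inside a block period: reject without counting.
        if now < self._blocked_until.get(key, 0.0):
            return False

        # Start a new window if none exists or the old one has expired.
        start = self._window_start.get(key)
        if start is None or now - start >= self.window_seconds:
            self._window_start[key] = now
            self._count[key] = 0

        self._count[key] += 1
        if self._count[key] > self.threshold:
            # Threshold exceeded: block this key for a while.
            self._blocked_until[key] = now + self.block_seconds
            return False
        return True
```

Keyed on the user agent (for example limiter.allow("Twitterbot/1.0")), this shows the same trade-off I described: while the block is active, every request from the bot is rejected, including the legitimate ones that fetch Twitter Card metadata.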
Hint
I saw the suggestion below on one of the forums and tried the same thing, but I can still see blocked requests in my rate-limiting graph.
For anyone wondering, I found the solution. Via Cloudflare we use javascript challenges, and Twitterbot was not getting through the JS challenge and for some reason kept hammering the same pages over and over. Once Twitterbot was set to be allowed and not challenged, it stopped hammering our service.
According to Cloudflare’s order of rule evaluation, after applying the above solution the requests pass through “Firewall Rules” without being challenged, but they are then still rate-limited by the “Rate Limiting” rule.
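To make that ordering concrete, here is a toy model in Python of what I am seeing. It is only a sketch of the behaviour described above, not Cloudflare’s actual implementation; the function names, the dictionary-based request, and the returned labels are made up for illustration.

```python
def firewall_rules(request):
    """Hypothetical Firewall Rules phase: allow Twitterbot, challenge the rest."""
    if "Twitterbot" in request["user_agent"]:
        return "allow"
    return "js_challenge"


def evaluate(request, limiter_allows):
    """Run the two phases in order.

    `limiter_allows` stands in for the Rate Limiting decision (True means the
    request is under the threshold). An "allow" in the firewall phase only
    skips the challenge; it does not skip the rate-limiting phase.
    """
    decision = firewall_rules(request)
    if decision != "allow":
        return decision
    if not limiter_allows:
        return "rate_limited"
    return "pass"
```

With this model, evaluate({"user_agent": "Twitterbot/1.0"}, limiter_allows=False) returns "rate_limited", which matches the blocked requests I still see in the graph even though the bot is allowed in Firewall Rules.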
[update]
Below is another, more recent example of how Twitterbot/1.0 hammers the origin with requests. 12.21K concurrent requests in less than 5 minutes!
Any suggestions or feedback? I’m starting to believe that this is an issue with Twitter’s crawler and has to be fixed on their end.