Massive rate-limiting issues with Worker in production

Hi there,

(edit: we’re on the paid Workers Unlimited plan.)

We’ve built a non-trivial worker (a smart caching layer for our global APIs), and once we pushed this worker to production we got many reports from customers about being rate-limited.

We’ve tried our hardest to fix this with the built-in Cloudflare tools (custom firewall whitelist, Page Rules) and by disabling all sorts of firewall features, but to no avail.

I noticed that the blocked requests in the firewall log included internal Worker requests (to our pseudo caching host, to our Google PubSub logging endpoint, etc.) and saw this in the documentation:

Cloudflare’s abuse protection methods do not affect well-intentioned traffic. However, if you send many thousands of requests per second from a small number of client IP addresses, you can inadvertently trigger Cloudflare’s abuse protection. If you expect to receive 1015 errors in response to traffic or expect your application to incur these errors, contact Cloudflare to increase your limit.

Is it possible we ran into this issue already? Is there hope to remove these restrictions for us? What’s the best way to go about this - we had to revert our production launch due to these issues. :-/

Many thanks in advance

To confirm, you’re on the $5/month paid workers plan?

Hey Judge :slight_smile:

The account affected is a different one (company account) than the one I’m posting with right now. We have Workers Unlimited and added a Pro plan for the domain once the problems became apparent (just in case).

Best

We’ve been using trivial workers for quite a while now (the bill has been around 60USD/mo including all requests).

This is the community forum, so apart from some CF employees, none of us have access to your account information.

As for the problem, workers shouldn’t be rate limited for regular users or browsers, perhaps unless you’re running an image board or have thousands of requests per page load being fired (unlikely).

I’d recommend contacting support via https://support.cloudflare.com/hc/en-us/requests/new from the account where you’re having issues. If you get an automated reply, reply to the email letting it know it didn’t resolve the issue. You could also post your ticket number here and an employee could make sure it hits the correct support channel.

We’re not running an image board but an API that sees quite a bit of traffic as well (ballpark: 1M hits a day). The worker itself queries the KV store, interacts with the Cache API and emits subrequests as well.

Thanks for mentioning support, I’ve created a ticket already and was posting here in the hopes of a potential quicker resolution (knowing from the workers beta days that devs and PMs tend to hang out here as well). :slight_smile:

If someone from CF staff is reading this thread:

Thank you for contacting Cloudflare Technical Support. Your ticket number is 1780994. Soon, you will receive an email confirmation with ticket details.

Given the scale of our worker deployment and production load I wouldn’t be too surprised if we hit a sort of soft-limit (hopefully a soft limit). AWS Lambda has these things as well (albeit documented).

Thanks again for your time :slight_smile:

There is a poorly documented rule that rate-limits workers that make subrequests. It mostly affects API users.

If you have an API user that is doing more than 2,000 requests a minute to your API for the same colo, zone and “eyeball IP” (and each of those requests issues a subrequest via your worker), you’ll be rate limited. The rate limit is basically 2k subrequests per minute per colo / zone / IP. Once you hit that, you start serving 429s.

And if you’re using Logflare, uninstall it. We have a workaround for this which should be rolling out next week. Basically though you’re going to have to keep counters, and stop issuing subrequests once you get close to 2k per minute per colo / zone / IP combo.
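
A minimal sketch of the counter idea described above, not Logflare's actual implementation. The 2,000 per minute figure is the one reported in this thread rather than a documented limit, and `SOFT_LIMIT`, `MinuteWindow` and `trySpendSubrequest` are hypothetical names. In a worker, the colo would typically come from `request.cf.colo` and the client IP from the `CF-Connecting-IP` header. Because the map lives in the worker's global scope, it only sees traffic handled by that one instance, so the real count may be higher.

```typescript
// Assumption: 2,000 subrequests/minute is the figure reported in this thread, not an official one.
const REPORTED_LIMIT_PER_MINUTE = 2000;
// Stop at 90% of the reported limit (an arbitrary safety margin).
const SOFT_LIMIT = 1800;
const WINDOW_MS = 60000;

interface MinuteWindow {
  startedAt: number; // timestamp (ms) of the first subrequest in this window
  count: number;     // subrequests issued so far in this window
}

// Global state: shared by all requests handled by this worker instance only.
// Note: entries are never evicted here; a real worker would need to prune old keys.
const subrequestWindows = new Map<string, MinuteWindow>();

// Returns true (and counts the subrequest) if one more subrequest may be issued
// for this colo/zone/IP combination, false once the soft limit is reached.
function trySpendSubrequest(
  colo: string,
  zone: string,
  ip: string,
  now: number = Date.now(),
): boolean {
  const key = `${colo}|${zone}|${ip}`;
  let w = subrequestWindows.get(key);
  if (w === undefined || now - w.startedAt >= WINDOW_MS) {
    // Start a fresh fixed window for this key.
    w = { startedAt: now, count: 0 };
    subrequestWindows.set(key, w);
  }
  if (w.count >= SOFT_LIMIT) {
    return false;
  }
  w.count += 1;
  return true;
}
```

A handler could call `trySpendSubrequest(...)` before each optional subrequest (logging, stats) and skip it when the function returns `false`, keeping the remaining budget for cache and origin fetches.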

Oh, and supposedly you can get this limit increased, but support has told me the opposite, so I’m not exactly sure who to get in touch with to actually get this lifted for you.

Hi @berstend, I just wanted to confirm that contacting Support is the right course of action. They should be able to identify if you’re hitting our anti-abuse measure and lift the limit for you.

Also, I’m sorry to hear that this impacted your launch. :frowning:

Hmm, the 2k limit you’re referring to is indeed one that can be lifted, and Support is supposed to be the correct avenue to do so. If you were told otherwise in a support ticket, could you share the ticket number?

Harris

Thanks! This is the most recent: #1758982

(07:01:52 AM) [email protected]: Is it possible to disable rule ID 6 of rate limiting? I thought rate limiting was turned off on our account
(07:02:35 AM) Shanshan: I am afraid not, the rate limit is in place to prevent customer from making so many subrequests

Ah, Logflare is yours :smile: I checked out the Logflare worker source, but by that time we had already implemented our own very similar request logging (catching responses > pushing straight to Google PubSub > BigQuery > Data Studio).
As a workaround, I would assume you want to fall back to collecting stats in buckets and emitting them at intervals when high load is detected? :slight_smile:

@harris thanks so much for your kind words :slight_smile: Our whole office was celebrating the launch of our new caching worker (to great success, the cache hits looked great) and having to revert it back afterwards was among the hardest things I had to do in my professional life :smile:

CF Support was super quick to respond, wanting to know more about the number of requests we intend to serve with the worker.

As this might be interesting to others I’m gonna quote myself here (ticket 1780994):

Hi Mike,
thanks so much for getting back to us :slight_smile:

We calculated a bit to give you a realistic estimate regarding our usage:

We’ve seen peaks of 60 req/s (= 3,600 requests per minute, roughly 5M per day); usually we get 2M API requests per day.

In addition our worker might do sub requests for those incoming requests, here’s the “worst case” scenario and the amount of sub requests:

Incoming requests handled by our worker (in a case with maximum sub requests):

  • (+1 request) Check the Cache API to see whether the result is cached
  • (+1 request) Fetch a new API result from one of our origin servers
  • (+1 request) Store that result in the Cache API
  • (+1 request) Emit the event to the Google PubSub stats endpoint
  • (+1 request) If an error occurred along the way: emit it to Sentry

PS: Some requests are proxied websocket requests (if this is relevant).

Thanks so much for your swift response, which is highly appreciated.

Please let us know if we can provide further information.

Our 3,600 requests per minute (x5 for subrequests in the worst case) are global, so I’m not sure we necessarily hit this “2k per colo” limit. But this might very well be our issue. :slight_smile:
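
Spelling out that arithmetic, assuming all five worst-case operations count as subrequests and taking the 2,000 per minute figure reported earlier in this thread (it is not an officially documented number):

```typescript
// Figures quoted in the support ticket above.
const PEAK_REQUESTS_PER_SECOND = 60;
const WORST_CASE_SUBREQUESTS_PER_REQUEST = 5;
// Assumption: the per colo / zone / IP limit reported in this thread.
const REPORTED_LIMIT_PER_MINUTE = 2000;

// Global peak load across all colos and client IPs.
const peakRequestsPerMinute = PEAK_REQUESTS_PER_SECOND * 60;
const worstCaseSubrequestsPerMinute =
  peakRequestsPerMinute * WORST_CASE_SUBREQUESTS_PER_REQUEST;

// Incoming requests per minute from ONE client IP at ONE colo that would
// exhaust the reported limit in the worst case.
const requestsPerMinuteToHitLimit =
  REPORTED_LIMIT_PER_MINUTE / WORST_CASE_SUBREQUESTS_PER_REQUEST;

console.log({
  peakRequestsPerMinute,         // 3600
  worstCaseSubrequestsPerMinute, // 18000
  requestsPerMinuteToHitLimit,   // 400
});
```

On these assumptions the global worst case of 18,000 subrequests per minute is spread over many colos and client IPs, but a single client IP at a single colo would reach the reported threshold at about 400 requests per minute, i.e. under 7 requests per second.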

Thanks again to all who responded here, I’ll make sure to update this thread with any new developments.

I’m conceptualizing a Plan B, in case the rate-limits cannot be lifted - might be of interest to other users interested in caching their API:

Update:

Unfortunately first level support couldn’t help us (still ticket 1780994):

If a worker attempts to make more than subrequests than the limit, fetch() will throw an exception resulting in an Error 1015. From what you have described below it does appear that you are above our threshold. Unfortunately, we are not at liberty to share the actual limit number for security reasons. I recommend reviewing the below documentation on our Worker Limits.

If you need more information from us, please feel free to respond to this support ticket with any additional information and we’ll be glad to assist.

I kindly asked her to escalate this issue to someone from the Worker team.

Hope this will end well for us. Would be a shame to throw away our amazing worker code :woozy_face:

Very weird that support says that they can’t help while the documentation clearly says to reach out to them for a limit increase…

We’re going to be stopping short of the limit (so our users’ sites don’t return 429s) and logging a special meta message so that we can alert that specific customer via email, so they can reach out to Cloudflare and make sure they’re not going to be limited.

Nice! This is what we’re doing except for the Pubsub part. We handle the requests directly. BigQuery is rad for structured logs.

I wonder how one might detect hitting that limit, or getting close to it. Observing fetch responses for the first 429 response might work.

The global scope of a worker is per-worker and not per colo (no idea if the number of workers per colo is known), so counting requests per worker would still not allow estimating if one is close to hitting the limit, right?

What I’ll probably be doing:

  • Collect stats globally per worker
  • Hijack the event.waitUntil of one request per minute
  • Emit batched stats once per minute
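
The three steps above could be sketched roughly as follows. This is an illustration under assumptions, not the poster's actual code: `recordStat`, `takeBatchIfDue` and `STATS_URL` are hypothetical names, and the counters only cover requests seen by one worker instance.

```typescript
type StatCounts = Record<string, number>;

const FLUSH_INTERVAL_MS = 60000;

// Global state: collected across all requests handled by this worker instance.
let pendingStats: StatCounts = {};
let batchStartedAt: number | null = null;

// Count one event (e.g. "cache_hit") instead of emitting a subrequest for it.
function recordStat(name: string, now: number = Date.now()): void {
  if (batchStartedAt === null) {
    batchStartedAt = now;
  }
  pendingStats[name] = (pendingStats[name] ?? 0) + 1;
}

// Returns the accumulated batch once a minute has passed since its first
// event, and resets the counters; returns null otherwise.
function takeBatchIfDue(now: number = Date.now()): StatCounts | null {
  if (batchStartedAt === null || now - batchStartedAt < FLUSH_INTERVAL_MS) {
    return null;
  }
  const batch = pendingStats;
  pendingStats = {};
  batchStartedAt = null;
  return batch;
}

// Inside the fetch event handler, the request that finds a batch due would
// carry the single stats subrequest (STATS_URL is a placeholder):
//
//   recordStat("cache_hit");
//   const batch = takeBatchIfDue();
//   if (batch !== null) {
//     event.waitUntil(
//       fetch(STATS_URL, { method: "POST", body: JSON.stringify(batch) }),
//     );
//   }
```

This turns one stats subrequest per incoming request into roughly one per minute per worker instance, at the cost of losing a partial batch if the instance is evicted before it flushes.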

Unfortunately, given how vaguely this rate limit is documented, I don’t know if that would fix our current rate-limit issues :slight_smile:

Oh, in addition I will wrap mission-critical fetch requests (cache lookup, origin requests) in a handler that will check for 429 responses and retry them a couple of times with exponential backoff.
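
A minimal sketch of such a retry wrapper. It only helps if the 429 is actually visible to the worker as a response status, which this thread does not establish, and each retry is itself another subrequest. The names `fetchWithBackoff` and `backoffDelayMs` and the delay values are assumptions chosen for illustration.

```typescript
// Minimal shape needed from a response, so the sketch is not tied to a runtime.
interface StatusLike {
  status: number;
}

// Delay before retry number `attempt` (0-based): baseMs * 2^attempt, capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 50, capMs = 1000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Calls `doFetch` and, while it keeps returning 429, waits and tries again,
// up to `maxRetries` additional attempts. Returns the last response either way.
async function fetchWithBackoff<T extends StatusLike>(
  doFetch: () => Promise<T>,
  maxRetries = 3,
): Promise<T> {
  let response = await doFetch();
  for (let attempt = 0; attempt < maxRetries && response.status === 429; attempt++) {
    await sleep(backoffDelayMs(attempt));
    response = await doFetch();
  }
  return response;
}
```

In a worker this might be used as `const res = await fetchWithBackoff(() => fetch(originRequest));`, where `originRequest` stands in for whatever request the cache lookup or origin call needs.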

Not knowing these limits makes it really hard to code against them unfortunately.

Problem is it’s not those subrequests serving 429s. I don’t think the worker even knows this is happening.

Having those subrequests be the ones to return a 429 maybe would be a less intrusive way for Cloudflare to handle subrequest rate limits.

I haven’t verified it yet but I think there’s a good chance rate-limited requests will return 429 within the worker and could therefore be caught.

My feeling is that, since a worker eventually returns either the result of a fetch or its own response, the rate-limit error only surfaces at that point, to the end user who made the initial request.

This would be in line with what the 1st level support mentioned (assuming the Error 1015 is in the body and 429 the status code):

If a worker attempts to make more than subrequests than the limit, fetch() will throw an exception resulting in an Error 1015

How are you planning to implement this “oh god we’re close to being rate limited” feature in Logflare, if you can share? :slight_smile:

I was going to maintain a bunch of counters as global variables.

@harris can you just tell us the best way to handle this?
