Periodic Cloudflare Worker datacenter allocation issues?

By way of background:

Limited testing (10-20k requests per day); a small number of users spread over a number of different ISPs/ASNs, all GEO-IPing as GB (they are in GB). Cloudflare free plan on the domain, with paid Workers.

The vast majority of these GB requests terminate in LHR (i.e. GB). Periodically, however, I can see the Workers being triggered in (very) far-flung DCs (e.g. Australia), at which point response times go through the roof - the users’ data must traverse the entire world to/from the Worker, as must any subsequent requests from the Worker itself back to API endpoints in GB.

I have manually tested based on the following:

  • Same source IP/AS
  • Based on the DNS response at the start of testing, hard-coded one of the IPs in /etc/hosts (i.e. all test requests are sent to the same IP)
  • Ping times (throughout the testing) to said IP are all ~6ms
  • Traceroutes, sent immediately prior to each test (new TCP session per test), are identical

3 requests, sent within seconds of each other - 2 ended up in LHR and 1 in BNE (AUS :sob:). Observed response times correspond with DC (i.e. AUS requests = slow).

On a more general note - this appears to occur sometimes just for a few hours (e.g. through the evening, in GB) or sometimes for a day or more - then it can be weeks before it occurs again. If I had to guess I suspect it first started occurring within the last year/6 months or so (note: with no noticeable change in users/number of requests/etc). Most recently this occurred 2021-04-28 ~16:10 > ~22:20 (BST).

I accept that by attempting to build a service on “the cloud”, in this case Cloudflare’s edge worker infrastructure, I’m entirely reliant on said provider’s ability (or willingness, in the absence of a Business or Enterprise plan) to service those requests in a timely manner.

I do not, however, think it unreasonable to expect requests not to be serviced by Cloudflare’s edge at the very farthest point on earth from the users initiating those requests. “Nice service you have there - how would you like its packets to travel around the earth twice?”

To the question at hand:

Has anyone else observed such wild mismatches between client GEO-IPs and CF Worker DCs?

Does Cloudflare’s Worker DC allocation logic/control have any regional awareness, and is this plan-specific?

Higher plans do get higher priority. I presume this extends to Workers compute too.

For example, most eyeballs in the UK were going to AMS or CDG in Feb and early March on the Pro $20 plan. On the free plan, eyeballs in Spain were skipping Madrid and going to Canada and sometimes further afield. I suspect your Worker observations will tally with watch -n1 curl -s https://www.yourdomain.com/cdn-cgi/trace. Recent data center upgrades have improved things, but there are still more upgrades in the next few months; where there is contention at the nearest eyeball ingestion point, the Free and Pro plans are booted out fairly quickly to anywhere from the next country to 6,000 km away.
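If you’d rather poll that trace endpoint programmatically, something like this rough sketch should do (assuming a runtime with a global fetch such as Node 18+ or Deno; www.yourdomain.com is a placeholder):

```ts
// Rough one-shot check: fetch /cdn-cgi/trace and pull out the "colo=" line.
// www.yourdomain.com is a placeholder - any proxied hostname on the zone works.
async function currentColo(domain: string): Promise<string> {
  const res = await fetch(`https://${domain}/cdn-cgi/trace`);
  const text = await res.text();
  // Trace output is plain key=value lines, e.g. "colo=LHR".
  return text.match(/^colo=(\w+)$/m)?.[1] ?? "unknown";
}

currentColo("www.yourdomain.com").then((colo) =>
  console.log(`Currently being served from colo: ${colo}`)
);
```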

I appreciate Cloudflare have to manage their (finite) resources as they see fit - although the possible absence (even if only for free plans?) of any sort of regional consideration when distributing traffic does seem like an oversight for a cloud optimization provider.

During yesterday’s period (listed above) ~40% of the requests from GB users executed in non-EU DCs.

Yes, I thought it was extremely odd too, especially the Madrid/Canada example (not Workers, just proxying) given they have dozens of data centers throughout Europe. Tried reaching out to support, but 99% of the time they don’t understand the routing themselves.

I’ve not contacted support, I’m keen to understand other people’s real-world experiences (especially on Business or Enterprise plans).

I suspect in support discussions it may be easy to conflate the potential issue(s) with problems on “the internet” (DNS/Anycast/etc). To my limited understanding - if the routing path between the client and Cloudflare’s AS is persistent for the duration of a given test window, then any DC-flinging that takes place within their AS (e.g. once my test traffic has arrived at their edge) is down to design decisions on Cloudflare’s part and should be within their gift to resolve.

Paying to play to gain yourself a higher position on the please-do-not-boot-my-users-off-local-DCs scoreboard is understandable; it is the post-booting behavior that I am keen to understand (and more so whether it is plan-specific). Irrespective of your plan, there is always the possibility Cloudflare need to spread your traffic to other DCs.

The problem occurred again on the 29th - 2021-04-29 ~18:40 > ~22:20 (BST) - ~70% of GB requests being executed in non-EU DCs*****. Of this 70% - ~60% Sydney, ~20% Osaka, ~10% Tokyo, ~6% Singapore.

It is entirely possible the real clients are being sent to these DCs by virtue of, and outside of Cloudflare’s control, EBGP Anycast issues. Although given my limited testing and observations (as above) during these periods - I suspect the edge-of-the-earth-DC-flinging is due to Cloudflare.

I don’t think many iPhone users ever thought Apple would intentionally slow down their phone as part of a software update. In a similar vein I do not imagine many people would consider that their Cloudflare-optimized traffic and/or Workers would end up banished to the edge of the earth. Though the comparison between the two situations is inexact, I believe from a mismatch-in-expectations perspective they may be quite similar.

*****Note: The percentage statistics are likely exacerbated upwards by virtue of keep-alive/sticky connections vs. relatively low number of users. That said, it wouldn’t be a problem at all if the traffic hadn’t ended up in APAC DCs in the first place.

setec, just to help your investigation (and maybe you’re already aware): the pair/triplet of IPs given out for your zone is different as you move up through the plans. This allows them to shift load on a plan-priority basis, so it isn’t persistent at an AS level, AFAIU (this isn’t my area though).

My guess is it’s all deliberate, and the anycast IPs you are given, presumably on different /24s (minimum announcement size?), don’t announce a path to regional data centers on the Free/Pro plans when there is contention. Why the next ingestion DC is sometimes 6,000 km away is an unknown though - possibly related to the EBGP issues you mention (well outside of my knowledge here).

https://cloudflare-test.judge.sh/ will also help, and I’m sure you know about just suffixing /cdn-cgi/trace onto any Cloudflare customer domain.

Keen to see what you learn as you’re taking a much more analytical approach than I ever have - do share your results.

You could try the 1.1.1.1 trick in Workers-routed DNS records.

So, instead of putting 100:: (or other nonsense) with :norange:, put 1.1.1.1 as an A record or one.one.one.one as a CNAME record with :ngrey: and try again. That way, I can connect to the local PoP in my country, which is mostly unavailable to Free plan customers.
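For reference, a sketch of applying that record change through Cloudflare’s DNS API - the zone ID, record ID, hostname and token are all placeholders, and this assumes the record stays proxied so the Worker route still matches:

```ts
// Sketch only: point an existing proxied, Workers-routed record at 1.1.1.1 via
// the Cloudflare API. ZONE_ID, RECORD_ID, hostname and token are placeholders.
const ZONE_ID = "your-zone-id";
const RECORD_ID = "your-dns-record-id";
const API_TOKEN = "your-api-token"; // needs Zone.DNS edit permission

async function pointRecordAtOneDotOne(): Promise<void> {
  const res = await fetch(
    `https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records/${RECORD_ID}`,
    {
      method: "PUT",
      headers: {
        Authorization: `Bearer ${API_TOKEN}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        type: "A",
        name: "app.example.com", // the Workers-routed hostname
        content: "1.1.1.1",
        proxied: true, // must stay proxied so the Worker route still fires
      }),
    }
  );
  const body = (await res.json()) as { success: boolean; errors: unknown[] };
  console.log(body.success ? "Record updated" : body.errors);
}

pointRecordAtOneDotOne();
```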

In “normal” times I don’t have any issue with Cloudflare’s distribution of my Worker requests around their DCs. Though helpful to know, using a workaround such as the above feels like (unintended) functionality that is likely to be removed by Cloudflare sooner rather than later.

Choosing to put all my eggs in someone else’s (i.e. Cloudflare Workers) basket, going into it fully aware of the possible risks associated with such a course of action, is one thing. Continuing to put my eggs in that basket when it has become apparent the basket may be tossed into the air with (increasing?) regularity, is another.

I think it may be prudent for me to seek out (or build) another Somebodyelsescomputerflare to fall back on should the non-regional DC-flinging increase (or my Workers be taken offline for 7 days without any help from support).

As an aside, and whilst it’s entirely possible I’m seeing patterns/behaviors where they don’t exist, the most recent DC-flinging episode (yesterday 2021-05-03 ~14:00 > ~22:00 BST) appeared different to the prior two (listed above) so I’m not sure if something has changed. Alas, the end result was still that some GB-clients’ Workers ended up being executed in Singapore (~17:30 > ~21:00 BST) :frowning_face: .

Any feedback gratefully received as to other people’s real-world experiences of Worker DC run-locations vs. client GEO (especially on Business or Enterprise plans). I am aware however this is reliant on actively reporting on/monitoring Workers, outside of the metrics Cloudflare provide in their dashboards, which I suspect may be a minority pursuit.

It appears the DC-flinging has now become a daily occurrence, albeit not always resulting in the Workers running in APAC DCs. Since my initial post I’ve found the following by kentonv (“Hey, I’m the tech lead of Workers”) on HN in March 2020:

“we don’t do any special load-balancing for Workers requests; they are treated the same as any other Cloudflare request. We use Anycast routing (where all our datacenters advertise the same IP addresses), which has a lot of benefits, but occasionally produces weird routes”

Points of note:

  1. The HN post is over a year old; things may have changed since then (re: Workers vs. other requests). @kentonv - are you able to expand on the current state of play re: Workers/Anycast/etc?
  2. Though it focuses on Anycast it doesn’t exclude the possibility of there also being a post-Anycast/higher-layer (non-regional-aware?) load balancer (DC-flinger™).
  3. If it is only Anycast at work then I’m not sure how this correlates with my (limited) testing observations re: pings/traceroutes/etc (per my initial post).

No, nothing has changed since that comment. Whether or not you are using Workers does not affect what colo your traffic is handled in.

I’m not sure if there is any post-anycast load balancing. The Workers team doesn’t handle that layer. I would be very surprised, though, if Cloudflare is intentionally sending requests to the other side of the world.

Are you sure these requests are actually from GB? The Geo-IP could be wrong. In particular, if these requests come from another worker, then they may all have the IP address 2a06:98c0:3600::103. Geo-IP databases map this IP address to GB, but in fact the IP address just means: “Request came from a Worker that may have run anywhere in the world.” If this is the IP you are seeing, then it makes sense that you’d see traffic all over the world for it.
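If you want to double-check, here is a minimal sketch (assuming the standard Workers runtime and module syntax, not your actual setup) that surfaces the executing colo and the connecting IP, which makes Worker-subrequest traffic easy to spot:

```ts
// Minimal sketch: echo the executing colo and the connecting IP so real end
// users can be told apart from Worker-to-Worker subrequests.
export default {
  async fetch(request: Request): Promise<Response> {
    // request.cf is provided by the Workers runtime; cast for plain TS typings.
    const colo =
      (request as Request & { cf?: { colo?: string } }).cf?.colo ?? "unknown";
    const clientIp = request.headers.get("CF-Connecting-IP") ?? "unknown";
    const country = request.headers.get("CF-IPCountry") ?? "unknown";

    // If clientIp is 2a06:98c0:3600::103, the "client" is another Worker and
    // the Geo-IP country tells you nothing about where the real user is.
    return new Response(`colo=${colo}\nip=${clientIp}\ncountry=${country}\n`, {
      headers: { "content-type": "text/plain" },
    });
  },
};
```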

Thanks for the response. I’m fairly certain the clients are all in GB, although putting aside the other clients - my test client (initial post) was from a UK (GB) colo.

Having sent two separate requests, so closely together/one after the other, with one ending up in LHR and the next in BNE - I stopped any subsequent investigation as the problem appeared (to me) to be outside of my control.

I don’t believe traffic is intentionally being sent to far-flung DCs by Cloudflare; it may be sent there unintentionally if a post-Anycast DC-flinger™ exists and it isn’t region-aware (or lacks the concept of metrics/“distance”). The DC-flinger™ would potentially be enabled when DC-booting is required (e.g. when moving traffic out of a DC), as a sort of belt’n’braces mechanism alongside the removal of routes.

I’ll script up some automated tests (incl. traceroutes/etc) to be run so that my testing can take place over a prolonged period and from more GB sources. The problem didn’t occur yesterday, although had done so every day for the three days prior. It has certainly occurred more often recently - before this it could be weeks between each instance of the offending behavior.
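For reference, this is the sort of thing I have in mind - a minimal sketch only, assuming a Node 18+ runtime with global fetch and a placeholder domain; the traceroute side would still be shelled out separately:

```ts
// Sketch: sample /cdn-cgi/trace on an interval and append timestamped colo
// observations to a TSV log, so DC changes over a long window become visible.
// DOMAIN is a placeholder; traceroutes would be captured separately.
import { appendFileSync } from "node:fs";

const DOMAIN = "www.yourdomain.com";
const LOG_FILE = "colo-log.tsv";
const INTERVAL_MS = 60_000; // one sample per minute

async function sample(): Promise<void> {
  const timestamp = new Date().toISOString();
  try {
    const text = await (await fetch(`https://${DOMAIN}/cdn-cgi/trace`)).text();
    const colo = text.match(/^colo=(\w+)$/m)?.[1] ?? "unknown";
    appendFileSync(LOG_FILE, `${timestamp}\t${colo}\n`);
  } catch (err) {
    appendFileSync(LOG_FILE, `${timestamp}\terror\t${String(err)}\n`);
  }
}

void sample();
setInterval(sample, INTERVAL_MS);
```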

In the event there is anything to report as a result of my testing/observations I’ll add it to this thread.

setec, presuming this is related to the ingestion point, rather than to ingestion-to-compute targeting or any specific Worker routing, see if you can get some input from Jerome Fleury (Jerome_UZ) on this - he is network lead and pretty receptive on the bird network. I’m eager to get definitive clarification too, although I’ve only ever seen cross-continent flinging (via /cdn-cgi/trace) on the free plans, so it isn’t currently a critical issue for us.

I guess my query can be summed up as:

  • Is there currently anything, other than network routing (Anycast) to Cloudflare’s edge, that would determine which DC a Worker is executed in? That is, if traffic arrived in LHR, something internal to Cloudflare wouldn’t then send it to another Cloudflare DC.

I completely appreciate the limitations of free plans, and it would be unreasonable to expect anything else. My concern is that if such a DC-flinger™ exists - how does it operate, and is it plan-specific? Irrespective of how much money is paid to Cloudflare, there is always the possibility (even if smaller than most, i.e. Enterprise) that they need to move your traffic from a DC. If a “dumb” DC-flinger™ exists and is the same for all plans, I suspect this would be an issue.

“I’ve only ever seen cross continent flinging” - is this based on actively seeking out such traffic (across all types of accounts), i.e. external logging and analysis of Worker DC locations, or only in response to (e.g.) user reports of slow responses?

If I hadn’t fairly comprehensively “dashboarded” my Workers’ various metrics, I may not even have noticed. As it happens it was plain as day what was going on (re: Worker DCs), although it’s entirely possible it was only so obvious due to the relatively low number of users.

If not actively looking for a problem, will it always show itself? Especially one which, until recently (possibly plan-specific), occurred so rarely.
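For anyone wanting to do similar dashboarding, the kind of instrumentation I mean is roughly this - a sketch only, assuming the Workers module syntax; X-Worker-Colo is simply a header name I made up for monitoring purposes:

```ts
// Sketch: wrap an existing handler so every response carries the colo that
// executed the Worker, making DC drift visible in external dashboards.
async function handleRequest(request: Request): Promise<Response> {
  // Placeholder for the real Worker logic; here we simply pass through.
  return fetch(request);
}

export default {
  async fetch(request: Request): Promise<Response> {
    const colo =
      (request as Request & { cf?: { colo?: string } }).cf?.colo ?? "unknown";
    const upstream = await handleRequest(request);

    // Responses returned by fetch() have immutable headers; re-wrap to add ours.
    const tagged = new Response(upstream.body, upstream);
    tagged.headers.set("X-Worker-Colo", colo);
    return tagged;
  },
};
```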

I was in Madrid last month, and whilst troubleshooting an issue on a lower-traffic site it just felt more sluggish than normal; I noticed via www.mydomain.com/cdn-cgi/trace that traffic was being served from YYZ and another North American DC instead of MAD. This property was due to upgrade to the Pro plan anyway, and that got MAD back as the ingestion point about 20 minutes later. This is where I presume the anycast IPs served by DNS are different as you move up through the plans (I forgot to note them).

On other properties we’ve seen AMS and CDG serving UK eyeballs on paid Pro plans, which I thought was odd due to LDN and MAN being in the UK. Support said that this is a side effect of plan priority; only Enterprise is guaranteed to get the nearest ingestion point (excluding China). Jerome Fleury clarified that a host of UK and European data center upgrades were to begin throughout March, and indeed in early March we then noticed Pro plan UK eyeballs were back to LDN and sometimes even MAN.

You refer to Workers, however my understanding is that the Workers would be computed in the same data center as ingestion, so you would be affected by the above first - as Kenton suggests, “Whether or not you are using Workers does not affect what colo your traffic is handled in.” It would be good to know for sure whether there is a second leg from ingestion to an available compute pool that might see your Worker execute in a different location than ingestion. tl;dr does ingestion colo always equal Worker compute colo?

https://cloudflare-test.judge.sh/ is dedicated to exploring which ingestion points are available per plan. IIRC there’s something on /cdn-cgi/trace that can identify which plan a property is on, although obfuscated?

“tl;dr does ingestion colo always equal worker compute colo?” - yep, that’s it in a nutshell. Though the reason for it not being processed in the same DC may not be specific to Workers.

AFAIK DNS and subsequently Anycast will determine your entry point to Cloudflare’s (AS) border; once it has passed these border/edge routers it is entirely within Cloudflare’s hands as to how it is distributed within their network. It is possible (TBC) that traffic “landing” in one DC is shipped off to another DC; that is the root of my query - does such a thing exist (a DC-flinger™), even if it is only (and rarely) used for DC-booting.

The edge, of course, doesn’t have to mean running in all 200+ data centers all the time. We’ve also been able to use containers on the edge ourselves by running them in off-peak locations and for non-latency-sensitive tasks. The scheduler for scheduled Workers, for example, runs on our internal container service. Since scheduled events don’t have an end user waiting on a timely response, we’re able to run events in data centers where it’s nighttime and the traffic levels are low.

Another great use case is running CI builds on the edge, though not for the reason you think. Web traffic in any particular location goes through daily cycles. During off-peak hours, a lot of compute is not used. These off-peak locations would be perfect for running batch work like builds in order to maximize compute efficiency.

TLDR, cross-continent happens for free tier/non-user interactive tasks. CF has enough private line backbone capacity nowadays to tunnel your HTTP request over to another DC/Continent. Argo Smart Routing, KV, etc.

A few weeks ago my USA local DC (Newark) had a documented outage; for 45 mins my free plan sites were served from Melbourne with 250 ms TTFB, while a paid plan was served from Ashburn/DC with under 100 ms TTFB. I’ve also seen DC splits inside the same HTTP/2 connection: my no-Worker static files are cached and local (Newark), then 50 ms later inside the same H2 connection my Worker is in ORD (Chicago) for an hour.

CF’s blog describes their Layer 4 and Layer 7 LBs using Linux XDP to permanently redirect/“NAT” away (egh, anycast??!?! - it’s not NAT if the dest IP stays the same) a TCP socket from one x86 CPU to another server/another NIC in the DC (a different DC?). They also hinted a server can issue a route update to their Juniper routers when needed. It’s also been written that all HTTP reqs/resps are parsed into Cap’n Proto packets, and then, well, any equipment, any rack, any DC can process the req - not just the Linux server at the end of a 10G peering link. CF TLS resumption, especially with H3 (I forget exactly how CF does it): on a cell-to-wifi switch the session ticket identifies the original rack server, and the new front-end server must tunnel back to the old front-end server in the same DC. WireGuard, same thing. IIRC if you switch from cell to wifi you don’t start a new CF Worker; you get tunneled to the old one in the same DC.

Also, CF might play games if your origin server isn’t in the UK - you didn’t say above where the origin server is. You only said 99% of your customers are UK. CF could be saving some micro-pennies by making EE/O2/Sky/Virgin transport to continental Europe (or Asia) rather than doing it on CF’s private backhaul. Where is your origin? Is your origin on another anycast network? :smiley:

Does CF think your origin is an anycast IP on AWS/Google/Azure and is equally fast in any DC anywhere on earth?

To summarize, if you are free, no promises about which POP. If you are paid, you need to open a support ticket.

In the instances of the most egregious increases in response times (e.g. UK>BNE>UK traffic):

  • The “origin” from a Cloudflare DNS settings perspective is X (AWS, Maxmind GEO-IPs to GB). This is irrelevant as far as the Worker is concerned, as the Worker decides the true origin(s) (if any), although as you’ve pointed out it may be used by Cloudflare in making (potentially flawed) DC decisions.
  • The “origin” as would be seen if you looked at the Worker’s network traffic is a Cloudflare Load-Balancer (i.e. paid service) IP(s).
  • The actual origins are servers in UK DCs.

Putting aside the basic distance/latency problem, it is also exacerbated by the seeming lack of consistency in the end CF DC location. Going from primarily one CF DC (LHR), which increases per-DC cache matches (I monitor this), to potentially tens of CF DCs in a day doesn’t do much to improve cache response times.
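For context on why the DC spread hurts the cache figures: the Workers Cache API (caches.default) is scoped to the data center the Worker runs in, so every new colo a client is flung to starts cold. A rough sketch of the pattern, assuming the Workers runtime and module syntax (not my exact implementation):

```ts
// Sketch: cache-on-miss inside a Worker. caches.default is scoped to the data
// center the Worker runs in, so every new colo a client lands in starts cold.
export default {
  async fetch(
    request: Request,
    _env: unknown,
    ctx: { waitUntil(promise: Promise<unknown>): void }
  ): Promise<Response> {
    // caches.default is a Workers-specific extension of CacheStorage.
    const cache = (caches as unknown as { default: Cache }).default;
    const cacheKey = new Request(request.url, request);

    let response = await cache.match(cacheKey);
    if (!response) {
      // Miss in this colo: go to the origin and keep a local copy for next time.
      response = await fetch(request);
      ctx.waitUntil(cache.put(cacheKey, response.clone()));
    }
    return response;
  },
};
```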

It hasn’t yet been confirmed if a DC-flinger exists, and if this is “dumb” (or flawed “intelligence”) for free plans but not for paid plans.

Lacking an Enterprise plan I can’t speak to the veracity of the below (mis?)characterization, although this HN post (~9 months ago) doesn’t provide much confidence:

"I highly doubt it will get better as they grow. I think it will get worst actually as it does with most larger organizations.

Even with an enterprise support contract the idea of calling is completely discouraged to the point it’s hidden behind menus of finding a customized code to call support. Their e-mail support is a pain also because there are different techs responding with different ideas to solve a question. They don’t have chat support which is annoying as well in this day and age. If you want things done, their account management team gets it done fast when on boarding.

Although the product is really good and has some limitations which are annoying but if you can deal without support this is a great product. Also their post postmortems are amazing."

The posts I’ve found on HN since starting this thread merely reinforce my view that having other/backup (non-CF) options available is the best course of action.

Upgrading to Cloudflare paid plans is my expected end state, whilst not assuming that in itself is the magic bullet; I should have more baskets for my eggs.

Deno Deploy is meant to be compatible with CF Workers, so that’s an option - you just have to migrate your page rules, transform rules, and auto-minification to a Worker first. Note they cheat quite heavily in their marketing though: they ingest traffic onto GCP’s premium network via the Global Load Balancer, which has a healthy number of edges but not as many as CF IIRC, and this is then sent to the nearest compute nodes per continent.

fly.io is also interesting with their wireguard mesh and anycast IPs

Minor update: this morning, ~02:20 - ~06:20 (BST), all my users had a pleasant Cloudflare (re)directed sojourn to AMS (maybe admiring the Parakeets in Vondelpark) before eventually returning to LHR.

As a layperson with no detailed understanding of Cloudflare’s traffic management practices this is the sort of thing I would expect: where required (e.g. free plan) clients being moved out of a DC, hopefully ending up in (ideally just) one or more local(ish) DCs for the duration. “Local” being in relation to the client from (ideally) a latency, or at least shortest AS-path/route, perspective.

In my case all users are in GB and in practical terms, off the top of my head, AMS would probably be my first choice (for a non-GB DC) to failover to.

Given the timings this feels more like something taking place within a maintenance window, albeit AFAIK not one scheduled/shown on cloudflarestatus.com, which I would classify as DC-moving. This contrasts with my previously observed DC-flinging (documented above), which took place outside normal maintenance windows and possibly in reaction to specific DC circumstances at the time (e.g. LHR/DDoS/etc.).

Note: It’s possible that since starting this thread DC-flinging has been improved to DC-moving, although I suspect not and it’s still two distinct behaviors.
