Enabling a "Cache Everything" page rule increases concurrent sessions from Cloudflare to origin by 200%

In the past few months, we have observed 2400~3000 concurrent sessions from Cloudflare on our origin server (HAProxy).

Now, we have created this page rule (anonymized domain):

https://domain.com/stories/*?*cf=2*
Cache Level: Cache Everything, Origin Cache Control: On

When we enable this rule, the number of connections from Cloudflare to our origin server ramps up from ~2500 to over 7000 in less than two minutes. When we disable the rule, the number of connections goes back to ~2500. Several DevOps engineers have tested this repeatedly to confirm that this specific rule is what causes the surge in connections.

As far as we can tell, the rule is working correctly: the Cloudflare dashboard’s cache statistics show that about 30% of the requests to /stories/* are served by Cloudflare without hitting the origin server. In theory, we would expect the number of connections to go down, not up.

We ran some tests sending a Connection: close response header from our web server; this makes the number of connections drop from ~7000 to about 50. However, it also increases response times by 100~150ms for our customers, as Cloudflare has to establish a new connection for every request.

This test seems to indicate that the behavior is related to keep-alive connections. However, we would like to better understand the reason for this rise in connections when the rule is applied. We would like to make better use of the cache, but we fear it may end up overloading our origin server if the number of connections keeps rising as we add new rules and cached routes.

  • Is there a specific reason for this behavior so we can better plan ahead?

  • Is it expected behavior for Cloudflare to keep this many connections alive because of this rule?

  • Are there any setup recommendations for our end?

Should it cache even the query-string variants (which weren’t accessed before, or which are actually being accessed by bots / vulnerability scanning tools, etc.) too?

May I ask which cache-related headers are set and sent from your origin host/server, so that Cloudflare can respect them?

Hello @fritexvz!

Should it cache even the query-string variants (which weren’t accessed before, or which are actually being accessed by bots / vulnerability scanning tools, etc.) too?

Oh I guess I didn’t provide enough details about this route, my bad.

This is an API route: the query parameters include city_id, and the route returns a JSON response with stories for the given city. So we configured Cloudflare to cache the response based on all query parameters, for a short period of time. We have about 10k concurrent visitors, and this is supposed to relieve our server load by serving cached API responses directly from Cloudflare.

Response headers from origin server:

curl -sD - -o /dev/null -H 'host: domain.com' 'http://127.0.0.1/stories/3828?selected_state=26&gender_ids%5B%5D=2&selected_city=3828&cf=2'
HTTP/1.1 200 OK
cache-control: max-age=60, must-revalidate, public
date: Mon, 11 Oct 2021 15:21:17 GMT
content-type: application/json
transfer-encoding: chunked

As I said before, the caching is working correctly, and I can see cf-cache-status: HIT when requesting through Cloudflare.

Are you using option http-server-close?

By default HAProxy operates in keep-alive mode with regards to persistent connections: for each connection it processes each request and response, and leaves the connection idle on both sides between the end of a response and the start of a new request. This mode may be changed by several options such as “option http-server-close” or “option httpclose”.

Setting “option http-server-close” enables HTTP connection-close mode on the server side while keeping the ability to support HTTP keep-alive and pipelining on the client side. This provides the lowest latency on the client side (slow network) and the fastest session reuse on the server side to save server resources, similarly to “option httpclose”. It also permits non-keepalive capable servers to be served in keep-alive mode to the clients if they conform to the requirements of RFC7230.

Please note that some servers do not always conform to those requirements when they see “Connection: close” in the request. The effect will be that keep-alive will never be used. A workaround consists in enabling “option http-pretend-keepalive”.

Hello @eva2000!

We did run experiments with both option http-server-close and option httpclose (rough config sketch below).

Considering our infrastructure:

Cloudflare <-> HAProxy (load balancer) <-> PHP-FPM running on 70 compute instances

option http-server-close

  • HAProxy closes the connection with the backend (PHP-FPM) but keeps the frontend (Cloudflare) connections alive.
  • Doesn’t seem to make a perceivable difference in our case.

option httpclose

  • HAProxy closes the connection with both the backend (PHP-FPM) and the frontend (Cloudflare).
  • Reduces concurrent sessions from >7000 to ~2500.
  • Increases response times by 100~150ms for customers (because Cloudflare has to establish a new connection for each request).
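
For reference, a minimal sketch of how we toggled these in haproxy.cfg during the tests (section names, server names and addresses are placeholders, not our production config):

# haproxy.cfg (sketch)
defaults
    mode http
    timeout connect 5s
    timeout client 60s
    timeout server 60s

backend php_servers
    # experiment 1: close only the server-side (PHP-FPM) connection,
    # keep client-side (Cloudflare) connections alive
    option http-server-close
    # experiment 2: close both sides (swap the option above for this one)
    # option httpclose
    server web1 10.0.0.11:80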

Although httpclose “solves” the number of active connections, the additional 100~150ms of response delay is unacceptable for our product: the performance degradation would hurt our SEO ranking, and the product’s revenue is deeply tied to SEO, so we can’t afford to increase response times.

Going back to the topic: we would like to understand the implications of these cache rules on the number of connections, because we want to cache more dynamically generated API and HTML responses, and our audience will likely grow from 10k to over 100k concurrent visitors in the near future.

Let me know if you need more information. :slight_smile:

Oops, small mistake in the numbers; it doesn’t seem I can edit the post:

  • Reduces concurrent sessions from >7000 to ~2500.

It actually reduces concurrent sessions from >7000 to about 50, but again, it’s not an acceptable solution due to the increased response times.

Combine it with option http-pretend-keepalive then.

By setting “option http-pretend-keepalive”, HAProxy will make the server believe it will keep the connection alive. The server will then not fall back to the abnormal, undesired behavior described above. When HAProxy gets the whole response, it will close the connection with the server just as it would do with “option httpclose”. That way the client gets a normal response and the connection is correctly closed on the server side.
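
A minimal sketch of that combination, assuming it lives in the backend section (names are placeholders):

backend php_servers
    # close the server-side connection after each response,
    # but don't advertise "Connection: close" to the server,
    # so it still answers as if the connection were being kept alive
    option httpclose
    option http-pretend-keepalive
    server web1 10.0.0.11:80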

It looks like a bugfix/workaround for certain servers, and it won’t really change the behavior of closing the connection with the client (Cloudflare), which would still result in the 100~150ms response time increase as far as I can see.

After reading the documentation numerous times, it seems the correct way to control keep-alive connections is through timeout http-keep-alive.

However, I tried several values (100ms, 50ms, 1ms) and none of them had any effect. The documentation then mentions that timeout http-keep-alive only applies to HTTP/1.1:

When using HTTP/2 “timeout client” is applied instead. This is so we can keep using short keep-alive timeouts in HTTP/1.1 while using longer ones in HTTP/2 (where we only have one connection per client and a connection setup).

Tweaking timeout client from 60s down to 5s finally made a significant difference: connections went down from ~8000 to ~1500.
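
For anyone hitting the same thing, a minimal sketch of the relevant timeouts (the values are the ones from our tests, not a recommendation):

defaults
    mode http
    timeout connect 5s
    # idle time allowed between requests on an HTTP/1.1 connection
    timeout http-keep-alive 100ms
    # HTTP/2 ignores the above; idle HTTP/2 connections fall under this
    timeout client 5s
    timeout server 60s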

It still doesn’t explain why enabling the page rule increases the number of idle connections, but at least we now have a way to control them if they become a problem.

A higher number of keep-alive connections actually seems like a good thing, as it increases the odds of Cloudflare reusing an already-open connection and serving a response to our customers faster.

You’d have to balance latency/response time against scalability, due to ephemeral TCP port exhaustion (IPs/ports). If you want it to scale in terms of concurrency, you’re unfortunately going to have to give up some latency.

No idea on the page rule though heh

Yep, with timeout client we can balance the number of connections/ports against latency; it’s a fair compromise.

Still no idea why the page rule increases the number of idle connections, though. It could be a symptom of CF making more requests to the origin server, but that wouldn’t make sense, as CF is clearly serving about 1/3 of the responses for the /stories/ routes without hitting the origin server:

The rule applies to only 20% of requests (241k out of 1.15m in a 30-minute period), with about 1/3 of those 241k served by Cloudflare. In theory, the rule should have a small impact and should decrease the total number of origin requests, and thus the number of connections to the origin.

Without the page rule, the 1.15m requests per 30 minutes are served over 2500~3000 concurrent connections.
With the page rule serving about 7% of all requests directly from CF without hitting the web server (1/3 of the 20% matched by the rule), we would expect to see fewer connections; instead, the number of connections rises to 7000~8000.

I was thinking maybe cache revalidation could be at play, because I see some 304 Not Modified responses when requesting through Cloudflare from a browser. But even then, it wouldn’t make sense for revalidation to use more connections than the DYNAMIC (uncached) routes did before this page rule.

You can also regain some of the lost latency if you tune your PHP-FPM backend pools to ensure a more optimal ratio of PHP-FPM child processes to CPU threads, so PHP child processes are serviced more efficiently.

One thing that may cause this is cache misses: the CF CDN cache is per CF datacenter, so in theory, in the worst case you can have up to ~250 cache misses per URL, i.e. one cache miss per CF datacenter. With your max-age=60, each datacenter could in the worst case come back to your origin once a minute for every cached URL, so the average cache TTL determines how many cache misses are sent back to your origin over time. Though with CF releasing Tiered Caching to all CF customers, that should in theory have lessened connections to the backend too: https://blog.cloudflare.com/orpheus/. Unless you haven’t enabled CF Tiered Caching?

Oh, I didn’t know that Argo Tiered Cache is now free, thanks! We are wrapping up our workday and tomorrow is a holiday here, so I’ll bring this info to the rest of our infrastructure team on Wednesday.

Argo seems tempting; it is supposed to decrease our cache misses. However, all of our target audience is in a single country at the moment, and if Argo makes DYNAMIC responses slower, that might be a problem: our dynamically generated HTML pages are not cached at the moment.

I’ll run some tests with the rest of the team on Wednesday and post results here then. :slight_smile:

Thanks for the info! :beers:

1 Like

Hello!

Here are the test results with Tiered Cache:

  • Stories cache (max-age=60) hit rate increased from 38% to 52%
  • Static assets cache hit rate increased from 81% to 89% (cache misses decreased by >40%)
  • The number of connections and response times remain the same

So yes, Tiered Cache provides a significant boost to the cache hit ratios, and it doesn’t seem to negatively affect anything else.

The number of connections resulting from the page rule is still a mystery, but it can be controlled through the client inactivity timeout (e.g. timeout client 30s).

Nice, as expected from Tiered Cache :slight_smile:

Are you sure the visitor traffic makeup is the same as usual, or could you have some malicious clients holding connections open and inflating the connection count, i.e. Slowloris-type attacks?

Currently, the web server only accepts connections from the Cloudflare IP ranges. Tests with option httpclose show that the number of simultaneously active connections is around 50, so the 4000+ remaining connections must be idle keep-alive connections from CF.
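
For context, the allowlisting happens at the HAProxy frontend, roughly like this (a sketch; the certificate path and IP-list file are illustrative, with cloudflare_ips.lst holding Cloudflare's published ranges):

frontend https_in
    bind :443 ssl crt /etc/haproxy/certs/domain.pem
    # reject any TCP connection that doesn't come from Cloudflare
    acl from_cloudflare src -f /etc/haproxy/cloudflare_ips.lst
    tcp-request connection reject if !from_cloudflare
    default_backend php_servers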

I guess it’s not so bad after all if it allows CF to reuse connections and provide faster responses. Limiting client inactivity time (timeout client) or sending a Connection: close response header (option httpclose) can rein in the number of connections if it ever becomes a problem.

We are also able to add more web servers to the load balancer, and we intend to migrate our infrastructure to Kubernetes in the future, so I guess this is not an issue. I’ll check with the rest of the team before closing this topic.

Thanks for all the help so far!
