I like the idea of session affinity, so that under normal circumstances a user continues to interact with the same server, but I was surprised by how sticky it is. For example, I had the server I was connected to return persistent 5xx error codes, and no amount of refreshing ended the session or rerouted the user to a functioning server. Mimicking a timeout yielded the Cloudflare timeout error page, but that still didn’t clear the cookie. Only a failed health check seems to trigger a change.
But Cloudflare could also look at the actual user traffic to determine health (and, incidentally, to gather more RTT metrics for dynamic routing in the meantime). I don’t expect Cloudflare to be too “intelligent” about gauging health (that could be a real rabbit hole), but I would think a few basics would make the user experience better, such as (a) detecting more than one 5xx code within one minute, or (b) detecting a server timeout (or other errors that trigger the normal Cloudflare error pages). Depending on the severity, it could either delete just that user’s cookie, or mark the server as “down” immediately for overall pool management until a health check succeeds.
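To make the idea concrete, here is a rough sketch of the kind of rule I mean, written in Python. It is not how Cloudflare works today, and every name in it (`PassiveHealthTracker`, `record`, the action strings) is made up for illustration. It assumes rule (a) means "more than one 5xx from the same server within a 60-second sliding window" and that any timeout is severe enough to mark the server down.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60      # "within one minute", as suggested above
MAX_5XX_IN_WINDOW = 1    # more than this many 5xx responses trips the rule


class PassiveHealthTracker:
    """Hypothetical helper: judges origin health from real user traffic."""

    def __init__(self):
        # origin name -> timestamps (in seconds) of its recent 5xx responses
        self._errors = defaultdict(deque)

    def record(self, origin, status, now, timed_out=False):
        """Return 'ok', 'clear_affinity', or 'mark_down' for one response."""
        if timed_out:
            # Rule (b): treat a timeout as severe (an assumption).
            return "mark_down"
        if not 500 <= status <= 599:
            return "ok"
        recent = self._errors[origin]
        recent.append(now)
        # Drop errors that have fallen out of the sliding window.
        while recent and now - recent[0] >= WINDOW_SECONDS:
            recent.popleft()
        # Rule (a): more than one 5xx inside the window.
        if len(recent) > MAX_5XX_IN_WINDOW:
            return "clear_affinity"
        return "ok"
```

With this, a second 5xx from the same server inside a minute would return `clear_affinity` (drop that user's cookie), while a single isolated error would be ignored.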
Is this kind of “end user experience” focus on the development roadmap for load balancing? Are there any recommended approaches for approximating this functionality? (For example, using a Rule to detect the 5xx code in the header, though it’s not clear whether a Rule can currently clear the __cflb cookie.) I guess I could try to get our 5xx pages to clear the cookie, but that wouldn’t help with timeouts or other more serious server issues.
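For the "get our 5xx pages to clear the cookie" idea, this is roughly what I have in mind on the origin side, using only Python's standard library. The function name is made up, and I have not verified that it works: it assumes Cloudflare passes an origin `Set-Cookie` for `__cflb` through to the browser, and that the cookie was set with path `/` and no explicit domain, since a browser only deletes a cookie when those attributes match.

```python
from http.cookies import SimpleCookie


def expire_affinity_cookie_header(name="__cflb", path="/"):
    """Build a Set-Cookie value asking the browser to drop the cookie.

    Hypothetical helper; the path default is an assumption about how
    the affinity cookie was originally set.
    """
    cookie = SimpleCookie()
    cookie[name] = ""                 # empty value
    cookie[name]["path"] = path       # must match the original cookie
    cookie[name]["max-age"] = 0       # expire immediately
    return cookie[name].OutputString()
```

The returned string would go in a `Set-Cookie` header on our own 5xx responses. Even if it works, it only covers errors the origin is still healthy enough to serve itself.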