Random 502 errors for the last 3 days (caused by an illegal request header injected by CF's reverse proxy)

Was it only an IIS issue? I mean, was this header injected only into requests sent to IIS, or to Apache users as well?

I don't think it was only sent to IIS. But I'm not sure whether Apache ignores it or errors out like IIS does.

Apache would likely ignore it, because even on IIS, if you're not using ASP.NET Core, no error seems to occur.

It was a Cloudflare issue, which seems to be fixed now. A "Transfer-Encoding: chunked" header was being added to requests by some of their servers, but not all.
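
If you want to see what that looks like on the wire, here's a minimal sketch (the origin host is hypothetical, and this is my reconstruction of the bad request, not a capture of Cloudflare's actual traffic):

```python
# Rough sketch (hypothetical origin): send an HTTP/1.1 request carrying
# "Transfer-Encoding: chunked", like the affected Cloudflare servers did.
# ASP.NET Core behind IIS/ARR rejects this, which surfaces as a 502 at the edge.
import socket

HOST = "origin.example.com"  # hypothetical origin, replace with a test server

request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {HOST}\r\n"
    "Transfer-Encoding: chunked\r\n"  # the injected header
    "Connection: close\r\n"
    "\r\n"
    "0\r\n"                           # zero-length chunk terminates the empty body
    "\r\n"
)

with socket.create_connection((HOST, 80)) as sock:
    sock.sendall(request.encode("ascii"))
    reply = sock.recv(4096).decode("ascii", errors="replace")
    print(reply.splitlines()[0])  # status line: a strict origin may answer 400/502
```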

Thanks, though the question was for @stefano1 :slight_smile:

But I was the one who told @stefano1 what was happening before they realized the issue.

Again, the question was rather for him; I was looking for an "official" statement.

I don’t think you will get one. But good luck.

I hope you're right, as I'm equally curious why this bug only happened on some of the servers. Also, apart from adding a wrong header to the request, some of the other headers were lowercased as well.

It sounds like a failed canary release leaked some testing/debugging code…

Hey @sandro, really sorry for the late reply.
That was not a Cloudflare issue, but a weird interaction between the MS IIS server and HTTP/2.

I will try to share some more details in a second phase.


No worries :slight_smile:

And yes, please do share. Such insight is always appreciated. Particularly when it actually was a Cloudflare issue.
I noticed you said it wasn't, but the origin connection is always HTTP/1.1 anyhow, and if that header really showed up in the request, I'd say it was Cloudflare :slight_smile:. Anyhow, feedback is most certainly appreciated.


I would love to hear a bit of an explanation of why it was not a Cloudflare issue, too :rofl:

@stefano1, would you already have some update on this?

@sandro @stefano1

Here’s the reply I got from my support case when I asked for a root cause / retrospective:

"The issue is related to testing we're doing. These tests are currently being deployed on only a small subset of traffic while we look to improve the service ahead of a wider announcement. We apologize if these errors caused any issues for customers."

Testing in production is fine. But you need to be ready to notice errors and revert quickly. And when you fail to do that (as is the case here), we need a retrospective to see exactly what went wrong and what you're doing to ensure that you catch the errors the next time you start testing something.

So my guess above is correct. And they claim it’s not Cloudflare’s fault.

To clarify here: the issue has been identified as an interoperability issue with IIS10 and HTTP/2, and it has already been worked around in our software. We're going to get someone from our Cache team to explain a bit more here in due course. Sit tight :slight_smile:


@sandro,

I feel that's why they're not answering. All requests from Cloudflare to the origin are HTTP/1.1 (or 1.0). I don't know how HTTP/2 comes into the picture.

@user3011,

I think I know why HTTP/2 is in the picture. Per the earlier response, they're testing a new release, and that release may be acting as an HTTP/1.1 → HTTP/2 proxy, so the communication from the CF server to the web hosting server gets converted from HTTP/1.1 to HTTP/2. (But obviously the first request still arrives in 1.1.)

I found something related to this header in "How http2/http1.1 proxy handle the Transfer-Encoding?" on Stack Overflow. But I have a feeling this header should still be in the response, not the request.
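
For what it's worth, here's a rough sketch of the header cleanup an HTTP/1.1 → HTTP/2 converting proxy is supposed to do per RFC 7540 section 8.1.2.2. This is just my reading of the spec, not Cloudflare's actual code; it would also explain the lowercased header names mentioned earlier, since HTTP/2 requires lowercase names:

```python
# Rough sketch, per my reading of RFC 7540 section 8.1.2.2 (not Cloudflare's
# actual code): a proxy converting HTTP/1.1 requests to HTTP/2 must drop
# connection-specific headers before forwarding. Skipping this step would
# produce exactly the illegal Transfer-Encoding header seen in this thread.
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-connection",
    "transfer-encoding", "upgrade", "te",  # HTTP/2 allows only "te: trailers"
}

def sanitize_for_h2(headers: dict) -> dict:
    """Return headers safe to forward over HTTP/2. Names are lowercased,
    which HTTP/2 requires, and which would explain the lowercased headers
    observed earlier in this thread."""
    # Any header named in the Connection header is also hop-by-hop.
    listed = {
        token.strip().lower()
        for value in (v for k, v in headers.items() if k.lower() == "connection")
        for token in value.split(",")
        if token.strip()
    }
    return {
        k.lower(): v
        for k, v in headers.items()
        if k.lower() not in HOP_BY_HOP and k.lower() not in listed
    }

# Example: the problematic request headers before and after sanitization.
print(sanitize_for_h2({
    "Host": "origin.example.com",    # hypothetical origin
    "Transfer-Encoding": "chunked",  # the injected header, dropped here
    "Connection": "keep-alive",
}))
# -> {'host': 'origin.example.com'}
```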

And to add a bit more detail here, @simon: it's not IIS 10 itself, but IIS 10's ARR proxying to ASP.NET Core that causes the issue. We didn't see the same issue with classic ASP.NET.
There's some detailed discussion on the .NET side regarding this issue, but I'm not sure whether it's relevant:
HTTP2: Disallow sending Connection: header and Transfer-Encoding: chunked · Issue #26926 · dotnet/runtime · GitHub

Hi everyone,

Alex here from the Cache Team. First, I'm sorry for any issues you've experienced related to errors as we test some new elements of our CDN architecture. The support team is correct here insofar as there was an interoperability issue with IIS10 and HTTP/2. Please keep reporting any connectivity bugs you notice so that we can release a more fully baked product, and keep your eyes trained on the blog for an announcement in the coming weeks.


@akrivit Thanks for the additional details. Can you tell us more about what type of monitoring you do when you test in production like this? Given the impact this had, I would have imagined you would have seen a spike in error rates. Do you do any monitoring along those lines? What thresholds need to be hit for you to initiate a rollback of changes like this?

As you can tell from this thread, there were 5-7 days of significant failures on each of our sites. It took a significant amount of effort in threads like this one and in support cases to get anyone on the CF side to look into things. What is being done to improve this going forward?


@taylor4,

They probably didn't see any significant spike in error rates, because 1. only a small number of servers got the deployment, and 2. it only impacts IIS + ASP.NET Core behind ARR, which is only a small portion of all web hosting setups.

As for the owners of the affected sites, most of them probably didn't notice this either, since you won't see any error log on the hosting server unless users report the problem. But a 502 is a very tricky error, as Cloudflare always blames it on the hosting server, and the truth is that in most cases it is the hosting server. I agree with you, though: they need to put other measures in place to monitor canary releases. If you cannot verify it properly, it's not a "canary" release.
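
To put rough numbers on that dilution effect (every figure below is a made-up assumption for illustration, not Cloudflare's actual data):

```python
# Back-of-the-envelope sketch of why the spike was easy to miss in aggregate
# monitoring. All numbers below are assumptions for illustration only.
baseline = 0.001        # assumed steady-state 502 rate: 0.1%
canary_fraction = 0.01  # assumed: 1% of traffic routed through the test release
affected_share = 0.05   # assumed: share of traffic hitting IIS/ARR + ASP.NET Core
failure_rate = 0.5      # assumed: half of those requests 502 on the canary

spike = canary_fraction * affected_share * failure_rate
print(f"aggregate: {baseline + spike:.5f} vs baseline: {baseline:.5f}")
# aggregate: 0.00125 vs baseline: 0.00100, a bump of 0.025 percentage points,
# invisible unless you slice error rates per canary cohort and per origin type.
```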

The only thing I'm not happy about here is that it cost me an entire day to investigate the issue, during a very busy period. And after I found the cause, they ignored me. @akrivit, check how long it took your support team to respond to my request, even after I included the details of the cause. If I hadn't started tweeting about it, the ticket would have kept being ignored, or just been closed as a regular 502 timeout error.