Corrupt download in some regions only

Hi there

I’ve been using CF successfully for years now, as a cache for a 10 MB file which gets a few thousand hits whenever I update it. The files have unique names, and the service has been perfect the entire time.

A few weeks ago I did another release. Ever since (three releases later) I’ve been receiving user reports complaining about corrupt files being downloaded. The files are always a few bytes short, and the checksum fails. But for 99.9% of all users it’s flawless, as it always has been.

At first I thought the cache might be serving an invalid file. I flushed the caches and monitored the origin’s file. According to its access log, all responses have the same size, to the byte, so I must assume that CF received a correct file. And yet the same users still experience the same problem. Even after updating the file with a new release, those same users can’t download the new file either.
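To make that comparison concrete, the origin copy can be fingerprinted directly on the server and compared against the access log and against what clients actually receive (a minimal sketch; the file name is a placeholder):

```shell
# Run on the origin server; "release.zip" is a placeholder file name.
f=release.zip
# SHA-1 of the file on disk (shasum on macOS/Perl boxes, sha1sum elsewhere).
sum=$( (shasum "$f" 2>/dev/null || sha1sum "$f") | awk '{print $1}')
size=$(wc -c < "$f" | tr -d ' ')   # exact byte count
echo "sha1=$sum size=$size"
# Compare "size" against the response sizes in the origin access log,
# and "sum" against a checksum of the file a client actually received.
```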

And here’s the most interesting part: all of them seem to be living in Germany. Does anybody know of issues which arose in a single region or data center only? Any other idea what might be wrong here?



Can you get your hands on the response headers which come along with the download?

Good point! I’ll try to do that.

Headers as logged by the application look ok:

"accept-ranges" => "bytes",
age => 93_918,
"cf-cache-status" => "HIT",
"cf-ray" => "58af094c4a44dfcf-FRA",
"cf-request-id" => "02614423b10000dfcf25a37200000001",
connection => "close",
"content-length" => 8_172_756,
"content-type" => "application/zip",
date => "Tue, 28 Apr 2020 07:23:10 GMT",
etag => "\"7cb4d4-5a43ed4d1cc40\"",
"last-modified" => "Mon, 27 Apr 2020 05:17:29 GMT",
server => "cloudflare",
"x-powered-by" => ["PleskLin", "PleskLin"],

Similar or identical to what I’d get using curl on my own machine:

HTTP/1.1 200 OK
Date: Tue, 28 Apr 2020 07:33:22 GMT
Content-Type: application/zip
Content-Length: 8172756
Connection: keep-alive
Set-Cookie: ...
Last-Modified: Mon, 27 Apr 2020 05:17:29 GMT
ETag: "7cb4d4-5a43ed4d1cc40"
X-Powered-By: PleskLin
X-Powered-By: PleskLin
CF-Cache-Status: HIT
Age: 94530
Accept-Ranges: bytes
Server: cloudflare
CF-RAY: 58af1841a8841f1d-FRA
cf-request-id: 02614d7d0800001f1d7a82c200000001

Alas, the application is reporting a timeout just about 1 kB short of fully downloading the file. It’s the first time I’ve got this log; I’m waiting for more. But from what I’ve seen in the past, the downloaded file was always just a few (kilo)bytes short of the full file.
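For reference, one quick client-side check is to compare the advertised Content-Length against the bytes that actually landed on disk (a minimal sketch; the URL and paths are placeholders):

```shell
url="https://example.com/release.zip"   # placeholder URL
out=/tmp/release.zip
# Advertised size from the response headers (case-insensitive match).
clen=$(curl -sI "$url" | tr -d '\r' | awk 'tolower($1)=="content-length:" {print $2}')
curl -s -o "$out" "$url"
got=$(wc -c < "$out" | tr -d ' ')
if [ "$clen" = "$got" ]; then
  echo "complete: $got bytes"
else
  echo "short: got $got of $clen bytes"
fi
```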

The file was served in both cases from Cloudflare’s cache. Also, the content length in both cases is identical. How big is the file you eventually got in both cases?

Some info about the systems of the users who experience this issue would be good.

OS: Win/Mac/Linux?
OS Distribution:
OS Version:
Browser used: Chrome/Firefox/Edge/IE/Opera
Connection: Cable/WiFi/Mobile?
Provider: Telekom/Vodafone/UnityMedia?

After that we could analyse this a bit further.

Hang on, are you saying the issue is not a successful download with a corrupted file, but an actual timeout? Is that a 524? How long does the download typically last?

Also, you said 10 megabytes and a few bytes missing. The file here is about 7 megabytes and now you were referring to kilobytes. So it is not just a handful of characters?

Can you post the link in question?

I only have one fully documented failure case so far. I’m still waiting for the others to report back. Aforementioned failure was using:

Platform: Raspberry Pi 3
OS: Raspbian (not sure about the version)
“Browser”: Perl based application (Logitech Media Server)

The status code given was still a 200. My initial number of 10 MB was wrong; it’s around 8 MB, and the number given in the header is correct. If I download the file on my machine (macOS 10.15.4) using curl and pipe it right into shasum, I get the correct checksum. In the above case, the download of the file minus 1,403 bytes took less than two seconds, then another ten passed before the application gave up.

As I can’t reproduce the issue myself (and I know there are thousands of happy installations), I’m slow to provide information: I have to get those three users to send me the correct log files etc. I’ll ask about the provider, too, as I think all three of them are based in Germany.

So you always get a 200? Where did that “timeout” occur then?

The first header excerpt was from your Perl application? Was the download cut short in that case? As the content length appears to be the expected one, that would suggest the body actually received falls short of what the header declares. Could there be any obscure TLS issue on your Raspberry which cuts off the connection towards the end? Just speculating :slight_smile:

Can you provide the link?

Yes, the first snippet is Perl’s.
The application ends the download after 10 s of inactivity.
And yes, it’s the Perl application which doesn’t receive the full file; it reports “Timed out waiting for more body data, returning what we have”.

The platform is a good question. But I believe one of the reports was on Windows…

$ curl -s | shasum

Can you also run a cURL call on your Raspberry machine? That way we could check whether it is a general issue with that Internet connection/Raspberry or specific to your application.

But from what I understand, the issue is not so much that the download is incomplete, but rather that you cancel it when it takes more than ten seconds, right?
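If it helps, curl can be told to apply a similar inactivity cutoff, so both behaviours can be compared under the same rule (a sketch; the URL is a placeholder):

```shell
url="https://example.com/release.zip"   # placeholder URL
# Abort if the transfer rate stays below 1 byte/s for 10 consecutive
# seconds, roughly mimicking the application's 10-second inactivity timeout.
if curl -s --speed-limit 1 --speed-time 10 -o /tmp/release.zip "$url"; then
  echo "download completed"
else
  echo "curl gave up, exit code $?"
fi
```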

I’ve already asked that one responsive user to do the curl check.

And yes, it seems to be the application giving up.

Ugh… just got more information back from that user: curl succeeds. So it’s in our code, which is very surprising, as there are thousands of successful installations out there. Or a combination of code, ISP (Unitymedia), and underlying software? I’m running out of ideas.

Or maybe all those thousands of “happy” users just didn’t realise the update failed. I’ll set up a Pi here.

Maybe you could “reconfigure” the application to run cURL instead. I’d look into TLS and HTTP issues, maybe some of the headers are not handled properly.
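To dig into the TLS side, curl’s verbose transcript captures the handshake and every request/response header, and can be diffed between a working and a failing machine (a sketch; the URL is a placeholder):

```shell
url="https://example.com/release.zip"   # placeholder URL
# -v writes the TLS handshake and all request/response headers to stderr;
# the body itself is discarded, only the transcript matters here.
curl -sv -o /dev/null "$url" 2> transcript.log
grep -Ei 'SSL|TLS|HTTP' transcript.log   # handshake and status lines
```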

Interesting: one of the other reporters posted his issue from an IP address owned by Liberty Global - which is the owner of Unitymedia, the ISP used by the first reporter.

I’d file this as a coincidence. Considering that the other user could successfully download it via standard means, I’d say the issue is somewhere in the application code.

I’m not sure it’s a coincidence, as it has happened with four different versions of the file for the same users and started around the same time.

But there are new pieces of information trickling in… that one responsive user ran all his tests with the application itself from Docker containers on a Mac and a Pi 3. But I’m not sure about the curl tests yet. Yet another variable to try to get out of the picture.

And another update: my user reports that running the same Docker image behind a different ISP’s connection works… It’s some oddity between our application, Unitymedia and some files only. Is this really possible?

Not really :smile:

Well, technically there could always be some obscure combination but it is unlikely. When they download the file via standard means, does it always work? In that case you’d know there must be some issue with your application.

Agreed, it’s likely the application. But that thing works fine for hundreds of thousands of users all over the globe, and even for the user we’re talking about here for most of the files he downloads. It’s really only this one file hosted on CF which causes issues. An odd combination of app bug, ISP and hosting service… how do you track that down if you can’t reproduce it yourself?

You need to debug your code, and if you can’t do that remotely, simply stuff your code with debug statements and have the user in question run it until you figure out what the issue is. Maybe the user can provide some SSH access.
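If getting debug statements deployed is hard, a wire-level capture from the affected machine can serve a similar purpose: curl’s trace log records every header and data chunk with byte offsets, which can then be compared against what the application reports (a sketch; the URL is a placeholder):

```shell
url="https://example.com/release.zip"   # placeholder URL
# --trace-ascii records every header and data chunk with byte offsets.
curl -s --trace-ascii trace.log -o /tmp/release.zip "$url"
tail -n 5 trace.log   # the final entries show where the stream actually ended
```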