Strange request with curl and a Worker

We are using a “Cache Everything” Page Rule to cache HTML, and everything is working well; however, we noticed that a different page is cached each time a user arrives via a paid ad (i.e. the URL has something like utm=facebook&blah=1234 appended).

I have built a worker so that the page is fetched and cached as if the URL didn’t have the query string. I’m about 95% there, except for one really weird issue that I can’t figure out!

If I hit the page via curl -I https://mysite.com/tracer360 (note the -I for headers only), I get content-length: 0, and then when I visit the page in a browser (if it’s the same node), sure enough the page is blank!

However, if I visit the page in a browser first, everything caches normally (and content-length: 0 no longer appears in the headers); when I then run the curl command, content-length: 0 is gone.

Does anyone have any insight as to why this might be happening?

Here is the relevant code of my worker:

    addEventListener('fetch', event => {
      event.respondWith(handleRequest(event));
    });

    const someOtherHostname = "mysite.com";

    async function handleRequest(event) {
      const request = event.request;

      // Cloudflare's default cache for this zone
      const cache = caches.default;

      let response;

      // If the request is a POST request, do not store/retrieve from cache.
      if (request.method.toUpperCase() === 'POST') {
        // Always pull from the origin.
        response = await fetch(request);

        // Must use the Response constructor to inherit all of the response's fields
        response = new Response(response.body, response);

        // No caching.
        response.headers.set("Cache-Control", "max-age=0, no-store");

        return response;
      }

      // URL requested
      const cacheUrl = new URL(request.url);

      // Map to MySite
      cacheUrl.hostname = someOtherHostname;

      // Get the URL without the query string
      const urlWithoutQueryString = `${cacheUrl.protocol}//${cacheUrl.hostname}${cacheUrl.pathname}`;

      // Get this request from this zone's cache
      response = await cache.match(urlWithoutQueryString);

      if (!response) {
        // If not in cache, get it from the origin
        response = await fetch(request);

        // Must use the Response constructor to inherit all of the response's fields
        response = new Response(response.body, response);

        // Store the fetched response under the query-string-free URL.
        // Use waitUntil so computationally expensive tasks don't delay the response.
        event.waitUntil(cache.put(urlWithoutQueryString, response.clone()));
      }

      return response;
    }
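The cache-key normalization in the worker boils down to dropping the query string (and hash) from the URL. As a standalone sketch (the helper name is mine, not part of the worker):

```javascript
// Hypothetical helper: build a cache key with the query string (and hash)
// removed, mirroring the template-string approach used in the worker above.
function stripQueryString(rawUrl) {
  const u = new URL(rawUrl);
  return `${u.protocol}//${u.hostname}${u.pathname}`;
}

// Both the ad-tagged URL and the plain URL collapse to the same cache key:
//   stripQueryString('https://mysite.com/tracer360?utm=facebook&blah=1234')
//   stripQueryString('https://mysite.com/tracer360')
// both yield 'https://mysite.com/tracer360'
```

This is why a visitor arriving via a paid ad no longer creates a separate cache entry.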

For what it’s worth, here is the response via curl -I https://mysite.com/tracer360:

$ curl -I https://mysite.com/tracer360
    HTTP/2 200
    date: Thu, 27 Aug 2020 02:20:54 GMT
    content-type: text/html; charset=UTF-8
    content-length: 0
    set-cookie: __cfduid=db1dd92fa71e07a3da6a13a215c2ee3291598494854; expires=Sat, 26-Sep-20 02:20:54 GMT; path=/; domain=.mysite.com; HttpOnly; SameSite=Lax; Secure
    cf-ray: 5c9250e8a8bef0ee-IAD
    accept-ranges: bytes
    age: 4
    cache-control: s-max-age=604800, s-maxage=604800, max-age=60, max-age=2592000
    expires: Sat, 26 Sep 2020 02:20:49 GMT
    vary: Accept-Encoding
    cf-cache-status: HIT

And one where the page was viewed in the browser first:

$ curl -I https://mysite.com/tracer360
    HTTP/2 200
    date: Thu, 27 Aug 2020 02:22:54 GMT
    content-type: text/html; charset=UTF-8
    set-cookie: __cfduid=d11dc06f3afd0ef4b3ebfd38d79b753911598494973; expires=Sat, 26-Sep-20 02:22:53 GMT; path=/; domain=.mysite.com; HttpOnly; SameSite=Lax; Secure
    cf-ray: 5c9253d35e34ced0-IAD
    age: 3
    cache-control: s-max-age=604800, s-maxage=604800, max-age=60, max-age=2592000
    expires: Sat, 26 Sep 2020 02:22:50 GMT
    vary: Accept-Encoding
    cf-cache-status: HIT

It sounds like what’s happening is that your origin is returning an empty response for HEAD requests, causing the blank page to be shown and then cached for any other requests hitting the worker.
When you use -I, curl makes a HEAD request instead of a regular GET.

A couple of ways to fix this: make sure your origin server returns a regular response even for HEAD requests, or force the worker to make a GET subrequest regardless of which method the HTTP client used.
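The second fix could be sketched like this in the worker above (the helper name `toGetRequest` is mine, and note that behavior around HEAD bodies at the edge should be verified against the Workers runtime):

```javascript
// Hypothetical helper: rebuild an incoming request as a GET so that a HEAD
// (e.g. from curl -I) never produces, and never caches, an empty origin body.
function toGetRequest(request) {
  if (request.method.toUpperCase() !== 'GET') {
    // GET/HEAD requests carry no body, so only the URL and headers need copying.
    return new Request(request.url, { method: 'GET', headers: request.headers });
  }
  return request;
}

// In the worker, the origin fetch would then become:
//   response = await fetch(toGetRequest(request));
```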


Ah - good point! I will test this out and report back. Thank you!

@arunesh90 - To confirm you were 100% correct - thank you.

For those interested in the solution (I think this is a good approach): in the worker above, I simply added a check for whether the request is a GET. If so, store it in the cache; otherwise, don’t, and let Cloudflare do its thing…

    // If a GET request, store it.
    if (request.method.toUpperCase() === 'GET') {
        // Store the fetched response under the query-string-free URL.
        // Use waitUntil so computationally expensive tasks don't delay the response.
        event.waitUntil(cache.put(urlWithoutQueryString, response.clone()));
    }

This prevents the HEAD request from being stored with the cache key associated with that URL (which, as @arunesh90 noted, would have a length of 0).
