Get list of cached pages / assets via API?

developing_apps

#1

In the case that I’d like to build an app that needs to be able to crawl a list of all pages / assets on a site, is there an API endpoint where I can get this data?

E.g. I build an app for WordPress users who are using CloudFlare, and I need to be able to index text content and images from all article pages. Instead of querying a WordPress server, it would be great to call CloudFlare to get that data from cache.

Thanks


#2

That’s not a feature we provide. If the item is in the cache, querying it externally will retrieve it from the cache, not from the site’s WordPress server (unless you’re building a WordPress plugin, in which case the local server/database is probably the place you want to get that data from).


#3

Thank you. To clarify, in the cached CloudFlare data is there a schema/bucket/path concept that somehow maps to the site that’s using it?


#4

Not really, we’re a pull cache/proxy. We will serve content for a specific URI if we have it in cache at a particular POP; otherwise we’ll go to the origin for it. For non-static content, content which has expired and been purged from our cache, or content which is not cached due to its content type, we won’t have a record of it. And we don’t really index or map it, as there’s not really a reason to, since we are primarily concerned with the URI for an object.


#5

Gotcha, understood and thanks.

So if I have something like a sitemap of URIs, I could query CloudFlare for each and parse the results accordingly? Is there a way to request URI data in batch?

Can you link me to an example response for the data associated with a URI?


#6

Not really possible through Cloudflare APIs. We don’t treat cached data like an AWS storage bucket, for example, and it’s entirely possible that each colo would have different assets cached. If you made a request for example.com/whats-new/ and the customer had (for example) cache everything set for that URL, the page and its static assets would all be served from our cache if they were already cached. But we don’t have any type of index/map which would even allow you to say “purge all assets associated with the whats-new page”, as each asset, from the HTML page itself to each image on the page, is stored according to its unique URI and we (mostly) do (dumb) pattern matching on the web requests we receive…

Oh you wanted https://example.com/images/upload/december/cool-logo.gif?
Do I have that in my cache already?
Yes!
Is it expired?
No.
Here you go.

Since we’re pulling from the origin itself, we sort of don’t care what the overall site map/index looks like (unlike perhaps a push cache might). So there’s no way to really query our cache via API to see what objects are stored (there’d potentially be >100 different answers depending on what a given colo had).
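To make the pull-cache behaviour described above concrete, here is a minimal sketch in Python. It is an illustration only, not how Cloudflare is implemented: the `PullCache` class, the `fetch_origin` callback, the default TTL and the status strings are all hypothetical names chosen for the example. The point it shows is that lookups are keyed purely by URI, that a miss or an expired entry triggers a fetch from the origin, and that nothing in the structure groups URIs by page or site.

```python
import time
from typing import Callable, Dict, Tuple


class PullCache:
    """Toy pull cache for ONE colo, keyed only by the full URI (hypothetical)."""

    def __init__(
        self,
        fetch_origin: Callable[[str], bytes],
        ttl: float = 14400.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch_origin = fetch_origin  # called only on a miss or expiry
        self._ttl = ttl                    # seconds an entry stays fresh
        self._clock = clock                # injectable so expiry can be tested
        self._store: Dict[str, Tuple[bytes, float]] = {}  # uri -> (body, expires_at)

    def get(self, uri: str) -> Tuple[str, bytes]:
        """Return (status, body) where status is HIT, MISS or EXPIRED."""
        now = self._clock()
        entry = self._store.get(uri)
        if entry is not None and entry[1] > now:
            return "HIT", entry[0]
        # Not cached, or cached but stale: pull from the origin and store it.
        status = "MISS" if entry is None else "EXPIRED"
        body = self._fetch_origin(uri)
        self._store[uri] = (body, now + self._ttl)
        return status, body
```

If you create two instances (say one standing in for SFO and one for another colo), each fills independently based on the requests it happens to receive, which is why a single "list everything cached" answer would not exist.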

So really the only indicator of the cache status for an object comes from the particular colo you are querying, for a particular URI. Here’s the response for the Cloudflare homepage (just the HTML, with the headers edited for clarity):

curl -I https://www.cloudflare.com
HTTP/2 200
date: Tue, 21 Nov 2017 15:07:34 GMT
content-type: text/html; charset=utf-8
cf-cache-status: HIT
expires: Tue, 21 Nov 2017 19:07:34 GMT
cache-control: public, max-age=14400
server: cloudflare-nginx
cf-ray: 3c148f35bf338c58-SFO

In the response above, cf-cache-status indicates whether the object is cached (but only for SFO, the colo which served my request). On https://api.cloudflare.com/ you’ll see that we have a cache purge API where you could request that a particular item (or group of items) be purged, but there’s no API to list individual items.
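If you did want to check a list of URIs one at a time, a rough sketch in Python could look like the following. The helper names `cache_info` and `head_headers` are made up for this example; the only things taken from the response above are the `cf-cache-status` and `cf-ray` header names and the fact that the `cf-ray` value ends in the colo code. Treat the result as "what the colo that answered me had at that moment", not as a statement about the whole network.

```python
import urllib.request
from typing import Dict, Mapping, Optional


def cache_info(headers: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Extract cache status and serving colo from response headers."""
    # Header names are case-insensitive, so normalise before looking them up.
    lowered = {name.lower(): value for name, value in headers.items()}
    status = lowered.get("cf-cache-status")
    ray = lowered.get("cf-ray")
    # cf-ray looks like "3c148f35bf338c58-SFO"; the suffix is the colo.
    colo = ray.rsplit("-", 1)[1] if ray and "-" in ray else None
    return {"status": status, "colo": colo}


def head_headers(url: str, timeout: float = 10.0) -> Dict[str, str]:
    """Send a HEAD request (like `curl -I`) and return the response headers."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())
```

For a sitemap you would loop over the URIs and call `cache_info(head_headers(uri))` for each; there is no batch form of this, since each check is just an ordinary HTTP request.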

If I were building a WordPress app to index the content, I would probably look at communicating directly with the SQL database, as all the meat is really stored there and can be queried relatively inexpensively.
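As a rough idea of what that could look like, here is a sketch of the kind of query involved. It assumes the default `wp_` table prefix and the standard `wp_posts` columns; a real WordPress site uses MySQL/MariaDB and may use a different prefix, so the in-memory SQLite database here is only a stand-in so the example runs on its own, and `published_posts` and `demo_connection` are hypothetical helper names.

```python
import sqlite3
from typing import List, Tuple

# Assumes the default "wp_" prefix; adjust if the site uses another one.
QUERY = """
SELECT ID, post_title, post_content
FROM wp_posts
WHERE post_status = 'publish' AND post_type = 'post'
ORDER BY ID
"""


def published_posts(conn: sqlite3.Connection) -> List[Tuple[int, str, str]]:
    """Return (ID, title, content) for every published article."""
    return conn.execute(QUERY).fetchall()


def demo_connection() -> sqlite3.Connection:
    """Build a tiny stand-in for wp_posts so the query can be tried locally."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY, post_title TEXT, "
        "post_content TEXT, post_status TEXT, post_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO wp_posts VALUES (?, ?, ?, ?, ?)",
        [
            (1, "Hello", "<p>Hi</p>", "publish", "post"),
            (2, "Draft", "<p>WIP</p>", "draft", "post"),
            (3, "About", "<p>Us</p>", "publish", "page"),
        ],
    )
    conn.commit()
    return conn
```

With a MySQL driver you would typically run the same SQL through a cursor rather than calling `execute` on the connection directly.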

Does that make sense?


#7

Yes, thank you so much for all the detail, very helpful