Hi all, I’ve been monitoring the Cloudflare status page since 8 December 2019 (137 days ago as of now), and I’ve collected some interesting data on the uptime of colos.
The PoP with the most downtime is Brisbane, QLD, Australia, sitting at 55.2% uptime (seriously?!)
I have a Python script that parses the Cloudflare status page every 10 seconds and takes note of any changes. I then parsed the logs and built a nice Excel document from them. More specifically, I count downtime as the total time a colo spends as “re-routed”; I do NOT count “Degraded performance” as an outage.
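For anyone who wants to reproduce the numbers, here is a minimal sketch of the downtime accounting described above. It is not the actual script: the function name `uptime_percent`, the log format (a time-sorted list of `(timestamp, status)` change events for one colo) and the exact status strings are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical status label; the real status page wording may differ.
REROUTED = "re-routed"


def uptime_percent(events, start, end, initial_status="operational"):
    """Return uptime in percent for one colo over the window [start, end].

    events: time-sorted list of (datetime, status) change records.
    Only time spent as "re-routed" counts as downtime; any other status,
    including "degraded performance", counts as up.
    initial_status is an assumption about the state before the first event.
    """
    down = timedelta(0)
    status = initial_status
    since = start
    for ts, new_status in events:
        if ts < start:
            # Change happened before the window: just remember the state.
            status = new_status
            continue
        if ts > end:
            break
        if status == REROUTED:
            down += ts - since
        status = new_status
        since = ts
    # Account for the state that is still active at the end of the window.
    if status == REROUTED:
        down += end - since
    return 100.0 * (1 - down / (end - start))


if __name__ == "__main__":
    start = datetime(2019, 12, 8)
    end = start + timedelta(days=10)
    log = [
        (start + timedelta(days=2), "re-routed"),
        (start + timedelta(days=3), "degraded performance"),
        (start + timedelta(days=5), "re-routed"),
        (start + timedelta(days=6), "operational"),
    ]
    print(f"{uptime_percent(log, start, end):.1f}% uptime")
```

In the example log the colo is re-routed for two of the ten days, and the two days of degraded performance are deliberately not counted as an outage.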
This leads me to my main question: why are there 28 colos with <90% uptime? What is happening at these locations? Also, what the ■■■■ is happening in Brisbane? 61 days spent offline?
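As a quick sanity check on those figures, the uptime percentages can be converted back into days over the 137-day window. This is only back-of-the-envelope arithmetic; the helper name `downtime_days` is made up for the example.

```python
WINDOW_DAYS = 137  # 8 December 2019 up to the time of posting


def downtime_days(uptime_percent, window_days=WINDOW_DAYS):
    """Convert an uptime percentage into days of downtime over the window."""
    return window_days * (1 - uptime_percent / 100)


# Brisbane at 55.2% uptime
print(round(downtime_days(55.2), 1))  # 61.4
# The <90% uptime threshold
print(round(downtime_days(90.0), 1))  # 13.7
```

So 55.2% uptime works out to roughly 61 days re-routed, and any colo below 90% uptime has spent at least about two weeks re-routed in this period.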
But this is not measuring the actual uptime of a particular datacentre. It just aggregates the publicly available information on outages, right?
It is difficult to say how accurate that is, then. For example, today there was a brief apparent outage of a PoP in my vicinity, which was not listed on the status page at all.
It aggregates publicly available information on the status we report of our POPs as defined at the airport code level.
a. Cloudflare reports more transparently than any other company I’ve ever worked for.
b. A given city in many instances includes multiple datacenters.
c. Cloudflare is the most interconnected company on the planet (over 8,000 peering connections… more than AWS, Google or Netflix) and the internet is fragile… with 8k interconnects a broken link in a datacenter could lead to a degraded status.
d. Cloudflare runs an anycast network. If a datacenter is truly unavailable, traffic is routed to another colo. So a colo’s uptime <> service uptime.
e. Our criteria for degraded or rerouted are our own. If an ENT customer in OZ was paying us to deliver Aussie rules football streaming, and we decided to stop delivering traffic for a portion of pay-as-you-go customers out of the Sydney PoP during a popular match to ensure we had capacity for the football game, we might report that DC as degraded (hypothetical… I don’t make the decisions, but it seems perfectly reasonable to me that we might choose to do that).
… and I guess not all infrastructure across the world is equally reliable/available. Power outages, network failures and ‘other’ challenges can exist in greater numbers in some areas. Not deploying a colo there could pad an appearance of uptime/stability, but the value of Cloudflare increases for its customers the more we expand our network and the closer we can locate colos to the end users who consume the service(s).
Which is pretty damn accurate. When a PoP shows as re-routed, no traffic whatsoever will reach that PoP. I know personally because my local PoP is one of the ones that is always offline.