Hi Cloudflare folks and Community, I am relatively new to using the service but today has not been a good day.
After a 10+ hour outage of image uploads, the status of the issue was changed to “This incident has been resolved.” Unfortunately, it’s very much not resolved: 80%+ of my image uploads fail with a 5559 error, and the web interface on the Cloudflare site is failing with the same error.
So a couple of questions to the Community (and anyone from Cloudflare, if you read this forum):
Is it typical for Cloudflare to provide misleading reporting on outages?
How common are such issues in your experience?
How likely are we to see a post-mortem on this issue?
Agree that the length of the outage and especially the lack of acknowledgement is alarming (for Pages as well).
I have noticed a higher-than-expected number of incidents relating to Pages in the time I’ve been using CF, but I don’t typically follow the incident reports too closely so I can’t comment on #1. Re: #3 I would love to see a post mortem on this but I think it’s unlikely as CF only does a handful of post-mortems per year (despite many more incidents than that occurring).
This is the first time I have noticed it in ~6 months of using CF. Sometimes CF status takes a while to reflect an actual outage, but I have never seen anything like this. The actual outage lasted about 18 hours, but CF is “only” reporting the current outage as 9 hours old.
An 18h outage for a major provider is catastrophic, but the worst part was having no communication, be it here or through their support.
We started working on a migration to AWS during the night, as no communication or progress was being shared on an issue that was bringing our production down. Although it seems everything is now back online, we will still go ahead and finish the migration, because we just can’t do business with CF under those conditions, plus other problems we have had with them unrelated to this topic.
So, to summarize: the issue seems to be resolved as of now. The last time I saw errors in my monitoring was roughly half an hour ago, making this a ~22 hour outage.
I don’t have exact timing, but I recall it took the status page about half an hour to reflect the issue after I started seeing all my uploads failing.
After a 10 hour outage the issue was declared fixed and the Images service was showing its status as green, but it had not been fixed: https://www.cloudflarestatus.com/incidents/64wsqjxnljn3
Within an hour another incident was opened, but it did not acknowledge the issue with Images for another 3 hours: https://www.cloudflarestatus.com/incidents/msghx93bcjjd
The issue seems to be fully resolved as of half an hour ago.
A couple of thoughts based on this:
I know there are no SLO/SLA expectations for non-Enterprise clients, but dropping below 99.8% is a big OOOF for me.
The actual state of the service was not communicated accurately during the outage. I can only see 3 possible reasons, and none of them is a good one: either Cloudflare’s monitoring is insufficient to see such global issues, the actual scope of the outage was obscured, or no one bothered to communicate in a manner sufficient for the people who rely on this information (i.e. developers like me).
There has been no proactive communication. Cloudflare reps seem eager to talk to me about the Enterprise plan, so perhaps they could have reached out after something like 6 hours of outage? Or sent an automated email? Good thing I have monitoring in place for everything, but on a Sunday I would assume many people got hit by this in a much more confusing way.
I have also not seen any official communications in the forum that addressed the issue.
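For anyone who wants to sanity-check the 99.8% figure, here is a rough back-of-the-envelope sketch. It assumes the ~22 hour outage estimate from this thread and a 365-day measurement window; both the window and the helper names are my own assumptions for illustration, not anything Cloudflare publishes.

```python
# Rough availability arithmetic for a single outage.
# Assumptions (not from Cloudflare): the ~22 hour outage estimate from this
# thread, and a 365-day window. A 30-day window is shown for comparison.

def availability_percent(outage_hours: float, window_hours: float) -> float:
    """Percentage of the window during which the service was up."""
    return 100.0 * (1.0 - outage_hours / window_hours)


def downtime_budget_hours(target_percent: float, window_hours: float) -> float:
    """Hours of downtime allowed in the window while still meeting the target."""
    return window_hours * (1.0 - target_percent / 100.0)


OUTAGE_HOURS = 22.0       # estimate from monitoring, see above
YEAR_HOURS = 365 * 24     # 8760
MONTH_HOURS = 30 * 24     # 720

if __name__ == "__main__":
    print(f"Yearly availability:  {availability_percent(OUTAGE_HOURS, YEAR_HOURS):.2f}%")
    print(f"Monthly availability: {availability_percent(OUTAGE_HOURS, MONTH_HOURS):.2f}%")
    print(f"Yearly budget at 99.8%: {downtime_budget_hours(99.8, YEAR_HOURS):.2f} h")
```

Under those assumptions, a 99.8% yearly target leaves about 17.5 hours of downtime, so a single ~22 hour outage already uses more than the whole year's budget.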
My heart goes out to all the SREs in the war room, but as a company - what a mess.