I wonder why the status just doesn't ping github.com for 200. That seems easy to...

bigiain · 2024-08-15T00:59:13 1723683553

To be fair - I really couldn't care less if the homepage is loading or not.

So long as I can fetch/commit to my repos, pretty much everything else is of secondary, tertiary, or no real importance to me.

(At work, I do indeed have systems running that monitor 200 statuses from client project homepages, almost all of which show better than 99.999% uptime. And they are practically useless. Most of them also monitor "canary" API requests, which I strive to keep at 99.99% but don't always manage to keep even at 99.9% - which is the very best and most expensive SLA we'll commit to.)
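To put those percentages in perspective, here is a rough sketch of the downtime each one allows. The function name and the 30-day period are assumptions for illustration, not anything taken from a real SLA.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 30) -> float:
    """Minutes of downtime permitted over the period at a given uptime percentage.

    Hypothetical helper: the 30-day default is an assumption, real SLAs
    define their own measurement window.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.9, 99.99, 99.999):
        print(f"{sla}% -> {allowed_downtime_minutes(sla):.3f} min per 30 days")
```

Over 30 days that works out to about 43 minutes at 99.9%, about 4.3 minutes at 99.99%, and about 26 seconds at 99.999%, which is why a single bad canary window can already break the tighter targets.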

fragmede · 2024-08-14T23:27:35 1723678055

From where? They don't only have one load balancer, so you'd still have the problem of the page showing green when it's not loading for some folk.

ergocoder · 2024-08-14T23:40:05 1723678805

At GitHub's scale, why wouldn't they put a ping monitor on every continent at least?

Then, you would show the status based on the continent.
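A minimal sketch of what I mean, assuming one or more probes per region that each report whether a plain GET returned 200. The names `probe` and `status_by_region` and the region labels are made up for illustration; this is not how GitHub's status page works.

```python
import http.client
import urllib.request
from collections import defaultdict


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to url answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Covers DNS failures, timeouts, refused connections and HTTP errors
        # (urllib raises HTTPError, an OSError subclass, for 4xx/5xx).
        return False


def status_by_region(results):
    """Collapse (region, ok) probe results into one status per region."""
    counts = defaultdict(lambda: [0, 0])  # region -> [ok_count, total_count]
    for region, ok in results:
        counts[region][0] += int(ok)
        counts[region][1] += 1

    status = {}
    for region, (ok_count, total) in counts.items():
        if ok_count == total:
            status[region] = "up"
        elif ok_count == 0:
            status[region] = "down"
        else:
            status[region] = "degraded"
    return status
```

Each probe would run from a machine in its own region and report something like `("eu", probe("https://github.com"))` to a collector that calls `status_by_region`. It only sees the routes the probes happen to take, so it is a coarse signal rather than a per-customer one.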

fragmede · 2024-08-15T04:09:29 1723694969

Where on the continent? GitHub is undoubtedly doing blackbox testing internally and has multiple such monitors, but that's not going to capture every customer's route to them, leading to the same problem: customers experience GitHub being down despite monitoring saying it's mostly up. Thus the impasse. Even doing whitebox testing, where you know the internals and can thus place sensors intelligently, even just for ingress, you're still at the mercy of the Internet.

If a sensor that's basically in the same datacenter says you're up, but the route into the datacenter is down, then what? Multiply this by the complexity of the whole site, and monitoring it all with 100% fidelity is impossible. Not that it's not worth it to try; there's a team at GitHub that works on monitoring. But beyond the motivation of keeping the SLA up: as a customer, unless you notice it's down, is it really down? In a globally distributed system, downtime, except for catastrophic downtime like this, is hard to define on a whole-site basis for all customers.

laserlight · 2024-08-15T05:35:34 1723700134

> 100% fidelity is impossible

I don't think anybody asked for 100% fidelity. We are talking about a complete outage that affected at least North America and Europe. If the status page shows green in such a case, its fidelity is around 50%. People expect better from GitHub.

fragmede · 2024-08-22T05:59:47 1724306387

The amount of moaning that the status page wasn't updated in 0 seconds and had the wrong status for entire minutes is what leads me to believe that no, users do expect 100% fidelity.

Total outages are rare enough, and there's enough other work, that spending time building a system for that just doesn't seem like the best use of their time. Though I'm biased, having faced that exact question from the inside, at a different company.

ergocoder · 2024-08-15T07:34:36 1723707276

> monitoring it all with 100% fidelity is impossible

This is impossible regardless of how godlike the design is... Nobody is asking for 100% fidelity.

intelVISA · 2024-08-15T03:11:31 1723691491

That would be self-defeating given that it's a Rails app.

tinyhitman · 2024-08-14T23:10:34 1723677034

delaying SLA

sebmellen · 2024-08-14T23:10:58 1723677058

This is at least a multi-million dollar payout (if they admit to it).

All GitHub Pages say

> We're having a really bad day.

> The Unicorns have taken over. We're doing our best to get them under control and get GitHub back up and running.

ljahier · 2024-08-14T23:58:47 1723679927

At the moment, all GitHub services seem to be restored, yet the GitHub status page indicates that the problem is still ongoing. I don't think it's related to the SLA, but rather to the monitoring, which is not live. There are a few minutes of delay.

cbates · 2024-08-14T23:24:47 1723677887

Seems slightly unprofessional for a massive company like GitHub/Microsoft.

xp84 · 2024-08-14T23:38:58 1723678738

I disagree. This hurts no one, and not everything needs to be sanitized and painted over with bland corporatespeak.

majewsky · 2024-08-15T08:57:42 1723712262

I don't think they were asking for corporate speak. But at least I would find a plain technical error message like "cannot contact file server" much more respectable than something like "unicorns are hugging our servers uwu".

COMMENT___ · 2024-08-15T18:30:38 1723746638

This “ironic” and “humorous” style of errors and UI captions is the actual new corporate speak. I’d prefer dumb error messages rather than some shit someone over the ocean thinks is smart and humorous. And it’s not funny at all when it’s a global outage impacting my business and my $$$.

colimbarna · 2024-08-14T23:51:43 1723679503

It's closer to the truth than you usually get. They're having a bad day, it's completely true. It's the start of my day, but I guess this is the middle of the night for them. There's no such thing as unicorns, but that just highlights the metaphorical nature of the remaining claim - getting Unicorns under control means solving their problems. Normally "professional" corporate speak means avoiding saying anything whose meaning is plain on its face and disconfirmable while avoiding the implication that the company is run and operated by humans. This is a model. (Obviously they came up with the message in advance, which just goes to show that someone in the company is well enough rounded to know that if it is displayed, they're having a bad day.)

wrs · 2024-08-15T01:11:16 1723684276

GitHub is (was?) a Rails application, so it was probably originally running behind Unicorn [0], if it isn’t still. So the unicorns are (were) real.

[0] https://en.wikipedia.org/wiki/Unicorn_(web_server)