
I have to wonder how a company at the scale of GitHub can be so bad at keeping track of their status.

Now 4 out of 10 services are marked as "Incident", yet most of the others are also completely dead.




It's because of the way most companies build their status dashboards. There are usually at least two dashboards: an internal one and an external one. The internal dashboard is the actual monitoring dashboard, hooked up to the various monitoring data sources. The external status dashboard is just for customer communication. Only after the outage/degradation is confirmed internally is the external dashboard updated, to avoid reacting to flaky monitors and alerts. The status also affects SLAs, so changing it requires multiple levels of approval; that's why there are some delays.
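Purely as an illustration of that gating (the class names, roles, and approval policy below are hypothetical, not how GitHub actually does it), a minimal Python sketch: raw alerts show up on the internal dashboard immediately, while the public status page only flips after human confirmation and sign-off.

    # Hypothetical sketch of the two-dashboard flow described above.
    from dataclasses import dataclass, field


    @dataclass
    class Alert:
        service: str
        detail: str
        confirmed: bool = False                 # set once on-call verifies it's real
        approvals: set = field(default_factory=set)


    REQUIRED_APPROVERS = {"incident-commander", "comms"}  # assumed sign-off policy


    class InternalDashboard:
        """Receives raw monitoring alerts with no gating at all."""
        def __init__(self):
            self.alerts: list[Alert] = []

        def ingest(self, alert: Alert):
            self.alerts.append(alert)
            print(f"[internal] {alert.service}: {alert.detail}")


    class ExternalStatusPage:
        """Only reflects an incident after confirmation and full approval."""
        def __init__(self):
            self.status: dict[str, str] = {}

        def maybe_publish(self, alert: Alert):
            if not alert.confirmed:
                return  # could be a flaky monitor; say nothing publicly yet
            if not REQUIRED_APPROVERS.issubset(alert.approvals):
                return  # SLA implications: wait for sign-off
            self.status[alert.service] = "Incident"
            print(f"[external] {alert.service} marked as Incident")


    if __name__ == "__main__":
        internal, external = InternalDashboard(), ExternalStatusPage()
        alert = Alert("git-operations", "error rate above threshold")

        internal.ingest(alert)          # visible internally right away
        external.maybe_publish(alert)   # nothing happens: unconfirmed

        alert.confirmed = True
        alert.approvals |= {"incident-commander", "comms"}
        external.maybe_publish(alert)   # only now does the public page change

The delay people see on the public page is exactly the gap between the first ingest() and the final maybe_publish() in a flow like this.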


> The external status dashboard is just for customer communication. Only after the outage/degradation is confirmed internally is the external dashboard updated, to avoid reacting to flaky monitors and alerts. The status also affects SLAs, so changing it requires multiple levels of approval; that's why there are some delays.

This defeats the purpose of a status dashboard and, from a consumer's point of view, makes it effectively useless in practice most of the time.


From a business perspective, I think that given the choice between lying a little bit and being brutally honest with your customers, lying a bit is almost always the correct choice.


My ideal would be a regulation requiring that downtime metrics be reported as a "suspected reliability issue" with at most a 10 to 30 minute delay.

If your reliability metrics have lots of false positives, that's on you and you'll have to write down some reason why those false positives exist every time.

Then that company could decide for itself whether to update manually with "not a reliability issue because X".

This way consumers avoid being gaslit, and businesses don't technically have to call it downtime.


Liability is their primary concern


This is intentional. It's mostly a matter of discussing how to communicate it publicly and when to flip the switch to start the SLA timer. Also, coordinating incident response during a huge outage is always challenging.


That it may be, but there's no excuse.

Declare an incident first, investigate later.

Cheating SLAs by delaying the incident is a good way to erode trust within and without.


> Declare an incident first, investigate later.

If that were the best way to deal with it, why is literally no one doing it this way, and what does that tell you?


because it involves admitting that you messed up, which companies are often disincentivized to do


False positives?



