
I have to wonder how a company at the scale of GitHub can be so bad at keeping track of their status.

Now 4 out of 10 services are marked as "Incident", yet most of the others are also completely dead.




It's because of the way most companies build their status dashboards. There are usually at least two dashboards: an internal one and an external one. The internal dashboard is the actual monitoring dashboard, hooked up to the various monitoring data sources. The external status dashboard is just for customer communication. Only after the outage/degradation is confirmed internally is the external dashboard updated, to avoid reacting to flaky monitors and alerts. The status also affects SLAs, so changing it requires multiple levels of approval; that's why there are some delays.
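Purely as an illustration of that gating (the class names, roles, and approval policy below are hypothetical, not how GitHub actually does it), a minimal Python sketch: raw alerts show up on the internal dashboard immediately, while the public status page only flips after human confirmation and sign-off.

    # Hypothetical sketch of the two-dashboard flow described above.
    from dataclasses import dataclass, field


    @dataclass
    class Alert:
        service: str
        detail: str
        confirmed: bool = False                 # set once on-call verifies it's real
        approvals: set = field(default_factory=set)


    REQUIRED_APPROVERS = {"incident-commander", "comms"}  # assumed sign-off policy


    class InternalDashboard:
        """Receives raw monitoring alerts with no gating at all."""
        def __init__(self):
            self.alerts: list[Alert] = []

        def ingest(self, alert: Alert):
            self.alerts.append(alert)
            print(f"[internal] {alert.service}: {alert.detail}")


    class ExternalStatusPage:
        """Only reflects an incident after confirmation and full approval."""
        def __init__(self):
            self.status: dict[str, str] = {}

        def maybe_publish(self, alert: Alert):
            if not alert.confirmed:
                return  # could be a flaky monitor; say nothing publicly yet
            if not REQUIRED_APPROVERS.issubset(alert.approvals):
                return  # SLA implications: wait for sign-off
            self.status[alert.service] = "Incident"
            print(f"[external] {alert.service} marked as Incident")


    if __name__ == "__main__":
        internal, external = InternalDashboard(), ExternalStatusPage()
        alert = Alert("git-operations", "error rate above threshold")

        internal.ingest(alert)          # visible internally right away
        external.maybe_publish(alert)   # nothing happens: unconfirmed

        alert.confirmed = True
        alert.approvals |= {"incident-commander", "comms"}
        external.maybe_publish(alert)   # only now does the public page change

The delay people see on the public page is exactly the gap between the first ingest() and the final maybe_publish() in a flow like this.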


> The external status dashboard is just for customer communication. Only after the outage/degradation is confirmed internally is the external dashboard updated, to avoid reacting to flaky monitors and alerts. The status also affects SLAs, so changing it requires multiple levels of approval; that's why there are some delays.

This defeats the purpose of a status dashboard and, from a consumer's point of view, makes it effectively useless in practice most of the time.


From a business perspective, I think that given the choice between lying a little bit and being brutally honest with your customers, lying a bit is almost always the correct choice.


My ideal would be a regulation requiring that downtime metrics be reported as a "suspected reliability issue" with at most a 10 to 30 minute delay.

If your reliability metrics have lots of false positives, that's on you and you'll have to write down some reason why those false positives exist every time.

Then that company could decide for itself whether to update manually with "not a reliability issue because X".

This way consumers avoid being gaslit, and businesses don't technically have to call it downtime.


Liability is their primary concern


This is intentional. It's mostly a matter of discussing how to communicate it publicly and when to flip the switch to start the SLA timer. Also, coordinating incident response during a huge outage is always challenging.


That it may be, but there's no excuse.

Declare an incident first, investigate later.

Cheating SLAs by delaying the incident is a good way to erode trust within and without.


> Declare an incident first, investigate later.

If that were the best way to deal with it, why is literally no one doing it this way, and what does that tell you?


because it involves admitting that you messed up, which companies are often disincentivized to do


False positives?



