Spotify, Snapchat, and eBay are the examples I usually give when asked that question.
I'm not sure what numbers I am allowed to provide for any of them, but there's some public information available that gives you a sense of the scale involved:
We host a high-traffic site on Google's infrastructure. Since you can see the version number of the platform, it's obvious when they're rolling out changes. This has caused many partial outages that lasted until the change was (I assume) automatically rolled back.
It's a little hard to take this advice from Google after being the victim of so many bad rollouts. Because we use a lot of their services, we are far more likely to run into problems. We always seem to be the canary.
Not mentioned in this blog post are rollout-related outages.
It is common for a system to work fine before and after a rollout, but during a rollout clients experience errors.
Imagine downloading a big file that takes an hour, for example. If you are downloading it from http-server-v1, which is being upgraded to http-server-v2, there is typically a small grace period for clients of v1 to finish their operations. In many datacenters that grace period is only around 30 seconds, so if your operation is long-running you will see a failure. The error code is usually HTTP 503, for which your client logic should retry the request with exponential backoff.
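To make that concrete, here is a rough sketch of that retry logic in Python (assuming the third-party requests library and a made-up URL; a real client would usually also add jitter and a cap on the total wait):

    import time
    import requests

    def fetch_with_backoff(url, max_attempts=5, base_delay=1.0):
        # Retry on HTTP 503 with exponential backoff: 1s, 2s, 4s, ...
        for attempt in range(max_attempts):
            resp = requests.get(url)
            if resp.status_code != 503:
                resp.raise_for_status()
                return resp.content
            # Server is mid-rollout or draining connections; wait and try again.
            time.sleep(base_delay * (2 ** attempt))
        raise RuntimeError("still getting 503 after %d attempts" % max_attempts)

    data = fetch_with_backoff("https://example.com/big-file")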
If your client doesn't retry/resume the request, you see the service as down, when in fact the error is by design. It will happen on every release, but also when servers come and go for maintenance, or for a bunch of other reasons.
Good libraries will handle retries for you, but some don't do it properly, and that's a bug.
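For example, in Python the requests library can be told to do this for you via urllib3's Retry helper (again just a sketch, with a made-up URL):

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    session = requests.Session()
    # Retry up to 5 times on 503, with exponential backoff between attempts.
    retries = Retry(total=5, backoff_factor=1, status_forcelist=[503])
    session.mount("https://", HTTPAdapter(max_retries=retries))

    resp = session.get("https://example.com/big-file")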
Sure, but I'm pointing out that this strategy relies on real customers encountering an error. I caution people not to forget that this is a failure for those of us trying to ensure reliable websites.