Operating a large distributed system in a reliable way: practices I learned (pragmaticengineer.com)
378 points by gregdoesit on July 17, 2019 | 61 comments



I would add: you should start your monitoring with business metrics. Monitoring low-level things is good to have, but putting the whole emphasis on it misses the point. You should be able to answer at any point in time whether users are having a problem, what the problem is, how many users are affected, and what they are doing to work around it.

In other words, when a person is in the ER, doctors look for heartbeat, temperature, ..., not for some low-level metric like how many grams of oxygen a specific cell consumes.


Yes, in the ER, doctors look at things like heart rate, respiration rate, and temperature. But they also draw blood for electrolytes, glucose, creatine kinase, etc. since those can detect underlying problems which the body is compensating for.

A well-designed distributed system is going to be able to handle a certain failure rate in its components because requests will be retried automatically. If a component's failure rate increases from 0.01% to 0.1%, there will probably not be any user-visible impact... but if you can detect that increase, you might be able to correct the underlying issue before that component's failure rate increases to 1% or 10% or 100% -- at which point no amount of retrying will avoid problems.
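To make that concrete, here's a toy sketch (Python, with made-up component names and thresholds) of the kind of leading-indicator check I mean: flag a component whose failure rate has jumped well above its baseline even though retries still hide it from users.

  # Hypothetical leading-indicator check: alert on a big relative jump in a
  # component's failure rate long before it becomes user-visible.
  def failure_rate_alerts(stats, baseline=0.0001, jump_factor=10):
      """stats: dict of component -> (failed_requests, total_requests)."""
      alerts = []
      for component, (failed, total) in stats.items():
          if total == 0:
              continue
          rate = failed / total
          # Retries absorb this today; a 10x jump over baseline is a ticket
          # before it becomes a 100x jump and a page.
          if rate >= baseline * jump_factor:
              alerts.append((component, rate))
      return alerts

  print(failure_rate_alerts({"payments-db": (12, 10_000), "geo-index": (1, 50_000)}))
  # -> [('payments-db', 0.0012)]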


I'm not saying you should not do that, I'm saying your focus should not be there. When things go south, your CEO is rarely interested in CPU utilization or error rates. The questions they usually ask: How bad is it? What is the user impact? Is it all hands on deck, or is it just a glitch? When it is cascading, which system do we fix first? What's the root cause? Component failure rate just doesn't have enough context.

And yes, distributed systems are hard, because it is inherently hard to reason about what will happen when something changes/fails.

I'm not from Uber, but I recently worked at (was responsible for a huge chunk of infra at) a company that is bigger and has more products running, and I saw some hilarious failures. And what is different is that it was rarely a bad code push, and when it was, due to the nature of the business, it was sometimes really hard to roll back.


You're missing the forest for the trees. The deeper metrics are important for diagnosis. The business metrics are what you care about.

I'd phrase your advice differently as well: monitor the contact points. Monitor where the rubber meets the road. Monitor where two components intersect, be it between teams working on frontend and backend components, between your application and its database, or down at the low-level metrics of the system API. I've found many interesting problems and averted several potential crises because I saw that metrics like request counts looked different between the client and the server, or between the system metrics and my application's, etc. Contact points are what's important, and it just happens that every application (figuratively) has a contact point with the OS.
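As a sketch of what I mean (Python, made-up numbers): compare the same traffic as counted on both sides of a boundary and flag divergence.

  # Hypothetical contact-point check: the client library and the server should
  # roughly agree on how many requests crossed the boundary this minute.
  def diverged(client_count: int, server_count: int, tolerance: float = 0.01) -> bool:
      """Return True if the two sides of the contact point disagree too much."""
      if max(client_count, server_count) == 0:
          return False
      gap = abs(client_count - server_count) / max(client_count, server_count)
      return gap > tolerance

  # Client says it sent 102,000 requests, server says it received 96,500:
  # a ~5% gap, worth investigating (retries? dropped connections? a bad LB?).
  print(diverged(102_000, 96_500))  # True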


But that's my point: The deeper metrics aren't merely useful for diagnosis after the business metrics go sideways; they can be useful as leading indicators, to warn you before problems reach the point of affecting the business metrics which you care about.


The deep metrics, as you call them, become increasingly irrelevant the larger the system gets, at least for the use case you mention.

To put it unscientifically, something always goes wrong, but that does not mean the SLA is violated. As a corollary, you will be alerted constantly about something that might go wrong because of some signal that someone thought might lead to some outage.

At face value it seems proactive, but it's really not. Better to spend the time actively making the system more reliable (e.g. look at your or someone else's postmortems, or do premortem exercises).


On the contrary, low level metrics become more important the larger a system gets, since the quantity of low probability events (e.g. cosmic ray bit flips) increases.

Without those "irrelevant" low-level metrics, a large, growing distributed system is not only flying blind but crashing more and more often.


You are just chasing your own tail by looking at those metrics. Your system needs to be resilient to these random events and you must account for them, that's for sure. But you should not put your focus on them; it's still the higher-level alerts that are more relevant to the business.

Crashing more often is caught by an alert that looks at total capacity in a service. At first it doesn't matter whether it's random bit flips or OOMing nodes. Longer term these metrics can be useful to increase efficiency again (should I first fix random bit flips or OOMing tasks?). I would not base SLA-relevant alerts on signals that are too low-level.


This is simply wrong. Quality starts in the basement and works its way up. So if you monitor a system at the lowest level, you will be able to build confidence about those layers, and you will spot trends long before they become apparent at higher levels, because those higher levels will mask the fact that at lower levels things are already going wrong.

Systems that only monitor the highest levels appear to function fantastically well right up to the moment they crash spectacularly. Then the forensics will show you that at lower levels there were plenty of warning signs telling you that the system was headed for the cliffs, and a large number of those signs were apparent while there was still time to do something about them. Reliable systems engineering is not something you can do just at the highest levels, because of the built-in resilience of the intermediary levels. This is counter-intuitive but borne out in countless examples of systems that look robust but aren't versus systems that really are robust.


After reading this thread, you've all convinced me there are good arguments for having different levels of metrics. I think we can better understand the problem if we introduce the word "priority".

The priority is always the customer. The focus can change given the circumstances and the context of each issue. Sometimes the focus must be high-level, sometimes it requires a low-level focus, but the priority is always the customer.


This is a good guide. One thing I'd add:

While you're monitoring for traffic/errors/latency, throw in a minimum success rate. Make a good estimate of how many successful operations the monitored systems will do per minute during the slowest hour of the year, and put in an alert if the throughput drops below that. You'd be surprised how many 0-errors/0-successes faults happen in a complex system, and a minimum-throughput alarm will catch them.
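Something like this sketch (Python; the floor value is a made-up example you'd derive from your own historical lows):

  # Hypothetical minimum-throughput alarm: fire even when the error count is
  # zero, because "0 errors / 0 successes" usually means traffic silently stopped.
  MIN_SUCCESSES_PER_MINUTE = 50  # estimate from the slowest hour of the year

  def min_throughput_alarm(successes_last_minute: int) -> bool:
      return successes_last_minute < MIN_SUCCESSES_PER_MINUTE

  print(min_throughput_alarm(0))    # True  -> page: traffic has silently stopped
  print(min_throughput_alarm(480))  # False -> healthy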


In some systems it's also nice to have a "canary" acting as the user to call your API every minute or so. Then even during low / no traffic hours you can still catch errors/outages/etc.
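Roughly something like this (Python, placeholder URL), run from outside your serving path:

  # Hedged sketch of a synthetic check: hit an endpoint once a minute and record
  # the result, so low-traffic hours still produce a signal.
  import time
  import urllib.request

  PROBE_URL = "https://example.com/api/health"  # placeholder endpoint

  def probe_once(timeout: float = 5.0) -> bool:
      try:
          with urllib.request.urlopen(PROBE_URL, timeout=timeout) as resp:
              return resp.status == 200
      except Exception:
          return False

  while True:
      ok = probe_once()
      print({"probe": PROBE_URL, "ok": ok, "ts": time.time()})  # ship to your metrics system
      time.sleep(60)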


That's a prober, not a canary.


Sometimes things can be described in more than one way. Canary, readiness probe, health check, heartbeat, etc.

It makes communication more efficient and enjoyable when all parties are aware of this.


Yes, there are many words in the space. But there are also many classes of systems. It makes communication more efficient and enjoyable if all parties can think of the same thing when hearing a word.


Can anyone recommend MOOCs and/or university courses (open syllabus) covering Distributed Systems?


Yeah, I just asked a similar question on HN and didn't get many responses, but one overwhelming book rec was "Designing Data Intensive Applications"

Basically a high-level guide through modern architectures, frameworks, and database designs. So far, my takeaway has been learning which tool would be useful for certain types of data engineering, not the details of how to write code with it.

Edit - link: https://news.ycombinator.com/item?id=20417801


I second that book. I got a pre-release digital copy a couple of years back and it's awesome. The physical copy is on my desk. I should make time to re-read it :)


Some links to related resources (repos, articles, books, courses) can be found here:

- https://github.com/mxssl/sre-interview-prep-guide

- https://github.com/theanalyst/awesome-distributed-systems

One linked resource for example:

- https://github.com/aphyr/distsys-class


I like Effective Monitoring and Alerting[1] if you’re looking for operating a distributed system. It provides next level detail for this article.

  [1] http://shop.oreilly.com/product/0636920025986.do


Unfortunately there don't seem to be good books out there, or I don't know about them. The "Data Intensive Applications" book is highly praised but focused on databases. The other O'Reilly book, "Designing Distributed Systems", is solely about Kubernetes. "Scalable Internet Architecture" is trash (sorry).

The best source for distributed computing I found is Facebook's engineering blog.


Also curious! I've looked in the past and had a lot of trouble finding one. A book recommendation would also be very helpful!


Just go work as an SRE at Google for a couple years.


Or better, work at AVL for a few years. Their large distributed system is real-time, and on failures people might die. Especially in the Formula 1 department, where everything is 10x faster and larger, with many more sensors.

On normal complicated control systems, such as in an airplane, you do have plenty of time to complete your task in the given timeframe, but in F1 there's not much time left, and you have to optimize everything: hardware, protocols, and software. Much harder than in gaming or at scale at Google/Facebook/Amazon.

E.g. you have to convert your database rows from write-optimized to read-optimized on demand; a simple fast database cannot do both fast enough. A normal FireWire protocol is not fast enough; you have to split it up into trees. Ditto a normal CAN protocol; you have to multiplex it. You have to load logic into the sensors to compress and filter the data, to help transmit the huge amount of it.
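As a toy illustration of the row-to-column idea (Python, made-up sensor names; not how any of these systems actually do it):

  # Write-optimized: append rows as they arrive. Read-optimized: regroup the same
  # data per sensor so analysis scans one tight array instead of scattered rows.
  from collections import defaultdict

  write_log = [
      {"ts": 1, "sensor": "brake_temp", "value": 412.0},
      {"ts": 2, "sensor": "brake_temp", "value": 415.5},
      {"ts": 2, "sensor": "rpm", "value": 11800},
  ]

  def to_columns(rows):
      cols = defaultdict(lambda: {"ts": [], "value": []})
      for row in rows:
          cols[row["sensor"]]["ts"].append(row["ts"])
          cols[row["sensor"]]["value"].append(row["value"])
      return dict(cols)

  print(to_columns(write_log)["brake_temp"])  # {'ts': [1, 2], 'value': [412.0, 415.5]}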


Have you considered that:

1) Maybe not everyone belongs to the 1% that can pass the interviews?

2) Maybe not everyone lives in a place that has a Google office where development is happening?

Sorry, but you sound terribly arrogant here, pretending that it's just an option everyone has and that the whole world lives in the same place you do.


Nice article. The amount of in-house procedures/tooling developed by the backend team seems impressive (maybe some not-invented-here syndrome, but I can't really judge). What I am astonished at, though, is that the backend part of Uber seems so professional while the Uber Android app feels like it's built by 2 junior outsourced devs. I have rarely used an app that felt so buggy and awkward. (Aside from regular crashes, e.g. when registering for Jump bikes in Germany inside the app, I had to restart the app to get the corresponding menu item to appear.)


Really? That's a bit surprising to me as I've always thought their app was slick, easy to use, and responsive. Out of curiosity what device are you on?


Huawei P10 Lite. If that is the explanation, I would actually be relieved; as said, it didn't fit my mental model that they wouldn't have a super-slick app with so many dev resources.


The square "dot" in the loading indicator is not centered...


You need to have a strategy for backward (and forward) compatibility for your components. If the environment is large enough, you don't know exactly which component is running which version of the code, as some part of your system is constantly being upgraded (or held back). This includes extra (parameters to) RPC calls, data type evolution, and schema evolution. Without a decent strategy you'll be in over your head quickly. (Tip: a version number for your API as part of the API, v0.0.1, isn't going to be enough.)
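One common ingredient of such a strategy is the tolerant-reader pattern; a hedged sketch (Python, hypothetical field names), not a complete answer to schema evolution:

  # Tolerant reader: missing fields get safe defaults (old senders keep working),
  # and unknown fields are ignored rather than rejected (new senders keep working).
  def parse_trip_request(payload: dict) -> dict:
      return {
          "rider_id": payload["rider_id"],                     # required since v1
          "pool_allowed": payload.get("pool_allowed", False),  # added later, defaulted
          "surge_ack_version": payload.get("surge_ack_version", 1),
      }

  old_client = {"rider_id": "r-1"}
  new_client = {"rider_id": "r-2", "pool_allowed": True, "future_field": "ignored"}
  print(parse_trip_request(old_client))
  print(parse_trip_request(new_client))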


Interesting, although it is on the light side. For example, it doesn't talk about chaos testing, defining effective and comprehensive metrics (KPIs), alert noise, or running services like databases in active-active (hot-hot) mode.


Good read. There are a few things that I'd throw on top as important;

- Active monitoring

- Chaos testing

- Cold start testing


> I like to think of the effort to operate a distributed system being similar to operating a large organization, like a hospital.

Clearly never worked for a hospital. Hospitals need good engineers (and often don’t have them). Our ‘nines’ are embarrassing...


Are you referring to medical operations, or IT operations, in hospitals? I think he is referring to medical operations, where I would expect the relevant professions to be doctors and nurses, not engineers.


The point stands. I've been involved in big hospital management and at a FAANG. Hospitals are a horror show if you see behind the curtain in both respects, and others.


Medicine works not because we have learned to scale it in any sense but, on the contrary, because we still rely on individual physicians and nurses to provide care. The HR system of a large hospital is probably better/bigger than a small one's in terms of onboarding doctors, but the care at a large center is only as good as the individual physician or nurse caring for patients. Bigger, academic centers are probably (very slightly) better (though this is an active debate, with recent literature in high-profile journals on both sides), but if they are, I suspect it's because they can recruit better individual doctors. This is very different from saying that their scale provides them some competitive advantage in providing care.


Actually, large hospital systems tend to hire one or two systems engineers to be part of the QI department. But yeah, most QI is done by front-line staff.


QI department?


Quality improvement.


Most of this advice applies to small non-distributed systems too.


I find it problematic that this recommends the Five Whys to get to "the root cause". Haven't we collectively moved past that?


Would you care to explain why you find that problematic?


See this post by John Allspaw: https://www.oreilly.com/ideas/the-infinite-hows

> The Five Whys, as it’s commonly presented, will lead us to believe that not only is just one condition sufficient, but that condition is a canonical one, to the exclusion of all others.

Five whys presents itself as a way to dig deep but promotes doing so linearly and getting to a singular thing you can fix, hiding a lot of potential learnings along the way. Thinking about contributing factors is a much more powerful framework.


I find it useful: you'll find at least one problem deep down. The more surface-level problems will get worked on anyway. In any case, the point is to make your system more robust over time.


The notion of a "root cause" itself is flawed. It's causes acting in concert that create problems, and they all enable each other.


Did you do all these things by yourself?

Really great content, but I was really taken aback by "I" used everywhere. Maybe it's a new thing that I am not hip to and ought to try: "I built and ran transaction processing software for Bloomberg! This is what I learned!"

But perhaps you really did do all that by yourself; in that case, sorry that I doubted you. It looks like a lot.


I don't really agree that this list could have come about through discussions with engineers at Google, Facebook, etc. The more computers you have, the less important it becomes to monitor junk like CPU and memory utilization of individual machines. Host-level CPU usage alerting can't possibly be a "must-have" if extremely large distributed systems operate without it.

If you've designed software where the whole service can degrade based on the CPU consumption of a single machine, that right there is your problem and no amount of alerting can help you.


I work at a FAANG, and host-level CPU is most definitely an alert we page on. Though a single host hitting 100% CPU isn't really a problem in and of itself (our SOP is just to replace the host), it's an important sign to watch for other hosts becoming unhealthy. It might be overkill, but hey, there's mission-critical stuff at hand.

For example: if you have a fleet of hosts handling jobs with retries, a bad job could end up being passed host to host, killing or locking up each host as it gets passed along. And that could happen in minutes, while replacing, deploying, and bootstrapping a new host takes longer. So by the time your automated system detects, removes, and spins up a new host, everything is on fire.
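The usual guard against that particular failure mode (a hedged sketch in Python, hypothetical policy numbers) is to cap retries per job and quarantine the offender instead of re-queueing it forever:

  # Poison-pill guard: after MAX_ATTEMPTS the job goes to a dead-letter list for
  # a human to inspect, instead of hopping to (and killing) the next host.
  MAX_ATTEMPTS = 3

  def handle(job, attempts: dict, queue: list, dead_letter: list, process) -> None:
      attempts[job["id"]] = attempts.get(job["id"], 0) + 1
      if attempts[job["id"]] > MAX_ATTEMPTS:
          dead_letter.append(job)
          return
      try:
          process(job)
      except Exception:
          queue.append(job)  # retry later; the attempt counter keeps ticking

  # Tiny demo: a job whose handler always raises ends up in dead_letter.
  attempts, queue, dead_letter = {}, [], []
  bad_job = {"id": "job-42"}
  for _ in range(4):
      handle(bad_job, attempts, queue, dead_letter, process=lambda j: 1 / 0)
  print(len(dead_letter))  # 1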


Could you mention which FAANG, so I can avoid applying for a job there? Large-scale software systems _must_ be designed to serve through local resource exhaustion. If you are paging on resource exhaustion of a single host, you are just paying the interest on your technical debt by drawing down your SREs' quality of life.

I stand by my beef with this article. The statement that "I've talked with engineers at Google [and concluded that a thing Google wouldn't tolerate is a must-have]" doesn't make sense. What I get from this article is that you can talk with engineers at Google without learning anything.


I'm not at liberty right now to name my employer, but our systems are definitely designed to serve through local resource exhaustion. But we aren't talking about cheap hosts here. We generally run high-compute-optimized or high-memory-optimized hosts depending on the use case, and if these generally powerful hosts hit 100% CPU or full memory utilization there's usually more going on than something random or simple, so it's important to have someone check it out.


A single host stuck at 100% CPU also has a nasty effect on your tail latency, in a system with wide fanout. If a request hits 100 backend systems, and 1 of them is slow, your 99th percentile latency is going to go in the toilet.
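The back-of-the-envelope math (Python; the 1% figure is just an illustrative assumption) shows how brutal the fanout effect is: per-backend tail latency effectively becomes user-facing median latency.

  # If each of 100 backends is slow on just 1% of its own requests, the chance a
  # fanned-out request avoids every slow backend is 0.99**100 ~= 0.37, so roughly
  # 63% of user requests wait on at least one slow shard.
  p_slow, fanout = 0.01, 100
  p_any_slow = 1 - (1 - p_slow) ** fanout
  print(round(p_any_slow, 3))  # 0.634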


Which is a good reason to hedge and replicate but NOT a reason to alert on high CPU usage of single computers.


You definitely want to TRACK cpu usage on individual hosts, but, yeah, I would alert on service latency instead. Symptom, not cause.


This very much depends on the kind of software system. If there is parallel orchestration going on, such as join operators in a scale-out database, the performance of a single machine in the cluster can impact the performance of the entire cluster. In fact, the software will often monitor this itself so that it knows when and where to automatically shed load.


I'd say you should always have CPU monitored, but I get that you might not care to aggressively alert on it. It can be invaluable for hunting down root causes after the fact: nothing's perfect from the first deployment. A single bad host is best if it crashes, but it is a lot more dangerous if it's just wonky.

Things like CPU hopefully shouldn't be your key/gold service-up metric, but paradoxically, the more mature your system, the more CPU can tell you; you can catch problems before they happen. It can help you notice things like bad CPUs.

Memory stays pretty important in my experience; even more than CPU.

And in addition to all the other responses, there are also different levels of pages: some are "page me at 5am", some can wait till morning, and some can wait till Monday. FAANG is more likely to have their own hardware, so you actually get deeper/more diverse monitoring needs than a shop on AWS or something.

Source: FAANG-ish tier infra work


Granted, the article is a bit of a "decaf soy latte", but in my experience, whatever can be monitored should be monitored.

Software deliveries/releases can often realistically be non-perfect. (I don't have direct experience with canary releases, TBH.)

In case anything goes wrong any objective evidence which helps to reconstruct the failure scenario is valuable.

Also... Murphy's Law.

>If you've designed software where the whole service can degrade based on the CPU consumption of a single machine.

Typically, if such software is indeed released, I think it will be several CPUs on several hosts.


Outliers are where the interesting stuff happens, and outliers happen to individual instances. Aggregates are useful but can be very misleading. You can have millisecond 99th-percentile latency with ~1% of requests timing out.

I wouldn’t alert on a single machine having CPU issues, but I’m definitely interested in a small collection of individual machines all having CPU issues at the same time.


This is an incorrect statement. CPU utilization and memory matter because they limit how many other containers you can load on the same host, which means it becomes more and more expensive to run that particular service.


> If you've designed software where the whole service can degrade based on the CPU consumption of a single machine, that right there is your problem and no amount of alerting can help you.

Unless it's your database.


If you have "the database" then you're fucked anyway and probably your thing isn't on the scale that we are discussing.


If you're using a sharded SQL database then a single machine going bad could still affect thousands of people.



