Rapid release at massive scale (facebook.com)
220 points by lxm on Sept 1, 2017 | 97 comments



One of the undersold (imho) parts of this post is the system that tiers releases and gates each stage. When most companies talk about CI/CD, they mean that master gets deployed to production, full stop, and rollbacks mean changing the code. In reality, when code hits master there is ALWAYS a lag while it gets deployed, and it's worth having a system that holds that source of truth. Where release engineering gets interesting is how you handle the happy path vs. a breaking release.

I like that Facebook separated out deploy from release. It means that you can roll the release out relatively slowly, checking metrics as you go. Bad metrics mean blocking the release, which means turning off the feature via feature flag. I think for the rest of the world, that would mean halting the release and notifying the developer.
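A rough sketch of that deploy-vs-release split with a feature flag (everything here is made up, just to show the shape of it): the new code path ships dark, "releasing" is a config flip, and a bad release is turned off again without a rollback deploy.

    # Minimal deploy != release sketch. The flag store stands in for whatever
    # config/flag service you actually run; all names are illustrative.

    class FlagStore:
        def __init__(self):
            self._flags = {"new_ranking": False}   # deployed, but not yet released

        def is_enabled(self, name):
            return self._flags.get(name, False)

        def set(self, name, value):
            self._flags[name] = value              # a config change, not a deploy

    flags = FlagStore()

    def render_feed(user_id):
        if flags.is_enabled("new_ranking"):
            return f"feed for {user_id} (new ranking)"   # new code path
        return f"feed for {user_id} (old ranking)"       # existing behavior

    flags.set("new_ranking", True)    # "release": flip the flag, watch metrics
    print(render_feed(42))
    flags.set("new_ranking", False)   # bad metrics: block the release instantly
    print(render_feed(42))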

Disclosure: I work with smart people who spend lots of time thinking about this and writing blog posts like "Deploy != Release": https://blog.turbinelabs.io/deploy-not-equal-release-part-on...


This is true, except it has a huge underlying requirement: that all deployments are forwards and backwards compatible, i.e. a running service must be able to talk to the older version of itself, and vice versa (and of course the same goes for the whole chain of dependencies). This is a much bigger knowledge investment, and easier said than done.

It pays off in the end, but it's not worth making it a "criterion for success" when breaking out from branch-based to trunk-based continuous delivery; otherwise the switch to trunk will most likely never happen.

Shameless plug: at goeuro.com we shifted from branch-based to trunk-based CD in a short time (<3 months) across a diverse set of services and workloads, by applying a holistic socio-cultural, technical and process approach. Could be of interest if anyone is trying to make the switch: https://youtu.be/kLTqcM_FTCw


> that all deployments are forwards and backwards compatible

This is critical regardless in SoA / anything other than strict blue-green deployment.


This is what we decided to do as well at my work. We've been using LaunchDarkly to handle the feature flagging, and we slowly migrate features from 0% to 100%, or first roll out features to less important customers, etc.

Works like a charm; our product owner and engineering team are happier than ever, since it takes away a big part of the risk and makes rollbacks a lot faster.
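For anyone curious what that looks like in application code: roughly this, using the older user-dict flavor of LaunchDarkly's server-side Python SDK (newer versions moved to contexts, so treat it as a sketch rather than current docs; the key and flag names are made up). The percentage rollout and targeting rules live in the flag's configuration, not in the code.

    import ldclient
    from ldclient.config import Config

    ldclient.set_config(Config("YOUR_SDK_KEY"))
    client = ldclient.get()

    # Targeting attributes let you roll out to "less important" customers first.
    user = {"key": "user-42", "custom": {"plan": "free"}}

    # The SDK evaluates the flag's rules: percentage rollout, targeting, kill switch.
    show_new = client.variation("new-billing-page", user, False)
    print("new billing page" if show_new else "old billing page")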


How are they able to slice their userbase into thousands of pieces to do this gradual release?


One way to accomplish this would be a reverse proxy layer with sticky sessions that routes some percentage of traffic to X and some to Y.
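Concretely, the "thousands of pieces" usually comes from deterministically hashing a sticky key (session or user id) into a few thousand buckets and sending a configurable slice of buckets to the new version. Because the hash is stable, a user stays on the same side as you ramp the percentage up. A toy version of the routing decision (backend URLs and bucket count are made up):

    import hashlib

    BUCKETS = 10_000
    OLD_BACKEND = "http://app-v1.internal"
    NEW_BACKEND = "http://app-v2.internal"

    def bucket_for(session_id: str) -> int:
        digest = hashlib.sha1(session_id.encode()).hexdigest()
        return int(digest, 16) % BUCKETS

    def pick_backend(session_id: str, new_version_pct: float) -> str:
        # new_version_pct=0.5 sends ~0.5% of sessions (50 of 10,000 buckets) to v2.
        threshold = new_version_pct * BUCKETS / 100
        return NEW_BACKEND if bucket_for(session_id) < threshold else OLD_BACKEND

    print(pick_backend("session-abc123", new_version_pct=1.0))   # ~1% canary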


This is in fact exactly what we (https://www.turbinelabs.io) do. With enough proxy work you can tee some traffic to version Y and observe success rate/latency, then discard the response (or compare it with version X's response), at least for idempotent requests.
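A sketch of the tee-and-compare idea for idempotent requests (URLs are placeholders, and in practice the proxy does this asynchronously rather than inline in application code as shown here):

    import requests

    CURRENT = "http://app-v1.internal"
    CANDIDATE = "http://app-v2.internal"

    def shadowed_get(path, timeout=2.0):
        live = requests.get(CURRENT + path, timeout=timeout)
        try:
            shadow = requests.get(CANDIDATE + path, timeout=timeout)
            if (shadow.status_code, shadow.text) != (live.status_code, live.text):
                print(f"mismatch on {path}: {live.status_code} vs {shadow.status_code}")
        except requests.RequestException as exc:
            print(f"candidate failed on {path}: {exc}")   # counts against v2's success rate
        return live   # the shadow response is observed, then discarded

    response = shadowed_get("/api/feed?user=42")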


A sort of A/B test, where what is tested is the implementation.


>>> Where release engineering gets interesting is how you handle the happy path vs. a breaking release.

It's simple. Never make a breaking release.


This is basically a "We kill the Batman" statement on its own :)


On the contrary, it's more of a "don't kill the Batman".

How hard can it be to not do something? Not very difficult.


But you are doing something, writing then releasing software, regardless. It's silly to claim that "not making mistakes" is easier than "making mistakes".


Yes, it's very easy if you never release.


You're referring now to "Don't commit any crimes"


It makes sense upon reflection, but something I thought interesting was how they run linters and static analysis tools in parallel with building. I've been used to build-and-test pipelines where these are done serially, because what's the point in building if the linter fails, and what's the point in static analysis if the build fails? But the point is, they can be done in parallel and that reduces latency of feedback to the developer.
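The same trick is easy to reproduce on a smaller scale: kick off the build, the linter, and the type checker concurrently and report every failure at once, instead of learning about them one serial stage at a time. A minimal sketch (the commands are placeholders for whatever your project uses):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    CHECKS = {
        "build": ["make", "build"],
        "lint": ["flake8", "."],
        "typecheck": ["mypy", "."],
    }

    def run(name, cmd):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        return name, proc.returncode, proc.stdout + proc.stderr

    # Run all checks in parallel; latency is the slowest check, not the sum.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda item: run(*item), CHECKS.items()))

    failed = [(name, out) for name, code, out in results if code != 0]
    for name, out in failed:
        print(f"--- {name} failed ---\n{out}")

    raise SystemExit(1 if failed else 0)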


Note: Everything in this comment is based on personal experience and observations as a Facebook user. They're all my opinions too.

In my experience of using Facebook (primarily groups), it is a highly buggy platform, and it's very hard to say that it behaves consistently or that features are really ready before release. This, IMO, implies that development and testing as well as the rollout of releases are messed up. This post talks about building a better "conveyor belt", so to speak, to release changes, but if the basic product is buggy and didn't get good attention in design/dev/test, no improvement in the "conveyor belt" can make it awesome.

Standard features that have existed for a long time may or may not work (how good are the regression tests, then?). Posts and comments in groups sometimes just disappear (thank goodness an admin activity log was added in the last several months, so we can stop wondering whether an admin deleted anything). There's a feature in groups that requires people wanting to join to answer some questions set up by the admins. Most people submit the answers, but those never get saved, and it's unknown what the trick is to get the answers to stay (this has been around for several months now?). New features aren't always announced.

I see Facebook as a platform that's used to share ephemeral things. So this level of quality is probably just ok (though I don't believe it justifies the company's revenues and valuation).

Since I do not conform to Facebook's ridiculous policy on using "real/authentic names", I don't even venture into contacting support if I see any issue, lest my presence be obliterated (yes, I try to keep away from Facebook, but do need it for some important awareness building because there's a large audience there).

As a platform used by billions, Facebook still has a very long way to go in being reliable.


It's not clear from the text how they deal with the severe production bugs. By the time the bug is found, the master branch is full of new code. So your bugfix has to deploy with all that new code?

And with so many people checking in to the master branch, how is it not permanently broken? With 1000 devs pushing code, you're bound to have severe bugs daily.


It's not like people push straight to master. There is most likely a pull request system or other form of review tool in between, with both code review and static analysis.


I can't be the only one immature enough to read that title as a double entendre.


Clearly, the solution is a middle-out approach. ;)


You very well might be


Is their distributed database code also in this same repo and goes through the same process?


(Former FB eng.) No. Storage and backend services are on a different tier and release schedule entirely.


I've had a deployment cycle like this since I started using Google App Engine... 6 years ago.


The interesting thing about the article isn't that they're able to release continuously; there's nothing technically hard about deploying quickly. The interesting thing is that they're able to make the continuous release system work with thousands of engineers actively working in the same codebase without destroying quality. The three-tiered release system, monitoring alerts, feature flags, and good testing infrastructure seem to be what makes all of that possible.


Exactly - releasing is the easiest part of the process.

A lot of orgs don't have continuous deployment because of reasons such as:

- they don't have a good enough automated testing suite (or at least don't trust it fully), and thus rely on "sign-offs" to have people commit to vouching for quality

- they don't measure in production properly (no real error alerts, no way to measure release success), and often deal with things in a "go or no-go" type way

- they don't canary test. To me this one is critical: the only way to get real production feedback is to have real production users actually using the site/platform/app, even if it's just a sample of them, to see what could go wrong, especially with new features (rough sketch of a canary gate below)

A lot of managers I've worked with are shocked whenever I pull out the "continuous deployment is easy. doing it well is hard" line.
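The canary gate mentioned above doesn't have to be fancy; the core of it is just "is the canary group's error rate meaningfully worse than the baseline's before we promote?" A rough sketch (the metrics source and thresholds are made up):

    def error_rate(errors, requests):
        return errors / requests if requests else 0.0

    def canary_ok(canary, baseline, max_ratio=1.5, min_requests=1000):
        # Not enough canary traffic yet: keep waiting rather than promote blindly.
        if canary["requests"] < min_requests:
            return False
        c = error_rate(canary["errors"], canary["requests"])
        b = error_rate(baseline["errors"], baseline["requests"])
        # Tolerate some noise, but block the rollout if the canary is clearly worse.
        return c <= max(b * max_ratio, 0.001)

    baseline = {"requests": 500_000, "errors": 450}
    canary   = {"requests": 15_000,  "errors": 19}
    print("promote" if canary_ok(canary, baseline) else "halt and investigate")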


Yup, we do all that, except the 10k developers scale. Ha.

I had the advantage of starting with fresh codebases and a small team. Obviously, adding this to an existing organization is much more difficult.

When you set up your system correctly from the start, it also becomes a great hiring tool. Once you show developers the environment you work in, mouths drop and they almost beg to work for you.

Doing it is hard, but not impossible. At this point there really is no excuse not to start things off like this. It is about 1-2 weeks of effort to set up a new project with all the right tooling on top of GCP, thanks to the features they give you as part of their platform.


I find a big thing too is building up the mindset of always rolling forward. A bug in production means the next production build will fix said bug; no hotfixing just to say the release is done, etc.

Completely agree that from a tech side there is no excuse; however, a lot of QA culture has persisted through orgs, and they want to keep that feeling of control (even though automation does it way better than they do).


A lot of organizations don't have continuous deployment because they can't risk having everything broken by a developer who is just playing around. When they want to release, they review everything, test, and go through QA.

Facebook is not important. It has no impact when it's broken.


I helped build a business that did about $80m in gross revenue in the first year. We launched the initial version in 3 months (which we predicted to within a week).

Started with 2 engineers (myself and another guy) and grew it to about 15. Zero QA, Zero DevOps.

We had CI/CD and a full test suite. We deployed from master as many times a day as we needed / wanted.

It can work if you open your mind to it and you hire the right people who know what they are doing.


And I was at a business with $800M and 30 employees.

Just because it releases quickly and has no QA doesn't mean it's a good thing.

The only metric that matters is calls from your users. Facebook doesn't even have a number to call when it's broken.


Keep in mind that the last quarterly earnings report put revenue at 9.32 billion USD. That is approximately 100 million a day. I'd say downtime at that scale is important.

Also: People tend to call the police: http://time.com/3071049/facebook-down-police/


It's extremely low revenue per person and per page view.


Except to their own company's bottom line, generating millions in revenue?

Releasing less often is a way to guarantee that bigger bugs will get through at some point, requiring hotfixes etc. The more you release, the higher-quality your releases are and the smaller your production incidents.

The point isn't to remove QA, it's to trust that the automation in place is high quality and will catch the majority of issues before they are issues (i.e. if something makes it through the automation, then it should be caught in the internal release, or at least the 3% canary group), and then on the back of issues, make the automation more robust.

The more people who are introduced into a process, the more likely it is to fail at some point. The fact is that Facebook has a pretty low rate of huge production issues compared to most software companies; they must be doing something right.


Whereas I'm sure whatever you worked on had 1.32 billion active users every day.

Facebook is "7th most valuable company in the world" important.


1.32 billion FREE users.


1.32 billion units of inventory

FTFY


It's interesting that static or dynamic automated security testing doesn't appear anywhere in their process.


When both the delivery (pipelines) and the units going into them (container images with deployment descriptors) are automated, it's really easy and straightforward to plug in a variety of automated checks (e.g. https://github.com/coreos/clair, organizational policies, governance, etc.)
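For example, an image-scan gate is just one more step in the pipeline: scan the candidate image, block promotion on high-severity findings. With Clair specifically you would submit the image layers to a Clair service and read back its vulnerability report (community wrappers such as clair-scanner turn that into a single command). The command and flags below are placeholders, not a real CLI:

    import subprocess
    import sys

    IMAGE = "registry.example.com/myapp:candidate"

    # Placeholder command: substitute your scanner and its real flags.
    scan = subprocess.run(["image-scanner", "--fail-on", "high", IMAGE])

    if scan.returncode != 0:
        print(f"blocking deploy: high-severity findings in {IMAGE}")
        sys.exit(1)

    print("scan clean, promoting image to the deploy queue")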


I have to believe it is part of their pipeline. Even smaller companies do at least static analysis, if not much more.


"engineer productivity remained constant for both Android and iOS, whether measured by lines of code pushed or the number of pushes."

Both of those metrics are incredibly shit ways to measure productivity.

I guess that's one explanation as to why the Facebook app is 200MB+. They've been super productive with all those lines of code.

A better metric would rely on actual features or bugs, imo.


I think it's obvious that they're bad at a small scale (individually or for smallish teams, maybe < 30 engineers?), but I don't think they're _obviously_ bad for teams of 50+ engineers.

Also, there is a difference between using them as metrics that you want to raise vs metrics you just don't want to drop or fluctuate wildly over time.


Some people at Google are actually quite proud of six digit _negative_ line counts they've contributed. And I think they have every right to be proud of that.


Facebook has a "dead code society" t-shirt for these kind of single commit delete numbers as well :)


That’s why you track changes made.

Throughout my career I've found that the number of changes committed correlates pretty well with features delivered and bugs fixed, and also with feature and bug size.

Yes there are edge cases when it takes a day of debugging to make a 1-line fix, but those are rare. Just like it’s very rare to deliver a useful new feature by changing a single line.

Yes there are also features that are tracked as a real ticket and require a 1-line copy change and nothing else. Nobody thinks doing those makes you hella productive, it just needs to be done.

As for padding lines and changes: that's what code review is for.


That would be a bad metric for those "negative" folks as well. Their job is called "code cultivation", and they modify everything in Google's monorepo to make it suck less. Often this is done using automated tools, with a subsequent full run of affected tests and human code review. I've never seen anything like this anywhere else, and I probably won't for the rest of my career.


As to what one should track: one should track _results_. If you've managed to accomplish something useful with very little code, that is decidedly better than accomplishing the same with much more code and hundreds of check-ins. Simplicity is the ultimate sophistication.


> one should track _results_

One should, but companies often care more about effort than results. They can manage based on effort, they can't manage based on results.

If you spend 2 days getting the same results as somebody else does in 5 days, guess what, they don't want you milling around those extra 3 days and bringing morale down. Gotta give you more work!


And worse: in more than one megacorp (including Google in the late aughts) I've seen _complexity_ as a requirement for promotion. Google has tried to get rid of that de jure, but it remains a requirement in practice, so people do what they are rewarded for: create complexity.


Well at that point I think you've hit the agency problem where management's goals and the organization's goals are different. The organization's goal is delivering results but that is not necessarily what management wants right now.


If you use "number of lines of code" as a metric for productivity, soon people will double-space their code. We went through that in the 90s.


Were there any other signs you had shitty programmers on your team?


It's not about shitty or good, it's about management. Whatever you reward, that's what people will give you.

See also: https://users.cs.duke.edu/~ola/courses/cps108/code/java/harp...


You're arguing against an extreme, though. Sure, if LOC added (or even just changed, which is the saner metric) was linked directly to bonuses, it'd be ripe for abuse. But that completely ignores the more reasonable use in the article, as one (of several) metrics used to measure the impact of their release process.


You should measure what you want more of. Unless for some reason you want more lines of code, then use a different metric.


They should measure and reward lines of code removed instead.


Measuring productivity in LOC...


I just think they are bad for anyone. It's not hard to measure actually useful things like features or bugs.

Why even bother measuring lines of code / number of commits?


Measuring bugs and features leads to other, equally problematic behaviors.


> It's not hard to measure actually useful things like features or bugs.

Presumably your consulting business has really taken off now that you have such unique insight into a historically unempirical area of our field?


Obligatory Bill Gates quote:

Measuring programming progress by lines of code is like measuring aircraft building progress by weight.


I like this quote because it's so accurate. LOC does provide some information. You'd be unhappy if either LOC or weight were close to zero.

That said, one of my favorite metrics is number of deleted LOC.


Literally the next sentence: """Similarly, the number of critical issues arising from mobile releases is almost constant regardless of the number of deployments, indicating that our code quality does not suffer as we continue to scale."""

This is 'productivity' in the sense of the release eng team attempting to measure the volume of code changes they're pushing out and whether changes they made to the push/release process impacted other teams' ability to iterate. Unless you have some insight as to why adopting a more continuous deployment process would change the ratio between LOC pushed or code-changing-diffs pushed to features-or-bugs specifically at Facebook such that they would not function as acceptable proxies, questioning the metric chosen seems hollow. It's certainly not 'productivity' in the sense of whether people on product teams are considered to have done a good job or are deserving of bonuses / promotions.


I think what they are trying to measure is how much this build/release system affects the productivity of the engineers and changes their behavior, not how productive the engineers are per se.

If the build system is super slow and just breaks all the time, you're going to see a drop in the number of commits landed per day. If git is completely down and commits per day drop to 0, then you have an outage ;)


Some people, perhaps you as well, take offense at LOC metrics because they're (in theory) easy to game, via redundant code, whitespace, verbose vs concise languages, etc. And because good developers know that small code and systems mean fewer opportunities for lurking bugs and points of failure.

BUT, if a team of developers is not padding their code, nor has shitty programmers who write page after page of nonsense, then in the end the proportion of functionality to LOC should be fairly stable, and thus LOC is a useful and meaningful proxy metric for the amount of new or changed functionality in a codebase.


I'm not offended, I just think it's shit.

It's not difficult to track "velocity" via story points, features, or bugs, and that is at least a step in the right direction.

LOC going down is often adding the most value.


Story points vary wildly between teams. I'd argue that they're a worse metric than LOC


Sure, deleting unused or bad code is a good thing, or refactoring or rewriting something more tightly. But you can't implement a new feature just by deleting code. I've never been in a codebase that did more and more by getting smaller.

People also take offense because a management weenie can misread the metric and penalize people who delete code, or write concise code, or solved a super tricky critical bug after a month of debugging with a one-line change.

But I still claim, these all normalize in the end. If a developer on my team has committed only 30 lines in three months, sorry but I'm gonna be skeptical of their actual productivity.


> If a developer on my team has committed only 30 lines in three months, sorry but I'm gonna be skeptical of their actual productivity.

I've just committed 3 lines that fix a bad intermittent race condition that took over a month to track down (partially because I can't test it and have to rely on other people who aren't 100% available). Is that unproductive?


It's not.

On the other end, I'd fully expect someone doing that to produce: the 3 lines of fix, plus a comprehensive documentation of what happened here, how to diagnose it and track it down, and how to avoid it in the future (ideally, if the company supports it, with an oral explanation to the teammates concerned).

In the end, I'd be very skeptical of one month of work leaving 3 LOC as the only tangible output too.


> A comprehensive documentation of what happened here, how to diagnose it and track it down, and how to avoid it in the future

The problem and diagnostic conversations are documented in the JIRA ticket; the solution is documented in another ticket which "fixes" the first one (for abstruse internal reasons); and more commentary is provided in both the comments around my fix and my verbose commit message.

(Also there's a whole bunch of angry Slack ranting about this issue but that probably doesn't count as "documentation".)


I never meant to say you did a bad job (looks like you certainly did a great one!), just that a month of work typically leaves in the end, for good reasons, more than 3 LOC in "committed" (in the sense of accessible to the company as a long-term asset) work, which was apparently the case for you too!


> leaves in the end, for good reasons, more than 3 LOC in "committed" work

I suppose it depends what people mean by LOC. I guess my 3 LOC isn't actually just those 3 LOC. Definitely an interesting thing to think about at least. Ta!


During the time those others aren't available, as you say, I would expect you to be going through the backlog and looking for other ways to contribute.


That assumes there's sufficiently small chunks of work in the sprint (we're not allowed to touch anything out-of-sprint) to fit into the gaps whilst I'm waiting.


You may want to push back on that. I understand the reasoning, but any time upper management tells people not to work on stuff, there is usually a communication breakdown somewhere.


> But you can't implement a new feature just by deleting code.

No, but you can implement a "story", i.e. the removal of code is presumably a ticket somewhere in the system. (Or could become one.)

> I've never been in a codebase that did more and more by getting smaller.

Run on a smaller system? Run faster?

(I'm not sure whether or not I agree with the parent's point, nor yours, just offering suggestions.)


> If a developer on my team has committed only 30 lines in three months, sorry but I'm gonna be skeptical of their actual productivity

If LOC is what you rely on, I'd be sceptical of many things, sorry not sorry.


>Both of those metrics are incredibly shit ways to measure productivity.

In theory yes. In practice, when you know your team and their average output and quality, no.

> I guess that's one explanation as to why the Facebook app is 200MB+. They've been super productive with all those lines of code.

Or you know, most of it is assets, debugging symbols, stray frameworks and libs not stripped, etc. As is the case with almost any app beyond a few MB.


> Both of those metrics are incredibly shit ways to measure productivity.

LOC or commits are certainly bad ways of comparing two different people, teams, etc. But they are fine ways of measuring whether a specific event at a specific time changed productivity, assuming a decently large team etc.


Not when they know that's what they're being measured on...


Sure, nobody should set up incentives based on LOC. But in this case it's not about incentives, it's just about seeing if something changed. Nobody writing code is going to be rewarded or penalized if this number changes.


That's not really relevant; if they knew they were being measured before and after the change, it would still show if the new system had an impact on productivity - even faked.


I'd guess LoC changed is implicit in "lines of code pushed", which is a blunt measurement, but isn't the same as the traditional LOC metric and doesn't directly correlate to (rewarding) a large download.


A better measure would be the number of passing tests that prove business logic.


A better metric for a business must include revenue.


I must be really out of the loop because I just don't see enough of what Facebook is doing to justify all of this garbage. There's a lot of lip service in the article about user experience, and I guess there are some changes here and there, but wtf is happening here? I definitely want and use CI/CD tools for my team's software, but what the fuck are you really doing when you are making this many changes per day?

Call me an old fart, but if you are in a situation where you need to make that many changes per day, you are utterly fucked from almost every angle.

Every aspect of this article sounds to me like people have no idea what they are trying to do, so they write code and push it, and it goes live. And everyone is very happy about this, for some insane reason.

No offense to anyone, but this is not a reality I want to live in. And the article doesn't do much to defend the notion.


- They are running a lot of tests to see what works.

- They are doing a lot of fixes. The number of bugs found is proportional to the number of users you have and the number of changes you are making.


The parent makes a point.

Across their whole product line (client apps, the HipHop VM, Flow, whatever), there might be millions of lines of code.

But what exactly seems to actually change year over year on FB itself (server/client of the actual social website) that warrants so many commits?


I would suspect that the engagement level of the site changes. They have the audience to be able to run massive amounts of experimentation to see how people use the site and respond to advertising. To the average user, not much changes except for a feed of news.


Exactly. Only yesterday, my partner was complaining about how the like icons on her Facebook page had become animated. Mine were static at the time - clearly FB had put her in a test bucket for some 'do animated like icons increase engagement?' test.

FB does this stuff all the time.


"Just because you can doesn't mean you should", seems like they forgot about this principal. I think it's important that you can commit and push to prod quickly, shit happens and you need to be able to fix it quickly. But I agree with you, I'm a user not an alpha (or dev) tester.


The massive scale is getting massive at a massive rate.


Hey Facebook, rapid release is great, but you broke my Messages, Notifications, Quick Help, and caret buttons on my https://www.facebook.com navigation bar. It's been broken for a few hours now, clicking on it does absolutely nothing. Chrome 60 browser, macOS.

Maybe you need to slow down.


I see this happen not infrequently. Clearly their values still weight "move fast & [don't worry too much when we] break things" over reliability. Which for a site of this nature is a somewhat defensible choice, but for those of us who come from enterprise software, for example, or expect a reasonably stable user experience, it's occasionally somewhere between disconcerting and annoying.


Nope, it just means that they didn't measure that particular metric. It gets better over time.


Wow, lots of Facebookers downvoting me :)



