One of the undersold (imho) parts of this post is the system for tiering releases, with a gate in front of each tier. When most companies talk about CI/CD, they mean that master gets deployed to production, full stop. Rollbacks mean changing the code. In reality, when code hits master, there is ALWAYS a lag while it gets deployed, and it's worth having a system that holds the source of truth for what is actually running where. Where release engineering gets interesting is how you handle the happy path vs. a breaking release.
I like that Facebook separated out deploy from release. It means that you can roll the release out relatively slowly, checking metrics as you go. Bad metrics mean blocking the release, which means turning off the feature via feature flag. I think for the rest of the world, that would mean halting the release and notifying the developer.
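To make the "roll out slowly, check metrics, block on bad metrics" loop concrete, here is a minimal sketch. It is not how Facebook does it; `staged_rollout`, `set_percentage`, `error_rate` and the 1% threshold are all hypothetical names and values I made up for illustration.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[int],
    set_percentage: Callable[[int], None],
    error_rate: Callable[[], float],
    max_error_rate: float = 0.01,  # assumed threshold, tune per service
) -> bool:
    """Walk through rollout stages; turn the feature off if metrics go bad.

    Returns True if the release reached the final stage, False if it was blocked.
    """
    for pct in stages:
        set_percentage(pct)  # release to pct% of traffic (code is already deployed)
        if error_rate() > max_error_rate:
            set_percentage(0)  # block the release: flag off for everyone
            return False
    return True


if __name__ == "__main__":
    history = []
    observed = iter([0.001, 0.002, 0.05])  # fake metrics: third stage is unhealthy
    ok = staged_rollout([1, 10, 50, 100], history.append, lambda: next(observed))
    print(ok, history)
```

In a real system the metric check would wait for a soak period at each stage, and "notify the developer" would hang off the `return False` branch.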
This is true, except it has a huge underlying requirement: that all deployments are forwards and backwards compatible, i.e. a running service must be able to talk to the older version of itself, and vice versa (and the same goes for its chain of dependencies). This is a much bigger knowledge investment, and easier said than done.
It pays off in the end, but it's not worth making it a "criterion for success" when moving from branch-based to trunk-based continuous delivery; otherwise the move to trunk will most likely never happen.
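For anyone wondering what forwards and backwards compatible looks like at the message level, one common approach is a tolerant reader: ignore fields you don't know, and default fields that older senders don't set. A rough sketch, where the `Order` type, its fields and the `"EUR"` default are invented for the example:

```python
import json
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    amount_cents: int
    currency: str = "EUR"  # hypothetical field added in a later version


def parse_order(payload: str) -> Order:
    """Tolerant reader: works with payloads from older and newer senders."""
    data = json.loads(payload)
    return Order(
        order_id=data["order_id"],
        amount_cents=data["amount_cents"],
        # Backward compatible: older senders omit "currency", so fall back to a default.
        currency=data.get("currency", "EUR"),
        # Forward compatible: any keys this version doesn't know are simply ignored.
    )


if __name__ == "__main__":
    old = '{"order_id": "a1", "amount_cents": 500}'
    new = '{"order_id": "a2", "amount_cents": 700, "currency": "USD", "coupon": "X"}'
    print(parse_order(old))
    print(parse_order(new))
```

This only covers the wire format. Database schema changes and behavioural changes need the same discipline, which is where most of the investment goes.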
Shameless plug: at goeuro.com we shifted from branch-based to trunk-based CD in a short time (<3 months) with a diverse set of services and workloads, by applying a holistic socio-cultural, technical and process approach. Could be of interest if anyone is trying to make the switch: https://youtu.be/kLTqcM_FTCw
This is what we decided to do as well at my work. We've been using LaunchDarkly to handle the feature flagging, and we slowly migrate features from 0% to 100%, or first roll out features to less important customers, etc.
Works like a charm. Our product owner and engineering team are happier than ever, since it takes away a big part of the risk and makes rollbacks a lot faster.
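For the curious, the 0% to 100% migration can be sketched with stable hash bucketing. This is a generic illustration, not LaunchDarkly's SDK or algorithm; `bucket`, `is_enabled` and the flag key are hypothetical names.

```python
import hashlib


def bucket(flag_key: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_key: str, user_id: str, rollout_percent: int) -> bool:
    """A user is in the rollout if their bucket is below the current percentage."""
    return bucket(flag_key, user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for pct in (0, 10, 50, 100):
        enabled = sum(is_enabled("new-checkout", u, pct) for u in users)
        print(f"{pct}% rollout: {enabled} of {len(users)} users enabled")
```

Because the bucket depends only on the flag key and user id, raising the percentage only ever adds users, and a rollback is just setting the percentage back to 0 with no deploy involved.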
This is in fact exactly what we (https://www.turbinelabs.io) do. With enough proxy work you can tee some traffic to version Y and observe success rate/latency, then discard the response (or compare it with version X's response), at least for idempotent requests.
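A toy version of the tee-and-compare idea, to show the shape of it. This is not Turbine Labs' implementation; `handle_with_shadow`, `ShadowResult` and the handler callables are hypothetical, and a real proxy would send the shadow request asynchronously so it can't add latency.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class ShadowResult:
    matched: bool
    shadow_error: Optional[str]


def handle_with_shadow(
    request: Any,
    primary: Callable[[Any], Any],   # version X, the one users actually see
    shadow: Callable[[Any], Any],    # version Y, under observation
    record: Callable[[ShadowResult], None],
) -> Any:
    """Serve from version X; also send the request to version Y and compare.

    Only safe for idempotent requests, since the request is executed twice.
    """
    response = primary(request)
    try:
        shadow_response = shadow(request)
        record(ShadowResult(matched=(shadow_response == response), shadow_error=None))
    except Exception as exc:  # a failing shadow must never affect the user
        record(ShadowResult(matched=False, shadow_error=repr(exc)))
    return response  # the shadow response is discarded


if __name__ == "__main__":
    results = []
    out = handle_with_shadow(3, lambda x: x * 2, lambda x: x + x, results.append)
    print(out, results)
```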
But you are doing something, writing then releasing software, regardless. It's silly to claim that "not making mistakes" is easier than "making mistakes".
I like that Facebook separated out deploy from release. It means that you can roll the release out relatively slowly, checking metrics as you go. Bad metrics mean blocking the release, which means turning off the feature via feature flag. I think for the rest of the world, that would mean halting the release and notifying the developer.
Disclosure: I work with smart people who spend lots of time thinking about this and writing blog posts like "Deploy != Release": https://blog.turbinelabs.io/deploy-not-equal-release-part-on...