I'm less enthused about 'rollbacks' being considered 'normal'. They signify some...

nurettin · on April 4, 2017

You are a CTO sitting atop very expensive hardware and software. Would you start removing deployment and runtime safety guards (such as a consumer-facing staging environment) because you want to "discipline coders and devops"?

d4mi3n · on April 4, 2017

A post-mortem should never be about placing blame on individuals; it should be about identifying flaws in a system or a process.

There are places where post-mortems can turn into blame games, but in my experience such things are counter-productive to actually solving problems. Luckily, there are plenty of engineering organizations that do not make this mistake! :)

qqg3 · on April 5, 2017

The easiest way to avoid that is to have a well-structured post-mortem process and to post-mortem everything: successful and unsuccessful releases alike.

ojilles · on April 5, 2017

We need to go from postmortem to postpartum!

solatic · on April 4, 2017

On the one hand, rollbacks need to be culturally normal. Finding out that an essential part of your process (essential because it keeps your mean-time-to-repair low) is undependable because the last time you ran it was half a year ago, and right now is precisely when you need it to be dependable, well, that royally sucks.

On the other hand, what you're talking about shouldn't be that hard to implement. Just hook your rollback system into your issue tracker to create a post-mortem issue whenever a rollback is necessary, and assign it to whoever initiated the deployment (or their manager). Easy.

On the third hand, you might find that a lot of your post-mortems end up looking something like "we don't have reproducible builds and we made a managerial decision not to invest in that now". And now you've just created a ton of recurring paperwork for everyone with little benefit.
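For illustration, the "hook your rollback system into your issue tracker" idea above might look roughly like the sketch below. Everything in it is an assumption made up for the example: `InMemoryTracker` stands in for whatever issue tracker you actually use, and `rollback_with_postmortem`, `release`, `deployer` and `reason` are hypothetical names, not any real tool's API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class Issue:
    title: str
    assignee: str
    body: str


@dataclass
class InMemoryTracker:
    """Hypothetical stand-in for a real issue tracker client."""
    issues: List[Issue] = field(default_factory=list)

    def create_issue(self, title: str, assignee: str, body: str) -> Issue:
        issue = Issue(title=title, assignee=assignee, body=body)
        self.issues.append(issue)
        return issue


def rollback_with_postmortem(
    tracker: InMemoryTracker,
    do_rollback: Callable[[], None],
    release: str,
    deployer: str,
    reason: str,
) -> Issue:
    """Run the rollback, then file a post-mortem issue for the deployer."""
    # Restoring service comes first; the paperwork follows.
    do_rollback()
    when = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return tracker.create_issue(
        title=f"Post-mortem: rollback of {release}",
        assignee=deployer,
        body=f"Rolled back at {when}.\nReason given: {reason}",
    )
```

Calling `rollback_with_postmortem(tracker, my_rollback, "v1.2.3", "alice", "error rate spike")` would run `my_rollback` and leave one issue assigned to `alice` in `tracker.issues`. Whether that issue turns into useful analysis or recurring paperwork is the organizational question, not the technical one.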

friendzis · on April 4, 2017

It is not that hard to imagine that every failed request/action in your application costs the client some real money (cannot complete a sale, must record data locally and then re-enter it, etc). If that cost can be passed back to the vendor (fines, compensation, lost sales), it amounts to increased operating costs for the duration of the outage. Can you be so sure of your pre-release process that you can do without the ability to 'roll back' to a cheaper (earlier) state? I guess not.

While I agree with your points that it is better to catch errors earlier in the pipeline and that mini-postmortems are necessary, I personally think that rollbacks are inevitable, and I compare them to backups. Of course it is better not to restore from backups, and it is easy to rationalise good processes over a few metric tons of never-read backup tapes, but a single accidental drop of the production database may quickly pay for all the effort put into ensuring that backing up works.

taeric · on April 4, 2017

If it were an automated rollback, a system that automatically generated either test data or some other heatmap of the system when the rollback was called would be awesome.

It is amusing, because we are basically saying it would be awesome to have a core dump when the crash happens. Which... used to be standard behavior but was essentially lost in most modern development environments. (Not to mention the tooling lagged in two directions. One to help you with the core dumps, and the other to make usable core dumps.)
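As a rough sketch of the "core dump at rollback time" idea above: capture a small diagnostic snapshot before the rollback runs, so the evidence is not lost when the old version comes back. This is far less than a real core dump, and the names `capture_snapshot`, `rollback_with_snapshot`, `extra` and `sink` are hypothetical, chosen only for the example.

```python
import json
import platform
import sys
import time
import traceback
from typing import Any, Callable, Dict


def capture_snapshot(extra: Dict[str, Any]) -> Dict[str, Any]:
    """Collect a small, JSON-serialisable picture of the current process."""
    return {
        "captured_at": time.time(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # The call stack at the moment the rollback was requested.
        "stack": traceback.format_stack(),
        # Caller-supplied context, e.g. release name or error counters.
        "extra": extra,
    }


def rollback_with_snapshot(
    do_rollback: Callable[[], None],
    extra: Dict[str, Any],
    sink: Callable[[str], None],
) -> Dict[str, Any]:
    """Persist a snapshot first, then perform the rollback."""
    snapshot = capture_snapshot(extra)
    sink(json.dumps(snapshot))  # write before the state is replaced
    do_rollback()
    return snapshot
```

Here `sink` could be anything that stores a string, such as a function that appends to a log file. The ordering is the point: the snapshot is written before `do_rollback` replaces the state you wanted to inspect.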

_pmf_ · on April 4, 2017

> They signify something didn't go quite right with your unit/integration/qa process

Staged rollouts are part of the QA strategy (whether this is something to be aspired to is another question).

Xylakant · on April 4, 2017

Given that it's impossible to simulate the real world in QA, it's probably the best you can do. There's just no way to have all configurations for all clients in your test lab.

mabbo · on April 4, 2017

[Deleted because I'm being a dickhead, and can see that]

smt88 · on April 4, 2017

This is a straw-man argument. GP didn't argue against rollbacks. S/he argued against considering them to be normal.