> Frequent upgrades amortize the cost and ensure that regressions are caught early. No one upgrade is likely to end in disaster because there simply isn’t enough change for that to happen.
Oh, how I wish this were true.
For what it's worth, it's pretty true as far as OpenBSD is concerned, in my experience. But OpenBSD is the exception here, not the rule. Everywhere else, developers all seem to have embraced "break early, break often".
Eventually you get burned. For me, it was a routine should-have-been-minor web server update where one of the packages I relied on suddenly became unsupported and every single hosted site stopped working. Since there's no way to roll back server upgrades, I had a marathon night involving building a new server stack and migrating all hosted sites there by 8 a.m.
But you can't yell at anybody when that happens, because the answer's always the same: it's not the developers' fault.
Who really believes sysadmins wouldn't update everything all the time if they could? Old, dodgy, out-of-date servers exist exactly because updates are butthole-puckering, because everyone's been burned at least once by a "minor" update, and because once the damage is done, undoing it is horrifyingly difficult.
Many times you have to hold back software upgrades on things like MRI scanners to wait for multi-year research studies to complete, and often new studies start up in the interim, which locks you down for even more time. Scanner upgrades change all sorts of things in ways that introduce all sorts of confounds.
Not to mention that in the real world, scanner upgrades often break surprisingly fragile clinical workflows. Technically, the engineering and processing of the scanners are improved quite a bit in one aspect or another by the upgrades, but old workarounds need to be replaced by new workarounds, etc., and the documentation is very sparse and quite uninformative.
Hmm... what does an MRI scanner update do? I would have thought a system like that would just record whatever it gets from its sensors and any updates needed would only apply to the analysis and visualization software... Do the updates actually modify what the hardware does during scanning? Or, is it all painfully coupled because some sort of interactivity is required during scanning?
A single measurement works like this: you magnetize the body in a particular pattern, then watch how that magnetic pattern fades. Then do it again in another pattern and repeat. Think of the patterns as being terms in a Fourier series, which you eventually will do a Fourier transform on to get the original thing.
The name of the game, therefore, is to get away with as few measurements as possible, and to be able to perform measurements here while still recovering from measurements there. Oh, and while we're at it, let's try not to be thrown off by things that tend to move around. Like arteries do every time another heartbeat comes through.
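In textbook terms (a rough sketch, not anything vendor-specific), each readout samples the Fourier transform of the object at one spatial frequency k, and the image is recovered by inverting that transform:

    s(\mathbf{k}) = \int \rho(\mathbf{x})\, e^{-i 2\pi\, \mathbf{k}\cdot\mathbf{x}}\, d\mathbf{x}
    \qquad\Longrightarrow\qquad
    \rho(\mathbf{x}) = \int s(\mathbf{k})\, e^{+i 2\pi\, \mathbf{k}\cdot\mathbf{x}}\, d\mathbf{k}

Here rho(x) is (roughly) the spin density you want to image and s(k) is the signal measured while the gradients put you at point k in "k-space"; the pulse sequence decides which k's get sampled, in what order, and how fast.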
So yes... there is a lot of interactivity in an MRI measurement.
You may think they are simple machines which do one job and pretty much never need to be changed (not an unreasonable assumption), but that's not the case. I work in the medical device field and, well, you still have to sell instruments. To do that you need to beat the competition. To do that you need more features which make the doctor's/tech's life easier and the diagnosis more accurate.
That doesn't mean upgrading is easy. At the 510(k)/PMA level it pretty much always requires a re-filing, so you try not to do it often. But you do improve the product over time.
Probably improved image processing - noise removal, sharpness, could be a whole range of things, possibly down to something as seemingly simple as changing motor stepping for some of the actual moving parts.
"Since there's no way to roll back server upgrades"
There is, if you run a modern filesystem like ZFS or btrfs. You just take a cheap snapshot before upgrading (this can be automated) and roll back if there are problems. It even works with LVM.
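A minimal sketch with ZFS, assuming a root dataset named rpool/ROOT (dataset and snapshot names are just examples, and rolling back the filesystem you are currently running from generally also means a reboot):

    # cheap copy-on-write snapshot before touching anything
    zfs snapshot rpool/ROOT@pre-upgrade

    # ... run the upgrade, test ...

    # if it went sideways, roll the dataset back to the snapshot
    zfs rollback rpool/ROOT@pre-upgrade

    # once you're happy with the upgrade, drop the snapshot
    zfs destroy rpool/ROOT@pre-upgrade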
Rolling back an LVM snapshot involves dd'ing the data off the snapshot and onto whatever you want your production disk to be (or just running off the snapshot forever, which has... performance consequences with LVM).
Yes, LVM snapshots exist, but they are of limited utility compared to ZFS and the like.
I've been experimenting with CentOS 6 and ZFS on Linux; so far it looks pretty good. It handles failing consumer-grade hard drives vastly better than LVM on md, and snapshots are inexpensive.
While initially conceptually easier to grasp, that is far inferior to using a snapshot.
Here's a short list of ways in which that may cause you problems:
1) A gzip of a path is not point-in-time: synced files may no longer be in sync, since they were backed up at slightly different times (e.g. I hope you didn't expect database consistency to actually mean anything).
2) A gzip of a path will take a while, because it has to actually process every file (a snapshot is generally copy-on-write, meaning it's "free" (not quite) for every file until it's changed; throw away the snapshot before a change and there's no need to copy the file at all).
Or, you could try out NixOS/GuixSD, which support transactional upgrades and rollbacks for the full system. No need to take a disk image (outside of your normal backup routine, of course).
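For what it's worth, the rollback there is a first-class operation rather than a filesystem trick. Roughly (a sketch, not a full walkthrough):

    # build and activate the new configuration
    nixos-rebuild switch

    # if it misbehaves, switch back to the previous system generation
    nixos-rebuild switch --rollback

    # or list the system generations and pick one explicitly from the boot menu
    nix-env --list-generations --profile /nix/var/nix/profiles/system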
>Eventually you get burned. For me, it was a routine should-have-been-minor web server update where one of the packages I relied on suddenly became unsupported and every single hosted site stopped working. Since there's no way to roll back server upgrades, I had a marathon night involving building a new server stack and migrating all hosted sites there by 8 a.m.
The traditional way to handle this in a cluster is with testing. The idea being that you have the upstream repo, the testing repo, and the production repo.
Note, in most cases 'repo' means "directory tree served via http" and "sync" means "Copy, you know, with rsync or something" - This is not complicated.
10% of your servers point at the testing repo, the rest at the production repo.
Every X days, you set a test box against the upstream repo, update, reboot, and run your tests, you know, to catch the obvious stuff. If that works, you sync your testing repo with the upstream repo, and so 10% of your production now runs your new stuff. This is where you are gonna catch most of your problems, in my experience, but if something chokes, you only lose 10% of capacity.
After Y days of the test stuff being on 10% of your boxes, you sync the test repo to the production repo.
Of course, if you are like me, and when 10% of your boxes are down 10% of your customers are down, you want to spend a lot more effort on the 'test before you get customers on it' step.
Also, sometimes there are 'don't sleep until you roll it out everywhere' updates, like this one (oh god, I am so happy that srn is on that now and I didn't have to deal with it) or like shellshock. In that case, well, sometimes you sync the upstream repo straight to production.
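In case it helps, here's roughly what that plumbing looks like (the paths and mirror URL are invented; use whatever your web server already serves):

    # vendor -> your local "upstream" copy
    rsync -a --delete rsync://mirror.example.com/distro/updates/ /srv/repos/upstream/

    # upstream -> testing: the 10% of boxes pointed at "testing" pick this up
    rsync -a --delete /srv/repos/upstream/ /srv/repos/testing/

    # after Y days without trouble, testing -> production for everyone else
    rsync -a --delete /srv/repos/testing/ /srv/repos/production/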
Even with OpenBSD that isn't always true. The time_t fix killed binary compatibility between 5.4 and 5.5 on 32-bit systems, which meant that any installed packages had to be uninstalled and reinstalled if one were to perform an upgrade. The OpenBSD project isn't afraid to break compatibility if it means fixing a bug - something that I think is a very good thing, but it has implications in terms of support.
On another note, upgrade hell is a pretty convincing argument for systems like NixOS, where upgrades can be easily rolled back, configurations are declarative, etc.
I also cannot recommend the "frequent upgrades" model.
If systems were more stable and fewer problems occurred during updates, this would be right. But the reality is simply different. My experience is that many updates bring surprises:
- software that was once good has simply gone bad since the last version
- problems with less common software combinations that the package maintainers never found
- device drivers that do not cooperate
- desktop environments that no longer support options they used to, or have simply gotten worse
- legacy data that is not supported by newer application versions, or subtle problems that occur with this data ...
- ....
Also, if upgrades were possible without hassle and trouble, that model could work -- but they just are not. For example: I wanted to install a newer Ubuntu version on my hosted server, but the automatic upgrade process explicitly says it should not be done over a remote session. So for a hosted system, I have to fall back to a complete new install (back up the data, fresh install, complete new configuration, restore the backed-up data).
Also, when you get into trouble, it is not possible to easily go back to the last stable version (this is a situation where virtualized systems are very useful).
In such a case, it is clear that I don't want to give up my life just to always have the newest stuff on my server.
I regret that OpenBSD is not for me (unless OpenBSD never suffers from those troubles).
It's worth mentioning that OpenBSD doesn't support a lot of the hardware people use, doesn't support a lot of the applications people use, and is a fairly specialist distro. They do some awesome things, but they also benefit greatly from not having to support regular non-techie end-users.
Edit: As an example, the upgrade process that you have to do at least once a year[1] is hardly something that a tech naif would find painless.
And what hardware does it not support? And what applications does it not support? I'm tired of this meme. OpenBSD runs on virtually everything you throw at it, and virtually all open source software is supported.
In some cases OpenBSD hardware support is better. E.g. I have a few laptops where suspend only works in OpenBSD, but not Linux. Also a few years back my WiFi cards were supported natively only in OpenBSD, not Linux. For other hardware, it might be the other way around, but overall it's pretty unlikely to find something OpenBSD does not run on.
And for software, it's exactly the same. Sure, there's some Linux-specific software out there, but the bulk of it is not.
Hipchat and Skype were two popular applications that I used today that don't run on OpenBSD. I ran them on my laptop with an nvidia GPU, which isn't wonderfully supported by Linux, but it's even less supported by OpenBSD. Nvidia is a pretty damn common brand. Flash, for all its sins, is still popular and not supported properly. Steam is another popular application that doesn't work on OpenBSD. Admittedly you said open source, but frankly, I didn't. I said 'applications people use'. It's utter bullshit to move the goalposts and then chide me for inaccuracy.
Then there's virtualisation software in general, which is increasingly popular and widespread, which OpenBSD doesn't support well, if at all (kvm, xen, vmware, virtualbox, and friends. qemu by itself is s.l.o.w.). Docker and other containers are really taking off at the moment and have a lot of mindspace, though admittedly these are linux-specific. Openstack is another significant emerging bit of software that doesn't support BSD as a host.
Then there's plenty of stuff like this http://blog.lxde.org/?p=1111 where OpenBSD could be better supported but is a broken experience. Legitimate reason, sure (not enough eyes), but it's still a broken experience.
I worked at a firm that supported two main branches that had diverged for more than 5 years. It was a nightmare fixing both simultaneously, especially once the fork was 5+ years old. In the end they wound up breaking something with every new feature.
> Eventually you get burned. For me, it was a routine should-have-been-minor web server update where one of the packages I relied on suddenly became unsupported and every single hosted site stopped working.
I'm not wishing to be inflammatory here, but surely you could have exactly the same problem with something that is updated 'slowly'. If you're relying on the software you get being 100% correct/bug free every time you get it, you're building your house on sand.
This is where having testing (ideally a good set of automated testing) is invaluable. Having a robust set of tests you run before you roll out changes to 'production' is important if your 'production' is "we can't afford for this to go down."
Along with testing, you need to have a backout strategy. What do you do if it all goes wrong? This is usually very similar to your backup/restore strategy, as the problem is generally the same: if your server gets hosed/breaks/fails, what do you do?
>> Frequent upgrades amortize the cost and ensure that regressions are caught early. No one upgrade is likely to end in disaster because there simply isn’t enough change for that to happen.
> Oh, how I wish this were true.
Your anecdote is even more evidence that it _is_ true. Most minor upgrades do not end in disaster; only a few do.
Frequent updates still require some diligence on the part of the person or organization updating.
> Since there's no way to roll back server upgrades,
Apart from the obvious comment about snapshots (at the volume or machine level), really most of the time your problems are because of a specific package.
And while most package managers don't support downgrades per se, you can just remove the offending package and install the old one. Nine times out of ten that would give you more time to fix the problem.
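On Debian/Ubuntu, for instance, the manual fix can be as simple as installing the old version explicitly (the package name and version here are placeholders, and the old version has to still be available in a repo or in /var/cache/apt/archives):

    # see which versions are still available
    apt-cache policy somepackage

    # install the older version explicitly; apt treats this as a downgrade
    apt-get install somepackage=1.2.3-4

    # optionally hold it so the next upgrade doesn't pull the broken version back in
    apt-mark hold somepackage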
That said, you need to test before you go to production, no matter how trivial the patch may seem. Staged rollouts, a separate environment, or both.
And that's also the elephant in the room in Ted's rant. Sure, if we didn't have to test anything, we could update every 6 months. But we have to, and we can't.
> I had a marathon night involving building a new server stack and migrating all hosted sites there by 8 a.m.
This is because of a shortcut that somebody took when building your server originally, by failing to make the deployment reproducible.
> ...where one of the packages I relied on suddenly became unsupported
I've never heard of this happening in any Linux distribution. Can you be more specific? Did you choose to use some third party source for a package here? If so, how can you expect your distribution's developers to support you going off-piste in a way that they never claimed to support in the first place?
> Old, dodgy, out-of-date servers exist exactly because updates are butthole-puckering, because everyone's been burned at least once by a "minor" update, and because once the damage is done, undoing it is horrifyingly difficult.
You might want to look into "DevOps". The idea is that you script your deployments together with automated tests for them. There are many tools to help you do this now. With this in place, nothing you've stated is true any more.
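Even a crude version of this pays off. A sketch of the idea (the package list, config path, and sites file are all invented for illustration):

    #!/bin/sh -e
    # rebuild a web box from a known-good package list, then smoke-test it
    xargs -a packages.txt apt-get install -y   # packages.txt: your pinned package list
    rsync -a conf/apache2/ /etc/apache2/       # config kept in version control
    service apache2 restart

    # smoke test: every hosted site must answer
    while read site; do
        curl -fsS -o /dev/null "http://$site/" || { echo "FAILED: $site"; exit 1; }
    done < sites.txt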
> I've never heard of this happening in any Linux distribution. Can you be more specific?
It was a couple of years ago. I know somewhere I have notes on it, but I can't find them just this minute. I remember that it had something to do with one of the components in my apache-mpm-worker--php5--libapache2-mod-fcgid--apache2-suexec-custom--libapache-mod-security stack. It was something like, mpm-worker no longer supported libapache2-mod-fcgid or some such thing. At the time, I was really good about doing regular server updates, so when it happened, I spent some time researching it but eventually found the package had been removed during the update and was no longer supported, with no workaround aside from finding a new way to build an apache server.
Either it got fixed or I'm running a slightly different stack now than I was at the time. Sorry I can't be more helpful.
> This is because of a shortcut that somebody took...
> You might want to look into "DevOps"...
That somebody was me, and I'm aware of devops. Pretty big fan of it actually. Unfortunately, I'm just a small MSP, the owner and the senior tech and the sole software developer and the sysadmin, and I don't charge enough. The servers exist as an add-on service for my clients, especially ones that have special needs that other hosting companies can't easily meet. They work, the stacks I built a few years ago are robust and efficient (one of my customers was featured on a popular national radio show with a reputation for killing websites, but their site stayed up and responsive the whole time). Reproducible deployments, centralized management, further improvements to automated security, etc. are all on my to-do list -- along with like a couple dozen other things.
I was supposed to rebuild all of the servers during last December, typically a slow period, but instead it was unseasonably, hair-on-fire, brain-meltingly busy, and it still hasn't let up.
> It was something like, mpm-worker no longer supported libapache2-mod-fcgid or some such thing.
No stable release distribution ever does this within a release, unless there is a security issue that cannot be fixed any other way. In this case would you prefer to have remained vulnerable?
Of course, mistakes can happen, but they would be fixed in a further regression update, or you could have even looked at fixing the bug yourself.
Based on the package names you're likely talking about Debian or Ubuntu. In both cases, you could have just downgraded the packages for a quick (albeit temporary) end to your emergency.
> I've never heard of this happening in any Linux distribution. Can you be more specific?
Not the OP but I run a personal server for mail etc., and I do remember one LTS → LTS upgrade of Ubuntu that removed the DKIM milter that my postfix config depended on.
So it happens.
> You might want to look into "DevOps". The idea is that you script your deployments together with automated tests for them
For a personal server the things that in my experience break are related to configuration files.
This can be subtle things, like some new option enabled by default that conflicts in some edge case, it can be upgrade scripts that butcher existing customized configuration files, or it can be a total restructure of how a package has structured its config.
All of this has happened to me, and I don’t think having scripts able to redeploy my server would have been of any help resolving these issues.
> ...and I do remember one LTS → LTS upgrade of Ubuntu...
That's different. Supported features do change between distribution releases. But a release upgrade is never an emergency security update; you have years to plan for it. The grandparent was referring to unexpected emergency breakages during updates within a release, which is an entirely separate thing.
I can see that, especially if someone is administering their own small site and hasn't had experience in larger shops. But at the minimum (and since it is cheap enough) you should have at least 3-5 servers (especially if you are making money off them) -- dev, test, staging, production, and failover. Just breaking the mirror to the failover box and upgrading the primary would allow for an easy and quick backout procedure.
Oh, and there is another upgrade trick (I'll have to see if I still have my old writeup on it, and post it as a Gist). You can query the package manager, and make a few lists. First, list the packages (and versions) that are installed. Secondly, get a list of any file owned by a package that has changed since package installation (compare the file MD5 sum to the package's record of same). This should be a small list of files (mostly configuration related) that can be backed up. This gives you a good way to roll back changes if needed, and keep a system documented.
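On an RPM-based box that trick comes down to something like this (the output file names are arbitrary):

    # 1) record installed packages and versions
    rpm -qa | sort > packages.txt

    # 2) list files whose checksum no longer matches what the package shipped;
    #    a '5' in rpm -Va's verify flags means the digest differs
    rpm -Va | awk '$1 ~ /5/ {print $NF}' > changed-files.txt

    # 3) back those up -- they're almost all hand-edited config
    tar czf config-backup.tar.gz -T changed-files.txt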
Finally, with Yum, you can roll back updates. Take a look at "yum history", and "yum history undo". This has saved me a couple times.
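For reference, that looks roughly like this (the transaction ID is whatever "yum history list" shows on your box):

    # list recent transactions with their IDs
    yum history list

    # see exactly what a given transaction changed
    yum history info 42

    # revert it -- yum downgrades/removes/reinstalls as needed
    yum history undo 42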
I must acknowledge that I don't have as much administration knowledge as you have; I come more from the development side, and I really "hate" that stuff. I also have to say that what you write is probably right. But I think there are some people like me, who have one or two servers running for some projects (and I know people who run websites or even servers with even less knowledge), who just want the stuff to work and don't have the time or ambition to optimize things.
Do you know if this rollback stuff is also available for Debian-based systems? Sounds pretty good. I'm afraid I don't know yum -- it's for RPM-based systems, I know -- but the last systems I used were all Ubuntu and Debian.
I haven't done a lot with Debian based systems -- I've traditionally been a Slackware / DIY / Redhat person. I just looked through the documentation and a bit of the source code for apt-get, and nothing popped out at me for a rollback feature. If I come across anything I'll update this comment for you.
But I know what you mean about hating the sysadmin side of things. Of course there are people that really love it too -- there's sort of a mindset that you have to get into, just like with development. Maybe this could be an idea for a new service -- matching up programmers with side projects, with sysadmins that are looking for side projects to help manage.
Yeah, that would really be a great idea. I have a friend whom I ask for help sometimes, but he also has only limited time.
And I lack the time to really dive into automation and that sort of thing. So many interesting tools are available, but you always have to take the time to learn them first.
> Do you know, if this rollback stuff is also available for Debian based systems?
Over half the servers I admin are running some flavor of Debian, and have for at least five years or so. To the best of my knowledge there's no rollback method for Debian updates, short of relying on filesystem tricks like using ZFS, discussed upthread.
Thank you for the information! That would really be a fine feature. I love Debian for its stability (or so it is said... I just lack the time to compare for myself), but the administration tools are still a black box for me, and I have the feeling they could use a brush-up.