> Frequent upgrades amortize the cost and ensure that regressions are caught early. No one upgrade is likely to end in disaster because there simply isn’t enough change for that to happen.
Oh, how I wish this were true.
For what it's worth, it's pretty true as far as OpenBSD is concerned, in my experience. But OpenBSD is the exception here, not the rule. Everywhere else, developers all seem to have embraced "break early, break often".
Eventually you get burned. For me, it was a routine should-have-been-minor web server update where one of the packages I relied on suddenly became unsupported and every single hosted site stopped working. Since there's no way to roll back server upgrades, I had a marathon night involving building a new server stack and migrating all hosted sites there by 8 a.m.
But you can't yell at anybody when that happens, because the answer's always the same: it's not the developers' fault.
Who really believes sysadmins wouldn't update everything all the time if they could? Old, dodgy, out-of-date servers exist exactly because updates are butthole-puckering, because everyone's been burned at least once by a "minor" update, and because once the damage is done, undoing it is horrifyingly difficult.
Many times you have to hold back software upgrades on things like MRI scanners to wait for multi-year research studies to complete, and often new studies start up in the interim, which locks you down for even more time. Scanner upgrades change all sorts of things in ways that introduce all sorts of confounds.
Not to mention that in the real world, scanner upgrades often break surprisingly fragile clinical workflows. Technically, the engineering and processing of the scanners are improved quite a bit in one aspect or another by the upgrades, but old workarounds need to be replaced by new workarounds, etc., and the documentation is very sparse and quite uninformative.
Hmm... what does an MRI scanner update do? I would have thought a system like that would just record whatever it gets from its sensors and any updates needed would only apply to the analysis and visualization software... Do the updates actually modify what the hardware does during scanning? Or, is it all painfully coupled because some sort of interactivity is required during scanning?
A single measurement works like this: you magnetize the body in a particular pattern, then watch how that magnetic pattern fades. Then do it again in another pattern and repeat. Think of the patterns as being terms in a Fourier series, which you eventually will do a Fourier transform on to get the original thing.
The name of the game, therefore, is to get away with as few measurements as possible, and to be able to perform measurements here while still recovering from measurements there. Oh, and while we're at it, let's try not to be thrown off by things that tend to move around. Like arteries do every time another heartbeat comes through.
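In textbook terms (a rough sketch, not anything vendor-specific), each readout samples the Fourier transform of the object at one spatial frequency k, and the image is recovered by inverting that transform:

    s(\mathbf{k}) = \int \rho(\mathbf{x})\, e^{-i 2\pi\, \mathbf{k}\cdot\mathbf{x}}\, d\mathbf{x}
    \qquad\Longrightarrow\qquad
    \rho(\mathbf{x}) = \int s(\mathbf{k})\, e^{+i 2\pi\, \mathbf{k}\cdot\mathbf{x}}\, d\mathbf{k}

Here rho(x) is (roughly) the spin density you want to image and s(k) is the signal measured while the gradients put you at point k in "k-space"; the pulse sequence decides which k's get sampled, in what order, and how fast.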
So yes... there is a lot of interactivity in an MRI measurement.
You may think they are simple machines which do one job and pretty much never need to be changed (not an unreasonable assumption), but that's not the case. I work in the medical device field and, well, you still have to sell instruments. To do that you need to beat the competition. To do that you need more features which make the doctor's/tech's life easier and the diagnosis more accurate.
That doesn't mean upgrading is easy. At the 510(k)/PMA level it pretty much always requires a re-filing, so you try not to do it often. But you do improve the product over time.
Probably improved image processing - noise removal, sharpness, could be a whole range of things, possibly down to something as seemingly simple as changing motor stepping for some of the actual moving parts.
"Since there's no way to roll back server upgrades"
There is, if you run a modern filesystem like ZFS or btrfs. You just take a cheap snapshot before upgrading (this can be automated) and roll back if there are problems. It even works with LVM.
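A minimal sketch with ZFS, assuming a root dataset named rpool/ROOT (dataset and snapshot names are just examples, and rolling back the filesystem you are currently running from generally also means a reboot):

    # cheap copy-on-write snapshot before touching anything
    zfs snapshot rpool/ROOT@pre-upgrade

    # ... run the upgrade, test ...

    # if it went sideways, roll the dataset back to the snapshot
    zfs rollback rpool/ROOT@pre-upgrade

    # once you're happy with the upgrade, drop the snapshot
    zfs destroy rpool/ROOT@pre-upgrade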
Rolling back an LVM snapshot involves dd'ing the data off the snapshot and onto whatever you want your production disk to be (or just running off the snapshot forever, which has... performance consequences with LVM).
Yes, LVM snapshots exist, but they are of limited utility compared to ZFS and the like.
I've been experimenting with CentOS 6 and ZFS on Linux; so far it looks pretty good. It handles failing consumer-grade hard drives vastly better than LVM on md, and snapshots are inexpensive.
While initially conceptually easier to grasp, that is far inferior to using a snapshot.
Here's a short list of ways in which that may cause you problems:
1) A gzip of a path is not point-in-time: synced files may no longer be in sync, since they were backed up at slightly different times (e.g. I hope you didn't expect database consistency to actually mean anything).
2) A gzip of a path will take a while, because it has to actually process every file (a snapshot is generally copy-on-write, meaning it's "free" (not quite) for every file until it's changed; throw away the snapshot before a change and there's no need to copy the file at all).
Or, you could try out NixOS/GuixSD, which support transactional upgrades and rollbacks for the full system. No need to take a disk image (outside of your normal backup routine, of course).
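For what it's worth, the rollback there is a first-class operation rather than a filesystem trick. Roughly (a sketch, not a full walkthrough):

    # build and activate the new configuration
    nixos-rebuild switch

    # if it misbehaves, switch back to the previous system generation
    nixos-rebuild switch --rollback

    # or list the system generations and pick one explicitly from the boot menu
    nix-env --list-generations --profile /nix/var/nix/profiles/system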
>Eventually you get burned. For me, it was a routine should-have-been-minor web server update where one of the packages I relied on suddenly became unsupported and every single hosted site stopped working. Since there's no way to roll back server upgrades, I had a marathon night involving building a new server stack and migrating all hosted sites there by 8 a.m.
The traditional way to handle this in a cluster is with testing. The idea being that you have the upstream repo, the testing repo, and the production repo.
Note, in most cases 'repo' means "directory tree served via http" and "sync" means "Copy, you know, with rsync or something" - This is not complicated.
10% of your servers point at the testing repo, the rest at the production repo.
Every X days, you set a test box against the upstream repo, update, reboot, and run your tests, you know, to catch the obvious stuff. If that works, you sync your testing repo with the upstream repo, and so 10% of your production now runs your new stuff. This is where you are gonna catch most of your problems, in my experience, but if something chokes, you only lose 10% of capacity.
After Y days of the test stuff being on 10% of your boxes, you sync the test repo to the production repo.
Of course, if you are like me, and when 10% of your boxes are down 10% of your customers are down, you want to spend a lot more effort on the 'test before you get customers on it' step.
Also, sometimes there are 'don't sleep until you roll it out everywhere' updates, like this one (oh god, I am so happy that srn is on that now and I didn't have to deal with it) or like shellshock. In that case, well, sometimes you sync the upstream repo straight to production.
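In case it helps, here's roughly what that plumbing looks like (the paths and mirror URL are invented; use whatever your web server already serves):

    # vendor -> your local "upstream" copy
    rsync -a --delete rsync://mirror.example.com/distro/updates/ /srv/repos/upstream/

    # upstream -> testing: the 10% of boxes pointed at "testing" pick this up
    rsync -a --delete /srv/repos/upstream/ /srv/repos/testing/

    # after Y days without trouble, testing -> production for everyone else
    rsync -a --delete /srv/repos/testing/ /srv/repos/production/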
Even with OpenBSD that isn't always true. The time_t fix killed binary compatibility between 5.4 and 5.5 on 32-bit systems, which meant that any installed packages had to be uninstalled and reinstalled if one were to perform an upgrade. The OpenBSD project isn't afraid to break compatibility if it means fixing a bug - something that I think is a very good thing, but it has implications in terms of support.
On another note, upgrade hell is a pretty convincing argument for systems like NixOS, where upgrades can be easily rolled back, configurations are declarative, etc.
I also cannot recommend the "frequent upgrades" model.
If systems were more stable and fewer problems occurred during updates, this would be right. But the reality is simply different. My experience is that many updates bring surprises:
- software that was once good has simply gone bad since the last version
- problems with less common software combinations that the package maintainers never found
- device drivers that do not cooperate
- desktop environments that no longer support options they used to, or have simply gotten worse
- legacy data that is not supported by newer application versions, or subtle problems that occur with this data ...
- ....
Also, if upgrades were possible without hassle and trouble, that model could work -- but they just are not. For example: I wanted to install a newer Ubuntu version on my hosted server, but the automatic upgrade process explicitly says it should not be done over a remote session. So for a hosted system, I have to fall back to a complete new install (back up the data, fresh install, complete new configuration, restore the backed-up data).
Also, when you get into trouble, it is not possible to easily go back to the last stable version (this is a situation where virtualized systems are very useful).
In such a case, it is clear that I don't want to give up my life just to always have the newest stuff on my server.
I regret that OpenBSD is not for me (unless OpenBSD never suffers from those troubles).
It's worth mentioning that OpenBSD doesn't support a lot of the hardware people use, doesn't support a lot of the applications people use, and is a fairly specialist distro. They do some awesome things, but they also benefit greatly from not having to support regular non-techie end-users.
Edit: As an example, the upgrade process that you have to do at least once a year[1] is hardly something that a tech naif would find painless.
And what hardware does it not support? And what applications does it not support? I'm tired of this meme. OpenBSD runs on virtually everything you throw at it, and virtually all open source software is supported.
In some cases OpenBSD hardware support is better. E.g. I have a few laptops where suspend only works in OpenBSD, but not Linux. Also a few years back my WiFi cards were supported natively only in OpenBSD, not Linux. For other hardware, it might be the other way around, but overall it's pretty unlikely to find something OpenBSD does not run on.
And for software, it's exactly the same. Sure, there's some Linux-specific software out there, but the bulk of it is not.
Hipchat and Skype were two popular applications that I used today that don't run on OpenBSD. I ran them on my laptop with an nvidia GPU, which isn't wonderfully supported by Linux, but it's even less supported by OpenBSD. Nvidia is a pretty damn common brand. Flash, for all its sins, is still popular and not supported properly. Steam is another popular application that doesn't work on OpenBSD. Admittedly you said open source, but frankly, I didn't. I said 'applications people use'. It's utter bullshit to move the goalposts and then chide me for inaccuracy.
Then there's virtualisation software in general, which is increasingly popular and widespread, which OpenBSD doesn't support well, if at all (kvm, xen, vmware, virtualbox, and friends. qemu by itself is s.l.o.w.). Docker and other containers are really taking off at the moment and have a lot of mindspace, though admittedly these are linux-specific. Openstack is another significant emerging bit of software that doesn't support BSD as a host.
Then there's plenty of stuff like this http://blog.lxde.org/?p=1111 where OpenBSD could be better supported but is a broken experience. Legitimate reason, sure (not enough eyes), but it's still a broken experience.
I worked at a firm that supported two main branches that had diverged for more than 5 years. It was a nightmare fixing both simultaneously, especially once the fork was 5+ years old. In the end they wound up breaking something with every new feature.
> Eventually you get burned. For me, it was a routine should-have-been-minor web server update where one of the packages I relied on suddenly became unsupported and every single hosted site stopped working.
I'm not wishing to be inflammatory here, but surely you could have exactly the same problem with something that is updated 'slowly'. If you're relying on the software you get being 100% correct/bug free every time you get it, you're building your house on sand.
This is where having testing (ideally a good set of automated testing) is invaluable. Having a robust set of tests you run before you roll out changes to 'production' is important if your 'production' is "we can't afford for this to go down."
Along with testing, you need to have a backout strategy. What do you do if it all goes wrong? This is usually very similar to your backup/restore strategy, as the problem is generally the same: if your server gets hosed/breaks/fails, what do you do?
>> Frequent upgrades amortize the cost and ensure that regressions are caught early. No one upgrade is likely to end in disaster because there simply isn’t enough change for that to happen.
> Oh, how I wish this were true.
Your anecdote is even more evidence that it _is_ true. Most minor upgrades do not end in disaster; only a few do.
Frequent updates still require some diligence on the part of the person or organization updating.
> Since there's no way to roll back server upgrades,
Apart from the obvious comment about snapshots (at the volume or machine level), really most of the time your problems are because of a specific package.
And while most package managers don't support downgrades per se, you can just remove the offending package and install the old one. Nine times out of ten that would give you more time to fix the problem.
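On Debian/Ubuntu, for instance, the manual fix can be as simple as installing the old version explicitly (the package name and version here are placeholders, and the old version has to still be available in a repo or in /var/cache/apt/archives):

    # see which versions are still available
    apt-cache policy somepackage

    # install the older version explicitly; apt treats this as a downgrade
    apt-get install somepackage=1.2.3-4

    # optionally hold it so the next upgrade doesn't pull the broken version back in
    apt-mark hold somepackage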
That said, you need to test before you go to production, no matter how trivial the patch may seem. Staged rollouts, a separate environment, or both.
And that's also the elephant in the room in Ted's rant. Sure, if we didn't have to test anything, we could update every 6 months. But we have to, and we can't.
> I had a marathon night involving building a new server stack and migrating all hosted sites there by 8 a.m.
This is because of a shortcut that somebody took when building your server originally, by failing to make the deployment reproducible.
> ...where one of the packages I relied on suddenly became unsupported
I've never heard of this happening in any Linux distribution. Can you be more specific? Did you choose to use some third party source for a package here? If so, how can you expect your distribution's developers to support you going off-piste in a way that they never claimed to support in the first place?
> Old, dodgy, out-of-date servers exist exactly because updates are butthole-puckering, because everyone's been burned at least once by a "minor" update, and because once the damage is done, undoing it is horrifyingly difficult.
You might want to look into "DevOps". The idea is that you script your deployments together with automated tests for them. There are many tools to help you do this now. With this in place, nothing you've stated is true any more.
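Even a crude version of this pays off. A sketch of the idea (the package list, config path, and sites file are all invented for illustration):

    #!/bin/sh -e
    # rebuild a web box from a known-good package list, then smoke-test it
    xargs -a packages.txt apt-get install -y   # packages.txt: your pinned package list
    rsync -a conf/apache2/ /etc/apache2/       # config kept in version control
    service apache2 restart

    # smoke test: every hosted site must answer
    while read site; do
        curl -fsS -o /dev/null "http://$site/" || { echo "FAILED: $site"; exit 1; }
    done < sites.txt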
> I've never heard of this happening in any Linux distribution. Can you be more specific?
It was a couple of years ago. I know somewhere I have notes on it, but I can't find them just this minute. I remember that it had something to do with one of the components in my apache-mpm-worker--php5--libapache2-mod-fcgid--apache2-suexec-custom--libapache-mod-security stack. It was something like, mpm-worker no longer supported libapache2-mod-fcgid or some such thing. At the time, I was really good about doing regular server updates, so when it happened, I spent some time researching it but eventually found the package had been removed during the update and was no longer supported, with no workaround aside from finding a new way to build an apache server.
Either it got fixed or I'm running a slightly different stack now than I was at the time. Sorry I can't be more helpful.
> This is because of a shortcut that somebody took...
> You might want to look into "DevOps"...
That somebody was me, and I'm aware of devops. Pretty big fan of it actually. Unfortunately, I'm just a small MSP, the owner and the senior tech and the sole software developer and the sysadmin, and I don't charge enough. The servers exist as an add-on service for my clients, especially ones that have special needs that other hosting companies can't easily meet. They work, the stacks I built a few years ago are robust and efficient (one of my customers was featured on a popular national radio show with a reputation for killing websites, but their site stayed up and responsive the whole time). Reproducible deployments, centralized management, further improvements to automated security, etc. are all on my to-do list -- along with like a couple dozen other things.
I was supposed to rebuild all of the servers during last December, typically a slow period, but instead it was unseasonably, hair-on-fire, brain-meltingly busy, and it still hasn't let up.
> It was something like, mpm-worker no longer supported libapache2-mod-fcgid or some such thing.
No stable release distribution ever does this within a release, unless there is a security issue that cannot be fixed any other way. In this case would you prefer to have remained vulnerable?
Of course, mistakes can happen, but they would be fixed in a further regression update, or you could have even looked at fixing the bug yourself.
Based on the package names you're likely talking about Debian or Ubuntu. In both cases, you could have just downgraded the packages for a quick (albeit temporary) end to your emergency.
> I've never heard of this happening in any Linux distribution. Can you be more specific?
Not the OP but I run a personal server for mail etc., and I do remember one LTS → LTS upgrade of Ubuntu that removed the DKIM milter that my postfix config depended on.
So it happens.
> You might want to look into "DevOps". The idea is that you script your deployments together with automated tests for them
For a personal server the things that in my experience break are related to configuration files.
This can be subtle things, like some new option enabled by default that conflicts in some edge case, it can be upgrade scripts that butcher existing customized configuration files, or it can be a total restructure of how a package has structured its config.
All of this has happened to me, and I don’t think having scripts able to redeploy my server would have been of any help resolving these issues.
> ...and I do remember one LTS → LTS upgrade of Ubuntu...
That's different. Supported features do change between distribution releases. But a release upgrade is never an emergency security update; you have years to plan for it. The grandparent was referring to unexpected emergency breakages during updates within a release, which is an entirely separate thing.
I can see that, especially if someone is administering their own small site and hasn't had experience in larger shops. But at the minimum (and since it is cheap enough) you should have at least 3-5 servers (especially if you are making money off them) -- dev, test, staging, production, and failover. Just breaking the mirror to the failover box and upgrading the primary would allow for an easy and quick backout procedure.
Oh, and there is another upgrade trick (I'll have to see if I still have my old writeup on it, and post it as a Gist). You can query the package manager, and make a few lists. First, list the packages (and versions) that are installed. Secondly, get a list of any file owned by a package that has changed since package installation (compare the file MD5 sum to the package's record of same). This should be a small list of files (mostly configuration related) that can be backed up. This gives you a good way to roll back changes if needed, and keep a system documented.
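On an RPM-based box that trick comes down to something like this (the output file names are arbitrary):

    # 1) record installed packages and versions
    rpm -qa | sort > packages.txt

    # 2) list files whose checksum no longer matches what the package shipped;
    #    a '5' in rpm -Va's verify flags means the digest differs
    rpm -Va | awk '$1 ~ /5/ {print $NF}' > changed-files.txt

    # 3) back those up -- they're almost all hand-edited config
    tar czf config-backup.tar.gz -T changed-files.txt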
Finally, with Yum, you can roll back updates. Take a look at "yum history", and "yum history undo". This has saved me a couple times.
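For reference, that looks roughly like this (the transaction ID is whatever "yum history list" shows on your box):

    # list recent transactions with their IDs
    yum history list

    # see exactly what a given transaction changed
    yum history info 42

    # revert it -- yum downgrades/removes/reinstalls as needed
    yum history undo 42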
I must acknowledge that I don't have as much administration knowledge as you have; I come more from the development side, and I really "hate" that stuff. I also have to say that what you write is probably right. But I think there are some people like me, who have one or two servers running for some projects (and I know people who run websites or even servers with even less knowledge), who just want the stuff to work and don't have the time or ambition to optimize things.
Do you know if this rollback stuff is also available for Debian-based systems? Sounds pretty good. I'm afraid I don't know yum -- it's for RPM-based systems, I know -- but the last systems I used were all Ubuntu and Debian.
I haven't done a lot with Debian based systems -- I've traditionally been a Slackware / DIY / Redhat person. I just looked through the documentation and a bit of the source code for apt-get, and nothing popped out at me for a rollback feature. If I come across anything I'll update this comment for you.
But I know what you mean about hating the sysadmin side of things. Of course there are people that really love it too -- there's sort of a mindset that you have to get into, just like with development. Maybe this could be an idea for a new service -- matching up programmers with side projects, with sysadmins that are looking for side projects to help manage.
Yeah, that would really be a great idea. I have a friend whom I ask for help sometimes, but he also has only limited time.
And I lack the time to really dive into automation and that sort of thing. So many interesting tools are available, but you always have to take the time to learn them first.
> Do you know, if this rollback stuff is also available for Debian based systems?
Over half the servers I admin are running some flavor of Debian, and have for at least five years or so. To the best of my knowledge there's no rollback method for Debian updates, short of relying on filesystem tricks like using ZFS, discussed upthread.
Thank you for the information! That would really be a fine feature. I love Debian for its stability (or so it is said... I just lack the time to compare for myself), but the administration tools are still a black box for me, and I have the feeling they could use a brush-up.