I dunno, because of writeups like GitLabs? We're lucky GitLab is so transparent....

jarv · on Sept 17, 2020

> Actual analysis by the engineers of one of their migrated jobs (urgent-other sidekiq shard) says their cost went from $290/month when running in VMs to $700/month when running in Kubernetes. They tried to use auto-scaling but failed completely and ended up disabling it:

Note this is just one service out of many and has very low latency requirements, and due to our slow pod start times in this case it didn't make sense to auto-scale. We are auto-scaling for other services in K8s.

> says their cost went from $290/month when running in VMs to $700/month

These numbers represent a snapshot in time where we are overprovisioning, for this service for the migration. You are correct that cost benefit here remains to be seen for this service. We are doing things like isolating services into separate node pools which may not allow us to be as efficient.

Safer and faster deploys was a huge win for us, for this service and others. This of course is compared to our existing configuration using VMs and Chef to manage them.

disclaimer: blog post author

Keycap · on Sept 17, 2020

Its really weird that your argument is based on just the cost.

I mean you mentioned yourself you worked at google right?

Perhaps you just haven't actually experienced the issues kubernetes is solving?

How often have you seen that certificates expired? I have seen that. Plenty of times. Its not an issue creating that lets encrypt cronjob, its still something you need to do right.

Security Updates? Have you seen how many companies run with old non updated VMs?

Disk full due to logs? Yes seeing this regularly.

Memory leak on a service and someone needs to restart it manually until someone else fixes the issue? Yes!

Requesting a VM, hardware, infrastructure, getting it and the whole lifecycle management of it in the backend? Its real.

Ansible Scripts, puppet or just bash scripts and a word document to tell you how this magic machine was set up? Yepp.

Kubernetes solves all those problems.

Your static website on borg, if it sill runs, has probably still a valid certificate, is running on a secure infrastructure, is equally configured on every instance and not weird on 1 of 6 servers and just runs.

A smart person taking responsibility for all of this, costs you much more then just a few hundred bucks a month. And you need that person. With Kubernetes, this person now can manage and operate much more servers under his/her fingertips better easier and more secure then if it would have been vms.

And in my personal experience: That shit runs more stable because that shit can restart and being recreated and it just solves a handfull of shitty memory or disk full issues.

thu2111 · on Sept 18, 2020

I've been running my own Linux servers for about twenty years, so I'm not entirely unexperienced with these things.

Borg is/was great for running huge numbers of services at truly massive scale, when those services were developed entirely in house and done in the exact way it wanted services to be done. It was a terrible cost the moment you wanted to run anything third party or which wasn't written in that exact Google way, and the costs were especially high if you didn't need to handle huge traffic or data sizes.

Borg and Kubernetes don't auto-magically do sysadmin work. A ton of people work in infrastructure at Google. Software updates still need to be applied etc. In Kubernetes that means rebuilding Docker images (which in reality doesn't happen as the tools for this are poor, so you just have a lot of downlevel Ubuntu images floating around).

And worse, the whole K8s/Docker paradigm is totally backwards incompatible. At Google this didn't matter because the software stack evolved in parallel with the cluster management. Programs there expected local disk to be volatile, expected to be killed at a moment's notice, expected datacenters to appear and disappear like moles. They were written that way from scratch. But that came with a terrible price: it was basically impossible to use ordinary open source apps. You could import libraries and (carefully!) incorporate them into Borg-ized projects, but that was about it.

When this tech was reimplemented and thrown over the wall to the community, the cultural expectations that came with it didn't come with it. So I've seen situations like, "whoops, we deleted our companies private key because it was stored to local disk and then the Docker container was shut down, help!". This was even reported as a bug in the software! No, the bug is that your computer is randomly deleting critical files for no good reason, and normal software does not expect that to happen. How about the way in a Dockerfile you have to write 'apt-get update && apt-get upgrade' if you want a secure OS image when the container is rebuilt? If you put the two commands on separate lines it will appear to be working, right up until you start getting weird errors about missing files from Debian mirrors, because Docker assumes every single command you run is a pure function! And then we get to security.

Screwups seem to follow Kubernetes/Docker around like flies. The tech is complex and violates basic expectations programs have about how POSIX works. When it goes wrong, it leads to mistakes in production.

Now there have been a few trends over time:

1. Hardware has got a lot more powerful. It has been outpacing growth in the internet and economy. Many more businesses fit in a smaller number of machines than when Borg was designed (an era where 4-core systems were considered high end).

2. Cheap colo providers have been driving the cost of powerful VMs down to the ground.

3. Linux has got easier to administer.

These days setting up a bunch of Linux VMs that self-upgrade, run some services via systemd etc isn't difficult, and properly tuned such a setup should be able to serve a monster amount of traffic (watch out though for Azure, which seems to overcommit capacity pretty drastically and their VMs have very unstable performance).

Most businesses that are deploying Kubernetes today quite simply do not need a million machines. Even companies that give away complex services like GitLab, as we can see from this thread, they don't really need huge scalability. It's just nice to imagine that the business will experience explosive growth and that growth is now automated, but ironically, the effort to automate business infrastructure scaling takes away from the sort of efforts that actually grow the business.

As for the other benefits, you can write a systemd unit that is much simpler than Kubernetes configuration that will give you auto-restart, including if memory limits are hit, you can view the activity of a cluster easily using plain old SSH or something like Cockpit, it will handle log rotation for you out of the box, sandboxing likewise, and so on. And of course the unattended-upgrades package has existed for a while.

I agree you need someone to do admin work, whatever path you choose. Having used Kubernetes, and the system it is based on, and plain old Linux, my intuition is that the base cost of Kubernetes is too high for almost all its users. Too many ways to screw it up, too many ways for it to go wrong, too much time spent screwing around, and too expensive. If you become another Google or Facebook then sure, go for it. Otherwise, better avoided.

Keycap · on Sept 18, 2020

I'm not saying its not possible for small companies to run a handfull of VMs for themself.

But you reach quickly enough an size where a central managed kubernetes cluster is very versatile for your whole company.

1-2 People take care of the cluster, the other teams then use that kubernetes cluster for your build, hosting of staging envs etc.

And then you have a very small infra team which is much better able to provide those services to others internally much easier and safer.

marinj · on Sept 17, 2020

> Wrong. Actual analysis by the engineers of one of their migrated jobs (urgent-other sidekiq shard) says their cost went from $290/month when running in VMs to $700/month when running in Kubernetes.

This is only one of the shards of one service that we migrated, and we migrated many more.

I did a high level writeup of this and many other things we've observed in https://about.gitlab.com/handbook/engineering/infrastructure...

if you are curious. I am happy to have a conversation with anyone about our experiences in more detail, not to convince anyone that k8s is the "one-true-way" (because I am still not convinced myself), but of the benefits this type of change can bring.

> They tried to use auto-scaling but failed completely and ended up disabling it

That issue is just one of many though. We disabled it at the time, but we have it enabled for a number of other services.

Regardless of how I personally feel about K8s, I have to say that the migration we are doing for GitLab.com is generating set of benefits that goes far beyond just moving to a new platform.

One of the largest benefits I've seen so far is that it was a great forcing function to resolve some long running architectural challenges, and is making us think more about how the application can run at a very large scale, without being at the very large scale. Things that we could get away before, like the issue you referenced there, we can't anymore.

Disclaimer: I am one of the people involved in this migration.