The etcd operator: Simplify etcd cluster configuration and management

darren0 · on Nov 3, 2016

I would love to understand the design rational on why a custom controller is needed to run etcd as opposed to leveraging the existing k8s constructs such as replicasets or petsets. While this is a very useful piece of technology it gives me the wrong impression that if you want to run a persistent or "more complicated" work load then you must develop a significant amount of code for that to work on k8s. Which I don't believe is the case, which I why I'm asking why this route was chosen.

theptip · on Nov 3, 2016

The FAQ at the end of the OP addresses this:

"Q: How is this different than StatefulSets (previously PetSets)?

A: StatefulSets are designed to enable support in Kubernetes for applications that require the cluster to give them "stateful resources" like static IPs and storage. Applications that need this more stateful deployment model still need Operator automation to alert and act on failure, backup, or reconfigure. So, an Operator for applications needing these deployment properties and could use StatefulSets instead of leveraging ReplicaSets or Deployments."

There is inevitably some app-specific logic required to modify a complex stateful deployment; the Operator encapsulates this logic so that the external interface is a simple config file.

philips · on Nov 3, 2016

Darren, the FAQ is at the overview post here: https://coreos.com/blog/introducing-operators.html

darren0 · on Nov 3, 2016

Thanks, but this alludes to more operators coming. Why is such a thing really needed. The proliferation of this approach seems like a potential downfall of k8s. Similar to Mesos where a framework is quite powerful, the cost of developing one is too high. This blog post basically implies that k8s base constructs can't run postgres, redis, prometheus, etcd, cassandra, etc. But why? Are we saying that stateful services fundamentally require one off domain specific logic to run in k8s?

philips · on Nov 4, 2016

I don't think stateful applications require the use of something like an Operator. It really just comes down to where the state lives. For example if you are wanting to run your database on top of an EBS, SAN, or something like that it is no problems to just throw up a StatefulSet and go for it.

However, if you start to think about things like orchestrating scaling of databases that have a administrative tool like Cassandra, Vitess, RethinkDB, read-replicated Postgres, etc you need to have some glue/gel that integrates those admin tools with the Kubernetes APIs. And that is what a minimal Operator should do.

The other thing Operators can do is glue existing software to the Kubernetes APIs. Which is what the Prometheus Operator does. Prometheus has its own configuration system for finding monitoring targets; instead of forcing the user to drop down into that different format the Operator adapts Kubernetes concepts like label queries and generates the equivalent Prometheus config.

Overall I don't think this is required for every application. Static databases persisting to shared storage, stateless applications, or cluster wide daemons all fit nicely in Kubernetes abstractions before Operators. But, there is a class of clustered applications that are served well by this pattern.

tristanz · on Nov 4, 2016

I think philips is exactly right. The whole design of Kubernetes is geared toward allowing users write their own controllers for advanced use cases. You could view something like Jenkins or Vitess as controllers because they spawn Kubernetes pods on demand. The beauty is Kubernetes gives you great primitives, so you often will be controlling these objects, not the underlying pods. Of course many simple applications don't need a controller, although I suspect more and more simple use cases will be managed by an external controller like Helm which orchestrates the lifecycle of applications.

The questions is: Why isn't this just called a controller? What's this new term Operator?

philips · on Nov 4, 2016

Controller wasn't quite the term to capture the combination of an application specific controller and third party resource to manage a collection of user created application instances.

So, we arrived at Operator. It felt like a good term that we could put after X that encapsulates the intent of the pattern. It helps you operate instances of an application.

invisible · on Nov 4, 2016

The reason is that while replication controllers and daemonsets fulfill making sure there are enough of xyz pod, they do not ensure that replication is set up between those pods. For example, in redis one must do `slaveof` for slaves if implementing basic replication. If the master goes down, a new master must be appointed and then `slaveof` configured on afterward. That in itself requires either a) an operator or b) concensus among enough nodes to reach a quorom about these.

So while kubernetes could be the authority (via the master), generally it's only meant for scheduling pods. There would have to be some better guarantees about master availability if it were to control these things.

Terretta · on Nov 4, 2016

"An Operator represents human operational knowledge in software, to reliably manage an application."

Loving this.

SEJeff · on Nov 4, 2016

Do you ever see this pattern being used to host/run the etcd cluster for the k8s apiserver to speak with? That would be quite meta (and kind of amazing if it would actually work).

philips · on Nov 4, 2016

This is in progress! We have it prototyped out in bootkube. This is the plan. :)

SEJeff · on Nov 4, 2016

Oh this is absolutely brilliant! So with the kubelet + systemd, you more or less have self managed everything. Shutup and take my money :)

hatred · on Nov 3, 2016

The concept of custom controllers looks similar to what schedulers are in Mesos. It's nice to see the two communities taking a leaf out of each other's books e.g., Mesos would introduce experimental support for task groups (aka Pods) in 1.1.

Disclaimer: I work at Mesosphere on Mesos.

ideal0227 · on Nov 4, 2016

Yea. They are similar in functionality.

But they work differently. The operator does not really “schedule” containers. It finishes the controlling logic by using Kubernetes APIs. For example, it uses native Kubernetes health checking, service discovery, deployment. It works completely on top of Kubernetes API, so no specialized scheduler, executor or proxy are needed comparing to https://github.com/mesosphere/etcd-mesos/blob/master/docs/ar....

The advantages of Mesos is exposing lower level APIs and resources to allow more control. The etcd operator we built does not really need that. Building this kind of application operator may be simpler on k8s than on native Mesos.

Disclaimer: I work at CoreOS on Kubernetes and etcd.

ex3ndr · on Nov 4, 2016

Can someone clarify some points?

* Isn't etcd2 is required to start kubernetes? I found that if etcd2 is not helaty or connection is just temporary lost then k8s just freezes it's scheduling and API. So what if Operator and etcd2 is working on one node and it is down? Also i found that etcd2 also freezes event when one node is down. Isn't it unrecoverable situation?

* k8s/coreos manual recommends to have etcd2 servers not that far from each other mostly because it have very strict requirements about networks (ping 5ms or so) that for some pairs of servers couldn't work well.

* What if we will lost ALL nodes and it will create almost new cluster from backups, but what if we will need to restore latest version (not 30 mins ago)?

philips · on Nov 4, 2016

1) Yes, Kubernetes relies on etcd as its primary database. Right now the etcd Operator does not tackle trying to manage the etcd that Kubernetes relies on. But! We are working on that as part of our self-hosted work https://coreos.com/blog/self-hosted-kubernetes.html. Stay tuned.

2) etcd can deal with any latency up to seconds long for say a globally replicated etcd. But! You need to tune etcd to expect that latency so it doesn't trigger a leader election. See the tuning guide: https://coreos.com/etcd/docs/latest/tuning.html

3) The backups are something that we are just getting to with the etcd Operator. Our intention is to help you create backups and create new clusters from arbitrarily old backups, but that work hasn't started yet.

jbpetersen · on Nov 3, 2016

Being someone who's been getting more familiar lately with backend engineering and has been trying to make sense of various options, I've got a strong enough impression of CoreOS that I'm betting my time it'll be dominating the next few years.

I also can't wait to see an open version of AWS Lambda / Google Functions appear.

justincormack · on Nov 3, 2016

IBM has already launched openwhisk, and there are some other open lambda implementations.

duaneb · on Nov 3, 2016

There are already lambda implementations available; I can't speak to google functions.

jbpetersen · on Nov 4, 2016

Is there a significant difference?

russell_h · on Nov 3, 2016

I've been thinking about implementing a custom controller that would use Third Party Resources as a way to install and manage an application on top of Kubernetes. The way that Kuberetes controllers work (watching a declarative configuration, and "making it so") seems like a great fit for the problem.

Its exciting to see CoreOS working in the same direction - this looks much more elegant than what I would have hacked up.

theptip · on Nov 4, 2016

I've been thinking the same way; the k8s Third Party Resource API really enables some clever solutions.

While most k8s users are (from what I can tell) currently writing YAML config files and loading them by hand (encouraged by tools like Helm and Spread), I think that the k8s apps of the future will be more like the Operator;

1) The 'deploy scripts' are controllers that run in your k8s cluster and dynamically ensure the rest of your code is running, and the primitives that you operate on will be your custom ThirdPartyResources.

2) All of the config for your app is wrapped in a domain-specific k8s object spec; instead of writing a YAML file and uploading it as a raw Deployment, you would create a FooService API object with just the parameters that you actually care about for configuring your service.

Right now it's a pain and a lot of code (>10kloc of Go for the etcd-operator!) but I'm sure that a bunch of that could be abstracted out into a framework that makes it easy to generate/build operators for a variety of application use-cases.

Currently the solutions that build and deploy your code for you in k8s seem to be PaaS replacements (Deis, Openshift), which take a very generic approach to bundling your code. That's probably going to work for common use-cases, but I suspect the more bespoke deployments will need something more like the Operator approach, and I'm looking forward to seeing what tooling evolves in this area.

dantiberian · on Nov 3, 2016

This sounds a lot like Joyent's Autopilot Pattern (http://autopilotpattern.io), but will be more integrated with Kubernetes, rather than being agnostic.

doublerebel · on Nov 3, 2016

Thanks, I remember seeing the autopilot pattern mentioned on Joyent's blog, but haven't seen that website. The lifecycle [0] looks remarkably similar to the build and deployment steps outlined in Distelli's manifest [1]. I use Distelli+Consul on Joyent so I suppose I've been doing the autopilot pattern without realizing it!

I know that much of Distelli's workflow comes from the founders' experience at AWS, so I wonder where the root of this pattern lies. Perhaps that would help unify these similar methods.

[0]: http://autopilotpattern.io/#how-do-we-do-it

[1]: https://www.distelli.com/docs/manifest/deployment-types

0x74696d · on Nov 3, 2016

I'm the lead developer for Joyent of ContainerPilot, which is the tool at the core of our Autopilot Pattern implementation examples. The lifecycle events you recognize in Distelli are definitely similar. And Chef's new tool Habitat has a supervisor that was independently developed but ended up having interesting parallels with ContainerPilot. So there's a universal idea lurking under there, which is why we called Autopilot a "Pattern" rather than a tool in itself.

But it's not clear to me from a casual glance at the the docs whether Distelli lives inside the container during those hooks? That's part of the distinction of the Autopilot Pattern is making the higher-level orchestration layer as thin as possible.

(As far as the root, some of it is derived from my experiences as a perhaps-foolishly-early adopter of Docker in prod at my previous gig at a streaming media startup. The rest is derived from both principals with which Joyent's own Triton infra is built and our experiences speaking with enterprise devs and ops teams.)

arrsingh · on Nov 4, 2016

I'm the founder at Distelli and I just want to clarify that the Distelli agent doesn't typically live inside the container though it can. Its used to orchestrate the container lifecycle on the VM itself.

However if you're building Docker containers and deploying them we recommend using Kubernetes which is something that Distelli supports out of the box now - https://www.distelli.com

doublerebel · on Nov 4, 2016

Thanks for sharing those details, great to have more insight into the process.

In my case the Distelli agent does live inside the "container", because I'm using SmartOS instances and not Docker containers. It handles deployment, and monitors processes of the apps when I'm not using an SMF.

I'm not sure how Distelli's K8s orchestration works, that functionality is more recent. In my case, the lifecycle details are in the manifest in the app repo, which is just a YAML where each lifecycle section is a bash script. App builds are just tarballs in S3. So there's not much to the deployment process.

adieu · on Nov 3, 2016

This is great news. We developed an internal controller managing etcd cluster used by kubernetes apiserver using third party resource too. The control loop design pattern works really well.

why-el · on Nov 4, 2016

Somewhat unrelated, but I am just curious. For those who use etcd (and this is coming from a place of ignorance), does the key layout (which the keys are currently stored, how they are structured) get out of hand? Meaning, does it get to a place where a dev working with etcd might not have an idea about what in etcd at any given time? Or do teams force some kind of policy (in documentation or code) that everyone must respect?

I am asking because I was in situation where I was introduced to other key-value stores, and because the team working with them is big and no process was followed to group all keys in one place, it was hard to know "what is in the store" at any moment, short of exhausting all the entry points in the code.

NegatioN · on Nov 4, 2016

I see it mentioned in the article that they have created a tool similar to Chaos Monkey for k8s, but I don't see any resources linking to it.

Will this at some point be available publically? Although k8s ensures pods are rescheduled, many applications do not handle it well, so I think a lot of teams can benefit from having something like that.

ideal0227 · on Nov 4, 2016

The "Chaos Monkey" lives inside the project as a sub-pkg right now: https://github.com/coreos/etcd-operator/tree/master/pkg/chao....

We plan to make it a separate project once we feel good about its functionality and reliability.

If you have any potential use case, requirement in mind, please tell us. :)

hosh · on Nov 3, 2016

This is brilliant. It's like the promise-theory-based convergence tools (CFEngine, Puppet, Chef) on top of K8S primitives. Better yet, the extension works like other K8S addons -- you start it up by scheduling the controller pod. That means potentially, I could use it in say, GKE, which I might not have direct control over the kube-master.

I wonder if it is leverging PetSets. I also wonder how this overlaps or plays with Deis's Helm project.

I'm looking forward to some things implemented like this: Kafka/Zookeeper, PostgreSQL, Mongodb, Vault, to name a few.

I also wonder it means something like Chef could be retooled as a K8S controller.

philips · on Nov 4, 2016

All of your questions are answered in the FAQ section of the overview post: https://coreos.com/blog/introducing-operators.html

hosh · on Nov 4, 2016

I don't think my specific questions are answered by FAQ on that page.

The only answer I found that addresses one part of what I'm wondering about is "How is this different from configuration management like Puppet or Chef?" However, I did not ask that question.

If you read some of the Mike Burgess's "Promise Theory: Principles and Applications", you'll realize that Operators (and Kubernetes controllers for that matter) are applications and implementations of specific parts of the Promise Theory. This idea that Puppet or Chef is "configuration management" is a story sold to non-technical people. I would argue that Operators may be a _better_ application of Promise Theory than previous-generation tools.

Puppet or Chef running as a Kubernetes controller might be able to twiddle things. It's not exactly a great fit because both would be calling each respective servers rather than using Third Party Resources on the kube master (and such, unwieldy) The DSL in each would have to be extended for things useful for controlling a Kubernetes cluster, but once in place, it can do exactly what that etcd Operator does: converge on the desired state by managing memberships and doing cleanups.

Don't get me wrong: I like the CoreOS technology as well as Kubernetes. I've deployed on CoreOS and Kubernetes before. I get that companies have a responsibility to control the story and the messaging ... but what I am asking are questions that are bigger than any single technology or company, and I like to make up my own mind about things.

otterley · on Nov 4, 2016

Where is the functional specification for an Operator? It sounds like a K8S primitive; is that in fact true? If not, why does this post make it sound like one?