You don't. That's the point of volumes. You do need to be careful to ensure you mount volumes for everything that needs to persist, but in practice that's not a very onerous limitation.
Since volumes are bind-mounted from the outside, you can put the volumes on whatever storage pool you want that you can bring up on the host (as long as it meets your app's requirements, e.g. POSIX semantics, locking, etc.).
E.g. at work we have a private Docker repository that runs on top of GlusterFS volumes that are mounted on the host and then bind-mounted in as a volume in the container.
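Something along these lines, for anyone curious (hostnames, volume names and paths are made up, and the data path depends on which registry image you run):

    # On the host: mount the GlusterFS volume somewhere convenient
    mount -t glusterfs gluster1:/registry-data /mnt/registry-data

    # Then bind-mount that directory into the registry container as a volume
    docker run -d -p 5000:5000 \
        -v /mnt/registry-data:/var/lib/registry \
        registry:2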
I also run my postgres instances inside Docker, with the data on persistent volumes.
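Roughly like this (the host path is just an example; /var/lib/postgresql/data is where the official postgres image keeps its data, adjust if yours differs):

    docker run -d --name pg \
        -v /srv/pgdata:/var/lib/postgresql/data \
        postgres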
We could certainly use better tools to manage it, though.
Another pattern I ought to have mentioned (that I've used myself) is to set up "empty" containers whose only purpose is to act as a storage volume for another container. I don't like that as much, mostly since I've not had as much long-term experience with how the layering Docker uses would impact it.
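For reference, that pattern looks roughly like this (container names and the image are made up):

    # A "data-only" container that never runs; it exists solely to own the volume
    docker create -v /var/lib/postgresql/data --name pgdata busybox true

    # The actual database container borrows the volume from it
    docker run -d --name pg --volumes-from pgdata postgres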
Having persistent containers strikes me as the infrastructural equivalent of a code smell. I don't use in-container storage for anything that needs to be persisted at all, and I'm not sure why you would in a modern environment. Everything can fail and fail hard, and writing meaningful (i.e., volume'd) data to disk seems like asking for trouble. Ephemeral containers just seem to fit the model of Docker's capabilities much better than maintaining state.
The exception to this would be, I guess, that you could wedge in Docker containers for database server isolation or something, but my databases don't run on multi-tenant instances, so there isn't a huge win to it. (I use RDS most of the time; let somebody else manage that problem.)
> Having persistent containers strikes me as the infrastructural equivalent of a code smell.
Which is why pretty much all advice regarding Docker is to use volumes, so that the persistent data is managed from outside the container, and the container itself can be discarded at will without affecting the data volume.
My preferred method is bind-mounted volumes from the host. They are not part of the containers, and the purpose is exactly to remove the need for persistent containers. A lot of the examples I gave in my article rely heavily on this.
This leaves the form of persistence up to the administrator. And of course that means you can do stupid things, or not. On my home server that means a mirrored pair of drives combined with regular snapshots to a third disk + nightly offsite backups. At work, we're increasingly using GlusterFS on top of RAID6, so we can lose multiple drives per server, or whole servers, before the cluster is in jeopardy (and even then we have regular offsite snapshots throughout the day + nightly backups).
If you are referring to the pattern of creating "empty" containers to act as storage volumes, then I sort-of agree with you, but mostly because of the maturity of Docker. After all, nothing stops you from putting the Docker storage itself on equally safeguarded storage. It's not really the risk of losing storage that makes me prefer stateless containers, but that separating the application from its data substantially reduces the data volume that needs to be secured (since we can spin up new stateless containers in seconds, we only really care about preventing loss of the persistent data volumes).
"Persistent containers" is the wrong term. "On-host storage" might be a better one. I don't store anything on my compute nodes. Everything is in an HA datastore or in S3. I kind of feel like architectures that rely on storing any data you can't immediately blow away without shedding a tear are, in modern environments, dangerous, and volumes seem (to me) to lead to the kind of non-fault-tolerant statefulness that'll bite you in the end.
I respect GlusterFS, so it sounds reasonable in your own use case, but it still makes me intensely uncomfortable to have apps managing their own data. I try to build systems where each component does one thing well, and for the components that I think work well in a Docker container, data storage is then kind of out-of-scope. YMMV.
We decided not to use docker containers for our postgres DB in production; volume mounts just don't make me sleep easy. We use an S3-backed private docker registry to store our images/repositories.
What is your issue with volumes? Bind mounts have been battle-tested over many years. For example, I have production Gluster volumes bind-mounted into LXC containers that have been running uninterrupted for 5+ years.
Nothing wrong with mounting volumes, but certainly not for production DBs, at least not at this point. I don't have much insight to share on this, besides my intuition based on several years working on HA systems. The ClusterHQ folks are working on this problem though, and I am following keenly.
Docker volumes are just bind-mounted directories. Of all the things you should worry about when running HA systems, bind mounts are definitely not one of them. They are essentially overhead-free and completely stable. This is completely orthogonal to Docker - after setting up the bind mounts, it gets completely out of the way of the critical path.
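A quick way to see what that means in practice (paths made up):

    # Anything written on the host side...
    echo hello > /srv/appdata/test.txt

    # ...is immediately visible inside the container, because it is the very
    # same directory bind-mounted in - no copy, no extra layer:
    docker run --rm -v /srv/appdata:/data busybox cat /data/test.txt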
We have definitely used them at large scale in production for several years with no issues.
Thanks shykes for chiming in on this. We use docker heavily for our applications in production (and all other envs), and would absolutely like to extend that to our Postgres DBs. Are there any examples you can point us to with a pattern for Postgres HA clusters with docker? Thanks.
With docker specifically, I've had frustrations with the user permissions on the volume between {pg container, data container, host OS}, with extra trouble when osx/boot2docker is added to the stack.
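For context, the friction is that the bind-mounted directory on the host has to be owned by whatever UID the image runs as, e.g. (UID and paths are examples, verify against your image):

    # Find the UID the postgres image runs as (999 in many builds of the
    # official image, but check rather than trust this number)
    docker run --rm postgres id postgres

    # Make the host-side data directory owned by that UID
    sudo chown -R 999:999 /srv/pgdata
    docker run -d -v /srv/pgdata:/var/lib/postgresql/data postgres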
Also, docker doesn't add as much value for something like postgres that likely lives on its own machine.
> I've had frustrations with the user permissions on the volume between {pg container, data container, host OS}
Then don't use data containers. I don't see much benefit from that either. The stuff we put on data volumes is stuff we want to manage the availability of very carefully, so I prefer more direct control.
And so when I use volumes it's always bind mounts from the host. Some of them are local disk, some of them are network filesystems.
We have some Gluster volumes that are exported from Docker containers, which import the raw storage via bind mounts from their respective hosts; those volumes are then mounted on other hosts and bind-mounted into our other containers, just to make things convoluted. It works great for high availability. (I'm not recommending Gluster for Postgres, btw.; it "should" work with the right config, but I wouldn't dare without very, very extensive testing; nothing specific to Gluster, I'm just generally terrified of databases on distributed filesystems.)
> for something like postgres that likely lives on its own machine.
We usually colocate all our postgres instances with other stuff. There's generally a huge discrepancy between the most cost-effective amount of storage/iops, RAM and processing power if you're aiming for high density colocation, so it's far cheaper for us that way.