Thanks so much for doing this! I really like to read about interesting engineering problems that are encountered at scale and the way engineers think about them and come up with solutions.
Probably tangential, but I wanted to know your thoughts on the case we often see where interesting designs are made public without the actual implementation, e.g. the MapReduce paper -> Apache Hadoop, Dynamo -> Cassandra (and a few others), Spanner -> CockroachDB, Borg -> K8s, etc.
Having worked at corporations most of my life, I know that internal systems often have a lot of internal dependencies which makes it really hard to open source them easily; often a refactor is more expensive than writing from scratch.
I can see it both ways; just wondering as the author, what your perspectives might be :).
Very cool. I haven't read the whole paper yet, but from a quick overview it seems somewhat similar to SLOG (which also deals with world-scale replication by trying to keep data closer to the nodes that use it):
Very interesting, I hadn't seen SLOG before. It seems like there's a fairly similar core insight: placing data near where it's used can help with some system properties. They appear to be more latency-focused and we were (primarily) aiming at availability.
The other different part of Physalia is our focus (again, for availability) on placement for 'blast radius'. That means we try to limit the number of cells that any one failure (software, infrastructure, etc.) can touch. Geo-replicated systems can have similar concerns, but I haven't seen the same level of focus on it as a key design goal.
Thought for sure this'd be a thinkpiece on Excel in the enterprise ...
Seriously, though, this whole paper uses an amazing amount of terminology - blast radius, colony, color, game day, split brain - and an awesome biological metaphor of the Portuguese man o'war.
Great read even if you don't care about fault tolerance, CAP theorem, or distributed balancing at AWS-scale.
One sample quote of the value of cheap heuristics over full-blown number-crunching:
> Globally optimizing the placement of Physalia volumes is not feasible for two reasons, one is that it’s a non-convex optimization problem across huge numbers of variables, the other is that it needs to be done online because volumes and cells come and go at a high rate in our production environment. Figure 11 shows the results of using one very rough placement heuristic: a sort of bubble sort which swaps nodes between two cells at random if doing so would improve locality. In this simulation, we considered 20 candidates per cell. Even with this simplistic and cheap approach to placement, Physalia is able to offer significantly (up to 4x) reduced probability of losing availability.
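To make the quoted heuristic concrete, here is a rough Python sketch of a random pairwise-swap placement loop. It is not the paper's simulation: the 1-D node positions, the `cell_cost` distance metric, and the names `improve_placement`, `candidates`, and `rounds` are all assumptions made for illustration, and reading "20 candidates per cell" as 20 swap attempts per chosen pair is only one possible interpretation.

```python
import random


def cell_cost(client_pos, nodes):
    """Hypothetical locality cost: total distance from a cell's client to its nodes."""
    return sum(abs(client_pos - n) for n in nodes)


def total_cost(clients, cells):
    """Sum of the (assumed) locality cost over all cells."""
    return sum(cell_cost(c, nodes) for c, nodes in zip(clients, cells))


def improve_placement(clients, cells, candidates=20, rounds=1000, seed=0):
    """Randomly pick two cells and keep node swaps only if they improve locality.

    clients[i] is the (assumed 1-D) position of the client served by cells[i];
    cells[i] is the list of node positions currently assigned to that cell.
    Returns a new placement; the input lists are not modified.
    """
    rng = random.Random(seed)
    cells = [list(c) for c in cells]  # work on a copy
    for _ in range(rounds):
        a, b = rng.sample(range(len(cells)), 2)
        for _ in range(candidates):
            i = rng.randrange(len(cells[a]))
            j = rng.randrange(len(cells[b]))
            before = cell_cost(clients[a], cells[a]) + cell_cost(clients[b], cells[b])
            cells[a][i], cells[b][j] = cells[b][j], cells[a][i]
            after = cell_cost(clients[a], cells[a]) + cell_cost(clients[b], cells[b])
            if after >= before:
                # No improvement: undo the swap.
                cells[a][i], cells[b][j] = cells[b][j], cells[a][i]
    return cells
```

Because a swap is kept only when it lowers the combined cost of the two cells, the total cost never increases, and each cell keeps the same number of nodes. Like any greedy local search, it can stop at a local optimum, which is consistent with the quote's point that a cheap heuristic is used because global optimization is infeasible.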
> ...Physalia is a transactional key-value store, optimized for use in large-scale cloud control planes, which takes advantage of knowledge of transaction patterns and infrastructure design to offer both high availability and strong consistency to millions of clients. Physalia uses its knowledge of data center topology to place data where it is most likely to be available. Instead of being highly available for all keys to all clients, Physalia focuses on being extremely available for only the keys it knows each client needs, from the perspective of that client.
> ...We believe that the same patterns, and approach to design, are widely applicable to distributed systems problems like control planes, configuration management, and service discovery.
It'd be interesting to contrast this approach with Route53's or IAM's datastores, which need to be globally replicated with time-bounded eventually-consistent reads, and transactional but verifiable writes.
I hope AWS begins publishing about S3, now. One can look at the patents AWS engineers author to get a feel for some of the internals, but they are (intentionally?) hard to read.
It's not really about the design of S3, but if you're interested in some of the philosophy and thinking behind S3 you might enjoy "Beyond eleven nines: Lessons from Amazon S3 culture of durability" https://www.youtube.com/watch?v=DzRyrvUF-C0
There's a lighter-weight introduction to the work here: https://www.amazon.science/blog/amazon-ebs-addresses-the-cha... and for those attending NSDI, I'll be talking about Physalia in the "Deployment Experience" session on Wednesday.