Millions of Tiny Databases [pdf]

mjb · on Feb 14, 2020

This is my paper (along with Tao and Fan). It's a great feeling to have this published and available, and I'm super proud of the team behind Physalia.

There's a lighter-weight introduction to the work here: https://www.amazon.science/blog/amazon-ebs-addresses-the-cha... and for those attending NSDI, I'll be talking about Physalia in the "Deployment Experience" session on Wednesday.

pm90 · on Feb 14, 2020

Thanks so much for doing this! I really like reading about interesting engineering problems that are encountered at scale, and about how engineers think them through and come up with solutions.

Probably tangential, but I wanted to know your thoughts on the case we often see, where an interesting design is made public without the actual implementation, e.g. the MapReduce paper -> Apache Hadoop, Dynamo -> Cassandra (and a few others), Spanner -> CockroachDB, Borg -> K8s, etc.

Having worked at corporations most of my life, I know that internal systems often have a lot of internal dependencies, which makes them really hard to open source; often a refactor is more expensive than writing from scratch.

I can see it both ways; just wondering as the author, what your perspectives might be :).

revertts · on Feb 14, 2020

RE: NSDI - Is the Firecracker paper available somewhere, or not yet?

Edit: I'm an idiot - https://www.amazon.science/publications/firecracker-lightwei...

maxmcd · on Feb 14, 2020

Very cool. I haven't read the whole paper yet, but from a quick overview it seems somewhat similar to SLOG (which also deals with world-scale replication by trying to keep data closer to the nodes that use it):

- http://www.vldb.org/pvldb/vol12/p1747-ren.pdf

- https://blog.acolyer.org/2019/09/04/slog/

Any thoughts on this comparison?

mjb · on Feb 14, 2020

Very interesting, I hadn't seen SLOG before. It seems like there's a fairly similar core insight: placing data near where it's used can help with some system properties. They appear to be more latency-focused and we were (primarily) aiming at availability.

The other thing that's different about Physalia is our focus (again, for availability) on placement for 'blast radius'. That means we try to limit the number of cells that any one failure (software, infrastructure, etc.) can touch. Geo-replicated systems can have similar concerns, but I haven't seen the same level of focus on it as a key design goal.
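
To make 'blast radius' concrete, here is a rough sketch (not Physalia's code) of how one might count the cells a single failure touches. The names `cells`, `domain_of`, `cells_touched` and `cells_losing_majority` are hypothetical, and it assumes a cell stays available only while a strict majority of its nodes are up.

```python
def cells_touched(cells, domain_of, failed_domain):
    """Cells with at least one node inside the failed domain."""
    return {
        cell_id
        for cell_id, nodes in cells.items()
        if any(domain_of[n] == failed_domain for n in nodes)
    }


def cells_losing_majority(cells, domain_of, failed_domain):
    """Cells left without a strict majority of nodes (assumed quorum rule)."""
    lost = set()
    for cell_id, nodes in cells.items():
        down = sum(1 for n in nodes if domain_of[n] == failed_domain)
        surviving = len(nodes) - down
        if surviving * 2 <= len(nodes):
            lost.add(cell_id)
    return lost


# Hypothetical placement: two three-node cells spread over three racks.
cells = {"c1": ["n1", "n2", "n3"], "c2": ["n3", "n4", "n5"]}
domain_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r3", "n5": "r3"}

for rack in ("r1", "r2", "r3"):
    print(rack, cells_touched(cells, domain_of, rack),
          cells_losing_majority(cells, domain_of, rack))
```

In this toy placement, losing rack r2 touches both cells but takes neither below a majority, while losing r1 or r3 touches one cell and takes it down. A placement that spread each cell across more racks would shrink the second number.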

stebann · on Feb 15, 2020

Congrats!

kthejoker2 · on Feb 14, 2020

Thought for sure this'd be a thinkpiece on Excel in the enterprise ...

Seriously, though, this whole paper uses an amazing amount of terminology - blast radius, colony, color, game day, split brain - and an awesome biological metaphor of the Portuguese man o'war.

Great read even if you don't care about fault tolerance, CAP theorem, or distributed balancing at AWS-scale.

One sample quote of the value of cheap heuristics over full-blown number-crunching:

> Globally optimizing the placement of Physalia volumes is not feasible for two reasons, one is that it’s a non-convex optimization problem across huge numbers of variables, the other is that it needs to be done online because volumes and cells come and go at a high rate in our production environment. Figure 11 shows the results of using one very rough placement heuristic: a sort of bubble sort which swaps nodes between two cells at random if doing so would improve locality. In this simulation, we considered 20 candidates per cell. Even with this simplistic and cheap approach to placement, Physalia is able to offer significantly (up to 4x) reduced probability of losing availability.
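
A minimal sketch of the kind of random-swap heuristic the quote describes, under simplifying assumptions that are mine and not the paper's: locality is scored as the number of a cell's nodes in the same site as its client, and the paper's "20 candidates per cell" step is not modeled. All names (`locality`, `random_swap_pass`, `site_of`, `client_site`) are hypothetical.

```python
import random


def locality(nodes, client_site, site_of):
    """Number of a cell's nodes that sit in the same site as its client."""
    return sum(1 for n in nodes if site_of[n] == client_site)


def random_swap_pass(cells, client_site, site_of, attempts, rng):
    """Try `attempts` random swaps; keep a swap only if locality improves.

    cells: dict of cell id -> list of node ids (mutated in place).
    Returns the number of swaps kept.
    """
    cell_ids = list(cells)
    kept = 0
    for _ in range(attempts):
        a, b = rng.sample(cell_ids, 2)  # two distinct cells at random
        i = rng.randrange(len(cells[a]))
        j = rng.randrange(len(cells[b]))
        x, y = cells[a][i], cells[b][j]
        # A cell must not end up holding the same node twice.
        if x in cells[b] or y in cells[a]:
            continue
        before = (locality(cells[a], client_site[a], site_of)
                  + locality(cells[b], client_site[b], site_of))
        cells[a][i], cells[b][j] = y, x
        after = (locality(cells[a], client_site[a], site_of)
                 + locality(cells[b], client_site[b], site_of))
        if after > before:
            kept += 1
        else:
            cells[a][i], cells[b][j] = x, y  # undo a non-improving swap
    return kept
```

Each step only compares two cells before and after a swap, so it is cheap and can run online as cells come and go, which is the trade-off the quote is pointing at: no global optimum, but steady local improvement.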

ignoramous · on Feb 14, 2020

Abstract at: https://www.amazon.science/publications/millions-of-tiny-dat...

> ...Physalia is a transactional key-value store, optimized for use in large-scale cloud control planes, which takes advantage of knowledge of transaction patterns and infrastructure design to offer both high availability and strong consistency to millions of clients. Physalia uses its knowledge of data center topology to place data where it is most likely to be available. Instead of being highly available for all keys to all clients, Physalia focuses on being extremely available for only the keys it knows each client needs, from the perspective of that client.

> ...We believe that the same patterns, and approach to design, are widely applicable to distributed systems problems like control planes, configuration management, and service discovery.

It'd be interesting to contrast this approach with Route53's or IAM's datastores, which need to be globally replicated with time-bounded eventually-consistent reads, and transactional but verifiable writes.

I hope AWS begins publishing about S3, now. One can look at the patents AWS engineers author to get a feel for some of the internals, but they are (intentionally?) hard to read.

For instance, patents filed by two of the many S3 founding-engineers: https://patents.google.com/?inventor=James+Christopher+Soren...

Also see:

https://aws.amazon.com/builders-library/

https://research.google/pubs/

mjb · on Feb 14, 2020

It's not really about the design of S3, but if you're interested in some of the philosophy and thinking behind S3 you might enjoy "Beyond eleven nines: Lessons from Amazon S3 culture of durability" https://www.youtube.com/watch?v=DzRyrvUF-C0

vimota · on Feb 14, 2020

Tangentially related, BigQuery uses a similar usage-based approach to place and replicate data in a manner that's likely to be available for users:

https://cloud.google.com/blog/products/data-analytics/how-bi...