
The tax accountant analogy goes pretty well with this:

The accountant specializes in doing your taxes (aka serving up your data/files), but you still need to keep a copy of your tax records, receipts, etc.




And so, ideally, I want to be able to tell my accountant to store my archival tax records in my safe-deposit box, not in his office. Compute infrastructure is a commodity—it doesn't matter who's paying for it—but all the services you depend on should rely on storage infrastructure you have an SLA with.

I don't care about the "distributed computation" promises of Diaspora or Sandstorm.io; I think they're wrongheaded. Anyone can do compute. But I really do hope that one day my Facebook account can be canonically "stored on" a database instance I'm paying for, which Facebook's app servers would reach out to, connect to, and treat as the canonical source for "me", treating their own records as a secondary cache. This kind of setup would make all sorts of things simpler, clearer, and more secure; there would be a definite boundary where "my data" stops (the DB I own) and "Facebook's data" starts (the DBs they own).

And, to be clear, I'm not talking about everybody running their own infrastructure, or even everybody knowing what an IaaS provider is. Ideally, PaaS providers could get into the "consumer instances" game the way Dropbox is in the "consumer files" game. My "Facebook account database" above could be transparently launched into my own cute little private cloud by my PaaS provider when Facebook requests it through some OAuth-like pairing API (sketched below). I wouldn't need to think of myself, as a user, as "owning cloud database instances." From my perspective, I'd get an abstract "Facebook account" (which is actually an app instance and attached DBs) sitting in my Heroku-for-consumers account. The important bits are that I'd be paying for the resources that "account" object consumes, that I'd have an SLA on those resources, and that the PaaS company would have every incentive to make it easy for other third-party services to interact with my "Facebook database" in ways that Facebook itself isn't incentivized to support. I, as a user, would have no need to "manage" a cloud of my own; I'd just need to be considered to own it.
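To make that concrete, here's a rough sketch of what such a pairing flow might look like from the service's side. Every endpoint, scope, and field name below is invented for illustration; the only thing assumed is a vanilla OAuth-style code exchange against a hypothetical PaaS API.

  # Hypothetical flow: a service asks the user's PaaS provider to provision
  # a user-owned database instance, authorized by an OAuth-style grant.
  # None of these endpoints or fields are real; this is a sketch.
  import requests

  PAAS = "https://paas.example.com"

  # 1. The user approves provisioning at their PaaS provider, which
  #    redirects back to the service with an authorization code.
  auth_code = "code-from-consent-redirect"

  # 2. The service exchanges the code for a token scoped to provisioning.
  token = requests.post(PAAS + "/oauth/token", data={
      "grant_type": "authorization_code",
      "code": auth_code,
  }).json()["access_token"]

  # 3. The service asks the provider to launch the user-owned instance.
  instance = requests.post(
      PAAS + "/instances",
      headers={"Authorization": "Bearer " + token},
      json={"app": "facebook-account", "plan": "consumer-small"},
  ).json()

  # 4. The service stores only the connection handle; the user pays for
  #    the instance, holds the SLA, and can revoke the grant at any time.
  connection_uri = instance["connection_uri"]

The key property is step 4: the service ends up holding a revocable handle to storage the user owns, rather than owning the storage itself.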


If I were a Facebook engineer, I would never agree to this, because there is no way to optimize performance in this scenario.

What happens to Facebook when your data provider goes down, or just gets slow? What if they mess up permissions or change their API?

Maybe you're thinking "that's fine; if my provider isn't reliable, my Facebook account becomes unavailable, and it's up to me to choose a better provider." But what about all the people who are sharing your feed (or whatever it's called these days; I don't really use Facebook)? Do they query your stuff and then time out when it doesn't respond in time? Now other people's stuff is slow to load.

Just seems like an engineering nightmare, to me.


Think of the Datomic architecture[1]: some nodes are "storage", while other nodes are "transactors." The transactor nodes pull "chunks" of rows/objects/documents from storage nodes, compute relational indexes locally, answer queries from those computed indexes, and finally persist computed index "chunks" back to the storage nodes.

Now, note that the "canonical" storage that gets read from doesn't have to be the same storage that the indexes get written to. The first can be owned by the user, while the second can be owned by Facebook.
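In toy form, the split looks something like this. (Python, with every class and field name invented; real Datomic works in terms of immutable segments and datoms, so this is just the shape of the idea, not its implementation.)

  # Sketch: a transactor pulls chunks from user-owned canonical storage,
  # computes indexes locally, serves queries from them, and persists the
  # computed index chunks to service-owned storage.
  class StorageNode:
      def __init__(self):
          self.chunks = {}  # chunk_id -> list of records

      def get(self, chunk_id):
          return self.chunks.get(chunk_id, [])

      def put(self, chunk_id, data):
          self.chunks[chunk_id] = data

  class Transactor:
      def __init__(self, canonical, index_store):
          self.canonical = canonical      # storage the user owns
          self.index_store = index_store  # storage Facebook owns
          self.index = {}                 # locally computed index

      def rebuild(self, chunk_ids):
          for cid in chunk_ids:
              for record in self.canonical.get(cid):
                  self.index.setdefault(record["key"], []).append(record)
          # The computed index lands in Facebook's storage, not the user's.
          self.index_store.put("index-segment-0", self.index)

      def query(self, key):
          return self.index.get(key, [])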

Presuming an architecture like this, the latency and availability of the user's "Facebook account" database are relatively immaterial. While writes would have to be synchronous (so, like you said, Facebook would have to give the user a "sorry, your account is unavailable" error), reads could be asynchronous. Think of an online RSS feed reader service: the "primary sources" are the third-party sites with their RSS feeds. Sometimes those sites go down. When they do, the reader service can't retrieve the feed, so that feed just goes stale.
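The read path, then, is just a read-through cache that tolerates the origin being down. A minimal sketch, assuming a hypothetical `fetch` client for the user's DB:

  # Sketch: try the user's canonical DB first; if it's slow or down, serve
  # the (possibly stale) cached copy, the way a feed reader serves a stale
  # feed when the origin site is unreachable. `user_db.fetch` is invented.
  import time

  def read_profile(user_db, cache, user_id, timeout=0.2):
      try:
          fresh = user_db.fetch(user_id, timeout=timeout)
          cache[user_id] = (fresh, time.time())
          return fresh
      except (TimeoutError, ConnectionError):
          cached, _ = cache.get(user_id, (None, 0.0))
          return cached  # may be stale, or None if never seen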

Things like Facebook's graph, meanwhile, are fundamentally indexes. The base-level "documents" in the graph are relationship assertions—a copy of "B accepted A's friend request" stored in B's database. The graph is a computed value built on a pile of those. When Facebook can't reach someone's database, things like these relationship assertions just go stale.
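As a sketch (with the document shapes invented for illustration), the friendship "edge" isn't stored anywhere as such; it's recomputed from assertion documents pulled out of each user's own DB:

  # Sketch: friendship "edges" are derived from assertion documents that
  # live in each user's canonical DB; the graph is an index computed over
  # whatever assertions Facebook has most recently been able to pull.
  def build_friend_index(assertions_by_user):
      index = set()
      for user, docs in assertions_by_user.items():
          for doc in docs:
              if doc.get("type") == "accepted_friend_request":
                  # stored in B's database: "B accepted A's friend request"
                  index.add(frozenset((user, doc["from"])))
      return index

  assertions = {
      "B": [{"type": "accepted_friend_request", "from": "A"}],
  }
  friends = build_friend_index(assertions)
  print(frozenset(("A", "B")) in friends)  # True; goes stale if B's DB is down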

The crucial idea here is that for Facebook to do its job, it probably has to cache a majority of the stuff in the user's database in one form or another—just like an RSS reader caches RSS feeds. But this is purely a cache, in a fundamental sense. Users who don't "check in" with Facebook could be cache-evicted from its database. Other users would still have relationships with them and be able to post on their wall and such (they'd be putting those documents in their own outbox); Facebook would just no longer bother computing anything that's personal to the dormant user, like their news feed. There would be every incentive to set up the architecture such that user data that wasn't needed would be "garbage collected" off of Facebook's servers, because it could always be put back on the moment that user's account-instance woke up again and said hello.
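In cache terms, something like this (all names hypothetical):

  # Sketch: Facebook's copy of a user is a pure cache entry. Dormant users
  # can be evicted wholesale; when their account-instance reconnects, the
  # entry is rebuilt from their canonical DB.
  class UserCache:
      def __init__(self, fetch_from_canonical):
          self.entries = {}
          self.fetch = fetch_from_canonical  # pulls from the user-owned DB

      def evict_dormant(self, user_id):
          # Safe to drop: everything here is recomputable from the source.
          self.entries.pop(user_id, None)

      def hello(self, user_id):
          # The user's account-instance "checks in"; rehydrate the cache.
          self.entries[user_id] = self.fetch(user_id)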

This would also mean that Facebook wouldn't need to store any of its user data in anything resembling a relational normal form. Every table would be a "view" table. The canonical database, owned by the user, could be relational and full of nice constraints and triggers (which the user could even add themselves); but since Facebook can just query out of it to get any data it's missing, it wouldn't need anything like a "users" table. (Fascinatingly, if Facebook were built as a microservice architecture, each microservice would probably separately query the canonical data from the user's database in order to generate its own indexes; the Search service would know one "face" of you while the Photos service would know quite another. These could even—in theory—be separately ACLed within your own DB instance, giving the user true, actual control over what Facebook can do with their data, component by component.)
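Component-level control could then be as simple as per-service grants on the user's DB, something like the following (all service and table names invented):

  # Sketch: each Facebook microservice gets its own grant on the user's
  # canonical DB, so Search and Photos each see a different "face" of the
  # data, and the user can revoke each grant independently.
  GRANTS = {
      "facebook-search": {"tables": {"posts", "profile"}, "mode": "read"},
      "facebook-photos": {"tables": {"photos", "albums"}, "mode": "read"},
  }

  def authorize(service, table, mode):
      grant = GRANTS.get(service)
      return bool(grant) and table in grant["tables"] and mode == grant["mode"]

  print(authorize("facebook-search", "photos", "read"))  # False: wrong "face"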

[1] http://docs.datomic.com/architecture.html


It sounds like what you want is close to the goals of Peergos (https://github.com/ianopolous/Peergos). Disclaimer: I am the team lead. We use IPFS for storage (and for P2P networking and the DHT), and we're storage-provider agnostic: the provider can't read your data. You can add as many IPFS instances as you like to back up (pin) your files, on whatever storage provider you like. You could grant a company like Facebook fine-grained read or write access to the sections of your data that are relevant to them.



