ActorDB is a neat project. Simplified, it's a distributed SQLite where consistency is guaranteed by the Raft protocol.
However, it has a unique and initially confusing data model where the database is divided into "actors", which are self-contained shards. For example, a database modeled on Hacker News would probably have an actor per user, an actor per story, and perhaps an actor per thread.
Every actor acts like a self-contained database, with its own set of tables. When you want to query or update data, you first tell it which actor(s) to operate on; but unlike systems like Cassandra, the sharding is explicit, in that the shards have identifiers, and there's no automatic sharding function. Indeed, you can have actors that act like a basic database, e.g. a global, shared list of lookup values could be a single actor.
You have ACID transactions within actors, and you can also do transactions across actors, and queries can also span multiple actors. You can't do joins across actors, as far as I recall. Schema migrations also become interesting, since each actor has its own, entirely separate schema.
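To make the "every actor is its own self-contained database" idea concrete, here is a minimal sketch in Python using sqlite3. The `ActorStore` class and the `user/...` identifiers are invented for illustration, not ActorDB's API; the point is that each actor has its own private tables and the caller names the shard explicitly.

```python
import sqlite3

# Hypothetical sketch: each "actor" is its own self-contained SQLite
# database with its own tables, keyed by an explicit identifier.
class ActorStore:
    def __init__(self):
        self._actors = {}  # actor id -> sqlite3.Connection

    def actor(self, actor_id, schema_sql):
        """Open (or lazily create) the actor's private database."""
        if actor_id not in self._actors:
            conn = sqlite3.connect(":memory:")
            conn.executescript(schema_sql)
            self._actors[actor_id] = conn
        return self._actors[actor_id]

# Every actor gets an isolated set of tables; there is no automatic
# hash-based placement -- the caller names the shard explicitly.
store = ActorStore()
alice = store.actor("user/alice", "CREATE TABLE posts(id INTEGER, body TEXT);")
alice.execute("INSERT INTO posts VALUES (1, 'hello')")
bob = store.actor("user/bob", "CREATE TABLE posts(id INTEGER, body TEXT);")
print(bob.execute("SELECT COUNT(*) FROM posts").fetchone()[0])  # 0 -- isolated
```

A single global actor (like the shared lookup-value list mentioned above) is just another identifier in the same store.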
> You have ACID transactions within actors, and you can also do transactions across actors
I gotta say... this project strikes me as very strange. If you're going to go through all the trouble of sharding your data across a set of HA actors... why not also shard behavior? Co-locate data and behavior! Isn't that a key driver of the actor model? Then you wouldn't communicate with your actors using a very limited language like SQL; you could send them real domain-specific messages. Most importantly, when it comes to transactions across actors... you don't need them. All you need is guaranteed message delivery to the various actor mailboxes, and all that complexity melts away. You need this anyway if you want your actors to actually collaborate and send each other messages. This whole project seems designed to eliminate many of the benefits of the actor model...
(I don't like the all-too-common HN cynicism but I also worry when I see stuff like this. Either I'm crazy and missing something obvious... or everybody else is crazy. But then much of what comes out of the Erlang space is... strange to me. That culture seems to have a very unique idea of distributed computing.)
I think you have to think of the "actors" in ActorDB as shards, not as actors in the "actor model" sense. The nomenclature here is unfortunate, because actors themselves are sharded.
This is a database that permits any application to read and write data without coupling it to a specific language or platform, e.g. Erlang. You just write SQL. People know SQL. The difference is that your app must be shard-aware, which can be an acceptable compromise for apps that want to scale far. The sharding also gives you some measure of enforced encapsulation, since you can't have foreign keys across actors (as far as I know).
This project might be particularly suitable for SaaS-type multitenant systems that need to have a clear boundary between each tenant.
Admittedly, I read through all the docs in one rapid sitting, but I thought I came across mention of foreign keys across actors being supported. I’ll have to double-check that, though.
In terms of GDPR compliance, the fundamental element is pseudonymization of internal dataflows and data management -- that'll work with any data store.
I mean, there is some possible advantage to having every user in their own actor namespace, for bulk deletes and a concrete mapping of data usage, but IME you're always stuck with the need to preserve the operational state of the application and the quality of its legacy data.
Regardless of sharding strategy, somehow your data warehouse needs to be able to pull out or confirm data from the previous quarter...
You have a great knack for explaining these concepts clearly for a beginner like myself . If you ever consider writing a technical book I think it would be great!
Digging through the docs, I can't find any actual information about the consistency and isolation guarantees other than "consistent" and "ACID". That's an immediate red flag for me. What are the actual isolation characteristics of the database?
Does it use snapshot isolation? Is it serializable? Is it linearizable? With all the great work Kyle Kingsbury (aka Aphyr) has done on the Jepsen tests, it's pretty clear that claiming to be "ACID" with no additional info isn't sufficient for a modern database.
Most docs for technical projects are written that way these days. The author of an API, project, or solution assumes unquestioning adoption "because it's free" from a random audience, with no specifics on what they are solving or, more importantly, what's different from other approaches. I don't know if this comes from laziness or writing procrastination, but assuming that your users know as much as you do about your journey to assemble a solution has become a frustration whenever I discover new projects like this one.
Likely has a lot more to do with the fact that the intersection between the population that can create such projects and the population that can write clear effective documentation is exceedingly small.
I would also point to Paul Graham's insight about hackers and painters[0]: the population that wants to create such projects fantasizes about coding, hacking, publishing, bug squashing, and launches. Documentation isn't necessarily what pulls them to their PC after a heavy dinner.
SQLite uses serializable isolation by default, this is built on top of that.
I'm also curious how Raft is used and what the resulting guarantees on distributed table operations are, but I'm coming at it assuming good faith, since SQLite is... robust.
> SQLite uses serializable isolation by default, this is built on top of that.
That's sort of a meaningless statement, though. Everything in the world is built on top of RAM, which is serializable. The real complexity comes from how those foundational blocks are abstracted and combined.
We know that it's not performant at scale to represent an entire database log as a single Raft group. So how are the Raft groups structured and organized, and how and when do they cross-communicate?
ActorDB sounds like an interesting piece of software. It looks like it's good for simple applications that have large amounts of data. Some thoughts I have based on the docs:
ActorDB has no concurrency at the actor level. This makes it a poor fit for applications that have lots of concurrency around the same pieces of data. A long running read or write on a single actor will lock out any other reads or writes.
Likewise, distributed transactions lock all actors involved in the transaction until the transaction completes.
It seems like reads have to go through a round of Raft. This increases latency for reads. It also decreases throughput, although I'm not sure how big of a deal that is given the lack of concurrency.
It's unclear to me how ActorDB guarantees serializability for multi-actor transactions. You need some way to guarantee two multi-actor transactions will execute in the same order on every actor. Based on the docs, Raft is performed at the actor level and not across multiple actors. ActorDB does use two-phase commit to guarantee atomicity across multiple actors, but there's no description of how it handles serializability.
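To illustrate the atomicity-versus-serializability distinction in the comment above, here is a toy two-phase commit across two in-memory "actors". All names here are invented for the sketch, not ActorDB's API. Note that 2PC alone gives atomicity (all-or-nothing), which is exactly why the ordering question about concurrent multi-actor transactions still stands.

```python
# Toy two-phase commit across two "actors" (dict-backed stores).
# Phase 1: every participant must prepare (and lock); phase 2: commit.
# If any participant refuses, all prepared participants roll back.
class Actor:
    def __init__(self):
        self.data, self.staged, self.locked = {}, None, False

    def prepare(self, updates):
        if self.locked:                  # already in another transaction
            return False
        self.locked, self.staged = True, updates
        return True

    def commit(self):
        self.data.update(self.staged)
        self.staged, self.locked = None, False

    def abort(self):
        self.staged, self.locked = None, False

def transact(actors, updates):
    prepared = [a for a in actors if a.prepare(updates)]
    if len(prepared) == len(actors):     # phase 1 unanimous: commit all
        for a in actors:
            a.commit()
        return True
    for a in prepared:                   # otherwise roll back the rest
        a.abort()
    return False

a, b = Actor(), Actor()
assert transact([a, b], {"x": 1})        # both actors commit
b.locked = True                          # simulate a conflicting lock on b
assert not transact([a, b], {"x": 2})    # second transaction aborts cleanly
assert a.data["x"] == 1                  # first commit survived the abort
```

This also shows the locking cost mentioned earlier: while an actor is prepared, every other transaction touching it is refused.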
Based on my reading, ActorDB is good if you have lots of data and your queries have low concurrency and you don't require high throughput. If you have high concurrency or require high throughput, my guess is ActorDB will be a poor fit.
I used to work with Pivotal's Greenplum, which is also a distributed DB, and I quite liked it. Basically PostgreSQL with syntactic sugar for partitioning and distribution across servers. I had the pleasure of never needing an index in the database.
This sounds like an interesting project as well.
The question for me is always: how will this give the project that adopts it a small learning curve for new recruits, easy-to-understand integration at the code level, and low maintenance in the long run?
The core strengths of Erlang: built-in support for concurrency, distribution, and fault tolerance. There are many languages I enjoy; Erlang I've never learned to its full potential (the syntax, and nobody to pay me to work with it). I would also love to see a full-blown relational SQL database in Rust or D, and maybe a NoSQL one as well.
A long time ago, I took on a huge learning task: porting the Erlang VM to another language I needed close bindings to. Discovering how the runtime is built, how it depends on other libraries, and how layered and spaghetti-like the implementation has become over the years gave me enough insight to decide that Erlang is not in itself the panacea of reliability and fault tolerance that most people value the runtime for.
I don't quite get this; you are saying Erlang is fault tolerant? I'm confused here, because fault tolerance is something the application using a particular language needs to ensure, right? Sorry if I misread your statement.
Erlang, and especially OTP, "forces" you to follow a standard that usually results in a highly reliable service powered by concise, easy-to-read, maintainable code.
It's more about assuming by default that things will fail, with good ways of handling it built in at the language level. A lot of people also find its inventor's thesis enlightening. Here it is:
A lot of these concepts have been carried forward to some popular frameworks and languages. Akka (or Akka.Net & Akkling), is a great example. Railway Oriented Programming (https://fsharpforfunandprofit.com/rop/), has gained some traction in the .net sphere and, along with F#'s Result<'success, 'failure> type, emphasizes structured and exception-less error handling.
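To show the Railway Oriented Programming idea outside F#, here is a minimal sketch in Python. The `Ok`/`Err` classes and `bind` helper are invented for illustration (F#'s built-in type is `Result<'success, 'failure>`): each step returns a result, and `bind` short-circuits on the first error, so failures flow down a separate "track" instead of raising exceptions.

```python
from dataclasses import dataclass

# Two-track result types: a computation is either on the success
# track (Ok) or the failure track (Err).
@dataclass
class Ok:
    value: object

@dataclass
class Err:
    error: object

def bind(result, fn):
    # Run the next step only on the success track; pass Err through.
    return fn(result.value) if isinstance(result, Ok) else result

def parse_int(s):
    return Ok(int(s)) if s.isdigit() else Err(f"not a number: {s!r}")

def positive(n):
    return Ok(n) if n > 0 else Err("must be positive")

# Success track: both steps run.  Failure track: later steps are skipped.
assert bind(parse_int("42"), positive) == Ok(42)
assert bind(parse_int("abc"), positive) == Err("not a number: 'abc'")
```

The appeal is the same as in the F# version: error handling is structured and visible in the types rather than hidden in exception paths.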
It may be a poor man's Erlang, but better than nothing ;)
Handling concurrency and distributed processes is Erlang's forte. Whether ActorDB leverages Erlang's built-in Mnesia (an ACID-compliant distributed database), I'm not sure. Erlang's Prolog-like syntax will help define the relational model.
They are Erlang's forte. But the common denominator is the programmer currently assigned to the task. A 10x developer will invariably take any language and make it shine. Try Erlang with a team of Java developers and you are in for some crazy adventures.
Glad you asked. I haven’t refreshed this list for a while so I don’t know how valid the links are, but there should be plenty of useful content here: https://gist.github.com/macintux/6349828
Apples to oranges, really. Cockroach presents a single database view that happens to be sharded and replicated in a fault-tolerant, consistent way. ActorDB doesn't.
With ActorDB you have to design your data model to shard at a high level. For example, you could shard it by user; every user would have its own database that ActorDB will shard and replicate for you. That database is for the most part separate from everything else -- it has its own tables and indexes.
ActorDB provides tools to operate on multiple actors, but they're explicit. For example, you can query across multiple actors, but this requires a small SQL declaration at the beginning of the query to select which actors to query, a bit like a "for each <actors> do <some SQL>".
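The "for each <actors> do <some SQL>" idea can be sketched as a client-side fan-out over per-actor SQLite databases. This is a conceptual illustration only, not ActorDB's actual actor-selection syntax: the caller explicitly lists which actors to hit, and the same statement runs on each.

```python
import sqlite3

# Hypothetical fan-out: run one SQL statement against every selected
# actor and merge the results client-side, tagged with the actor id.
def query_actors(actors, sql, params=()):
    rows = []
    for actor_id, conn in actors.items():
        for row in conn.execute(sql, params):
            rows.append((actor_id, *row))
    return rows

# Build two independent "actors", each with its own posts table.
actors = {}
for name in ("alice", "bob"):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts(body TEXT)")
    conn.execute("INSERT INTO posts VALUES (?)", (f"post by {name}",))
    actors[name] = conn

# The selection is explicit -- nothing routes queries automatically.
results = query_actors(actors, "SELECT body FROM posts")
```

Note what this model rules out: there is no cross-actor join here, only a union of per-actor result sets.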
You can also do transactions across multiple actors, though this uses two-phase commit (which coincidentally is the strategy used by Google Spanner), and requires some locking.
So Cockroach pretends to be a classic RDBMS (databases have tables and indexes, but most apps just use a single database per app), allowing an existing app to be ported with little effort. It would be harder to port an app to ActorDB.
Or rather: how does this compare to any database that runs in production at non-trivial scale and load in a company? Does this database add anything to the data storage landscape that's worth noting? I'm not sure. There are pretty good and reliable databases out there.
Another popular SQLite + Raft project is rqlite ( https://github.com/rqlite/rqlite ), but the goals seem slightly different (rqlite only uses distribution for fault-tolerance, not for sharding).
Spanner seems to solve a different problem. It’s effectively a multi-master database where the leader election algorithm can be tuned to ensure that your masters are geographically close to where writes/reads are actually happening.
Many databases are “partitioned” by user anyway, so in this case the DB can be smarter if it doesn’t have to handle cross-partition queries.
Seems like the idea of partitioning a MySQL database taken to the next level.
That does not always follow. A db designer still has the flexibility to give a client less than serializable/linearizable consistency if the client so wishes.
For example, it is possible for the db to allow direct access to any replica, not just the leader, and if the replica is not part of the quorum, the client will see outdated information. This may be just fine for data that seldom changes, or where a certain amount of staleness is just fine.
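The trade-off described above can be sketched with a toy leader/follower cluster. The classes and `replicate_to` knob are invented for illustration: a write that reaches only a quorum leaves lagging followers serving stale (but low-latency) reads.

```python
# Toy replicated log: one leader, several followers.  Replication may
# reach only a quorum, so out-of-quorum followers lag behind.
class Replica:
    def __init__(self):
        self.log = []

class Cluster:
    def __init__(self, n):
        self.leader = Replica()
        self.followers = [Replica() for _ in range(n - 1)]

    def write(self, entry, replicate_to=1):
        self.leader.log.append(entry)
        # Only the first `replicate_to` followers receive the entry now.
        for f in self.followers[:replicate_to]:
            f.log.append(entry)

    def read_leader(self):
        return self.leader.log          # always current

    def read_any(self, i):
        return self.followers[i].log    # fast, but possibly stale

c = Cluster(3)
c.write("x=1", replicate_to=1)          # quorum of 2 out of 3
assert c.read_leader() == ["x=1"]
assert c.read_any(0) == ["x=1"]         # in-quorum follower is caught up
assert c.read_any(1) == []              # lagging follower: stale read
```

Whether that stale read is acceptable is exactly the application-level choice the parent comment describes.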
From that point of view yes you are right. There is a ton of nuance in CAP. Usually in these discussions the CAP question is if it is eventually consistent or not.
That's the problem... There's not a ton of nuance in CAP. CAP has a formal definition and a system is provably at most CP or AP. Other terms that people often associate with CAP is where the nuance comes in. The reality of the matter is that systems claim to be CP or AP, but most have assumptions or bugs that violate it, so in the end, most systems are just P.
Stale reads technically violate the definition of Consistent in CAP. Using raft means unavailability of some subset of nodes can make the entire system unavailable, violating the definition of Availability in CAP.
So the parent comment's suggestion of serving reads from any replica would result in only a partition tolerant system, without any formal availability or consistency guarantees.
An 'explicitly sharded' SQL database - effectively as many databases as you need with some special sauce to let you run transactions across multiple shards. Neat.
> Think of running a large mail service, dropbox, evernote, etc. They all require server side storage for user data, but the vast majority of queries is within a specific user.
I have this exact use case for a new project I’m working on so I’m fine with the 1-db-per-user approach. ActorDB definitely sounds very interesting, but are there any alternatives to compare it with?
Vitess (https://vitess.io/) is the closest alternative I know of. Vitess came from YouTube, and is used at some other sizable companies. They have a similar model of sharding, they both support the MySQL wire protocol and a binary RPC protocol (gRPC for Vitess, Thrift for ActorDB).
Sure, go play with it and report when you are done. It's always a matter of time for something claiming to be the last one to become irrelevant and forgotten.
How do schema upgrades work? It looks like a specific actor type is tied to a specific schema, but I assume schema upgrades must be eventually consistent, unless they use two-phase commit across all actors of the type?
Schemas are per actor type, and you can have many types. Actors are either active or inactive. When you run a query against an inactive actor, it becomes active, checks whether the schema has changed, and applies the change before processing any query.
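That "apply the schema change on activation" pattern can be sketched with SQLite's `PRAGMA user_version` as the per-actor schema version. The `MIGRATIONS` list and `activate` function are invented for illustration, not ActorDB's mechanism; the point is that each actor catches up lazily the next time it is opened.

```python
import sqlite3

# Versioned migrations for one actor type.  An actor records which
# version it has applied via PRAGMA user_version and catches up lazily.
MIGRATIONS = [
    "CREATE TABLE posts(id INTEGER PRIMARY KEY, body TEXT)",
    "ALTER TABLE posts ADD COLUMN created_at TEXT",
]

def activate(conn):
    """Bring the actor's schema up to date before serving any query."""
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    for migration in MIGRATIONS[version:]:
        conn.execute(migration)
    conn.execute(f"PRAGMA user_version = {len(MIGRATIONS)}")
    return conn

# A brand-new (version 0) actor runs every migration on activation.
conn = activate(sqlite3.connect(":memory:"))
cols = [row[1] for row in conn.execute("PRAGMA table_info(posts)")]
print(cols)  # ['id', 'body', 'created_at']
```

This also matches the "eventually consistent" intuition above: inactive actors stay on the old schema until they are next touched.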
> None of these DBs seem to be byzantine fault tolerant, though. Any examples of BFT databases?
Not sure what you mean. Obviously we can't have blockchain-level byzantine fault tolerance against malicious nodes, only tolerance of byzantine failures. But that is pretty much what CAP answers: AP systems can handle byzantine failures because they don't need consensus, while CP systems can't, because they do. Quorum logic sits in the middle, where you can handle some byzantine failures and keep some consistency.
Storage and compute are two layers in a normal application. You are free to switch the computing layer as long as the storage layer speaks some standard protocol.
With Akka + Akka Persistence, how can you switch the computing layer?
Another option for scaling SQL is Vitess, which is OSS. Might be worth a look. It uses a similar pattern of an actor up front and scaling MySQL on the back end.