ActorDB – Distributed SQL database (github.com/biokoda)
284 points by iamd3vil on June 17, 2018 | 85 comments



ActorDB is a neat project. Simplified, it's a distributed SQLite where consistency is guaranteed by the Raft protocol.

However, it has a unique and initially confusing data model where the database is divided into "actors", which are self-contained shards. For example, a database modeled on Hacker News would probably have an actor per user, an actor per story, and perhaps one per thread.

Every actor acts like a self-contained database with its own set of tables. When you want to query or update data, you first tell it which actor(s) to operate on. Unlike systems such as Cassandra, the sharding is explicit: the shards have identifiers, and there's no automatic sharding function. Indeed, you can have actors that act like a basic database; e.g., a global, shared list of lookup values could be a single actor.

You have ACID transactions within actors, you can do transactions across actors, and queries can span multiple actors. You can't do joins across actors, as far as I recall. Schema migrations also become interesting, since each actor has its own, entirely separate schema.
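The per-actor isolation can be sketched in a few lines. This is a toy model using plain in-memory SQLite with made-up actor names, not ActorDB's API:

```python
import sqlite3

# Toy sketch: each "actor" is its own self-contained SQLite database.
# Here actors live in memory; ActorDB would replicate each one via Raft.
actors = {}

def actor(actor_id):
    """Fetch or lazily create the named actor's private database."""
    if actor_id not in actors:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
        actors[actor_id] = conn
    return actors[actor_id]

# Every statement names its actor explicitly -- there is no automatic
# shard function mapping a key to a shard.
actor("user.alice").execute("INSERT INTO comments (body) VALUES ('hello')")
actor("user.bob").execute("INSERT INTO comments (body) VALUES ('world')")

# Each actor's tables are invisible to every other actor.
rows = actor("user.alice").execute("SELECT body FROM comments").fetchall()
```

The point of the sketch is that "user.alice" and "user.bob" share a schema but no data; a query only ever sees the actor it named.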


> You have ACID transactions within actors, and you can also do transactions across actors

I gotta say... this project strikes me as very strange. If you're going to go through all the trouble of sharding your data across a set of HA actors... why not also shard behavior? Co-locate data and behavior! Isn't that a key driver of the actor model? Then you wouldn't communicate with your actors using a very limited language like SQL; you could send them real domain-specific messages. Most importantly, when it comes to transactions across actors... you don't need them. All you need is guaranteed message delivery to the various actor mailboxes, and all that complexity melts away. You need this anyway if you want your actors to be able to actually collaborate and send each other messages. This whole project seems designed to eliminate many of the benefits of the actor model...

(I don't like the all-too-common HN cynicism but I also worry when I see stuff like this. Either I'm crazy and missing something obvious... or everybody else is crazy. But then much of what comes out of the Erlang space is... strange to me. That culture seems to have a very unique idea of distributed computing.)


I think you have to think of the "actors" in ActorDB as shards, not as actors in the "actor model" sense. The nomenclature here is unfortunate, because actors themselves are sharded.

This is a database that permits any application to read and write data without coupling it to a specific language or platform, e.g. Erlang. You just write SQL. People know SQL. The difference is that your app must be shard-aware, which can be an acceptable compromise for apps that want to scale far. The sharding also gives you some measure of enforced encapsulation, since you can't have foreign keys (that I know of) across actors.

This project might be particularly suitable for SaaS-type multitenant systems that need to have a clear boundary between each tenant.


Admittedly, I read through all the docs in one rapid sitting, but I thought I came across mention of foreign keys across actors being supported. I’ll have to double-check that, though.


This model is pretty interesting and might make GDPR compliance easier, if I'm understanding it correctly


In terms of GDPR compliance the fundamental element is pseudonymization of internal dataflows and data management -- that'll work with any data store.

I mean, there is some possible advantage by having all users in their own actor-namespace, for bulk deletes and concrete mapping of data-usage, but IME you're always stuck with a need to preserve the operational state of the application and the quality of its legacy data.

Regardless of sharding strategy, somehow your datawarehouse needs to be able to pull out or confirm data from the previous quarter...


You have a great knack for explaining these concepts clearly for a beginner like myself. If you ever consider writing a technical book, I think it would be great!


Digging through the docs, I can't find any actual information on the consistency and isolation guarantees other than "consistent" and "ACID". That's an immediate red flag for me. What are the actual isolation characteristics of the database?

Does it use snapshot isolation? Is it serializable? Is it linearizable? With all the great work Kyle Kingsbury (aka Aphyr) has done on the Jepsen tests, it's pretty clear that claiming to be "ACID" with no additional info isn't sufficient for a modern database.


Most docs for technical projects are written that way these days. The author of an API, project, or solution assumes unquestioned adoption "because it's free" from a random audience, with no specifics on what they are solving or, more importantly, how it differs from other options. I don't know if this is laziness or writing procrastination, but assuming that your users know as much as you learned on your journey to assemble a solution has become a frustration when I discover new projects like this one.


Likely has a lot more to do with the fact that the intersection between the population that can create such projects and the population that can write clear effective documentation is exceedingly small.


I would also point to Paul Graham's insight around hackers and painters[0]: the population that wants to create such projects is fantasizing about coding, hacking, publishing, bug squashing, and launches. Documentation isn't necessarily pulling them to their PC after a heavy dinner.

-

[0] http://www.paulgraham.com/hp.html


I feel as if you have a bit of a tone against these authors.

Would you rather the software not exist at all?


SQLite uses serializable isolation by default, this is built on top of that.

I'm also curious how Raft is used and what the resulting guarantees on distributed table operations are, but I'm coming at it assuming good faith, since SQLite is... robust.


Explained in the docs [1]. They replaced the SQLite WAL with one that uses Raft.

Each actor has its own WAL, and so I suppose only operations within a single actor are consistent. Multi-actor transactions use two-phase commit.
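The cross-actor part can be sketched as a classic two-phase commit. The Actor class and its methods below are hypothetical; this only shows the coordinator's control flow, not ActorDB's actual wire protocol:

```python
# Toy two-phase commit across actors, assuming each actor can stage a
# write tentatively (prepare) and later commit or abort it. ActorDB
# layers this on per-actor Raft logs; this is only the control flow.
class Actor:
    def __init__(self):
        self.committed = {}
        self.staged = None

    def prepare(self, key, value):
        # A real actor would vote "no" here if it is locked or the
        # write conflicts with its local state.
        self.staged = (key, value)
        return True

    def commit(self):
        key, value = self.staged
        self.committed[key] = value
        self.staged = None

    def abort(self):
        self.staged = None

def run_transaction(participants, key, value):
    # Phase 1: every participant must vote yes.
    if all(a.prepare(key, value) for a in participants):
        # Phase 2: commit everywhere.
        for a in participants:
            a.commit()
        return True
    # Any "no" vote aborts the whole transaction.
    for a in participants:
        a.abort()
    return False
```

Note how the transaction is all-or-nothing across participants, which is exactly the atomicity guarantee the docs claim for multi-actor writes.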

[1] http://www.actordb.com/docs-howitworks.html


> SQLite uses serializable isolation by default, this is built on top of that.

That's sort of a meaningless statement, though. Everything in the world is built on top of RAM, which is serializable. The real complexity comes in how those foundational blocks are abstracted and combined.

We know that it's not performant at scale to represent an entire database log as a single Raft group. So how are the Raft groups organized, and how and when do they cross-communicate?


See the docs [1]. As I understand it, each actor is a Raft group. Each actor has its own independent SQLite WAL.

[1] http://www.actordb.com/docs-howitworks.html


ActorDB sounds like an interesting piece of software. It looks like it's good for simple applications that have large amounts of data. Some thoughts I have based on the docs:

ActorDB has no concurrency at the actor level. This makes it a poor fit for applications that have lots of concurrency around the same pieces of data. A long running read or write on a single actor will lock out any other reads or writes.

Likewise, distributed transactions lock all actors involved in the transaction until the transaction completes.
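A toy illustration of that per-actor serialization, assuming a single lock per actor (my simplification, not ActorDB's internals): a long-running write holds the actor's lock, so even a trivial read must wait for it to finish.

```python
import threading
import time

# Sketch of per-actor serialization: one lock per actor means a slow
# write blocks every other read and write on that same actor.
class LockedActor:
    def __init__(self):
        self.lock = threading.Lock()
        self.log = []

    def run(self, name, duration):
        with self.lock:
            self.log.append(name)
            time.sleep(duration)

shard = LockedActor()
slow = threading.Thread(target=shard.run, args=("slow-write", 0.2))
fast = threading.Thread(target=shard.run, args=("fast-read", 0.0))
slow.start()
time.sleep(0.05)   # ensure the slow write grabs the lock first
fast.start()
slow.join()
fast.join()
# The fast read could not even start until the slow write released the lock.
```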

It seems like reads have to go through a round of Raft. This increases latency for reads. It also decreases throughput, although I'm not sure how big of a deal that is given the lack of concurrency.

It's unclear to me how ActorDB guarantees serializability for multi-actor transactions. You need some way to guarantee two multi-actor transactions will execute in the same order on every actor. Based on the docs, Raft is performed at the actor level and not across multiple actors. ActorDB does use two-phase commit to guarantee atomicity across multiple actors, but there's no description of how it handles serializability.

Based on my reading, ActorDB is good if you have lots of data, your queries have low concurrency, and you don't require high throughput. If you have high concurrency or require high throughput, my guess is ActorDB will be a poor fit.


They have an introductory blog post that explains the database fairly well (summary: it was designed for a shard per user): http://blog.biokoda.com/post/112206754025/why-we-built-actor...

If you are looking at distributed SQLite solutions there is also rqlite/dqlite and bedrock.


I used to work with Pivotal's Greenplum, which is also a distributed DB, and I quite liked it. Basically a PostgreSQL with syntactic sugar for partitioning and distribution across servers. I had the pleasure of never needing any index in the database.

This sounds to be an interesting project as well.

The question to me is always: how will this give projects that use it a small learning curve for new recruits, easy-to-understand integration at the code level, and low maintenance in the long run?


Great to see more databases written in Erlang.


Why would Erlang be good for SQL/relational databases?


The core strengths of Erlang: built-in support for concurrency, distribution, and fault tolerance. There are many languages I enjoy; Erlang I've never learned to its full potential (the syntax, and nobody to pay me to work with it). I would also love to see a full-blown SQL relational database in Rust and D, and maybe a NoSQL one as well.

See:

http://erlang.org/faq/introduction.html


A long time ago, I took on a huge learning task to port the Erlang VM to another language I needed close bindings to. Discovering how the runtime is built, how it depends on other libraries, and how layered and spaghetti-like the implementation has become over the years gave me enough insight to decide that Erlang is not, in itself, the panacea of resilience and fault tolerance that most people value this runtime for.


Erlang has an amazing track record though. It's battle tested in telecom platforms and internet messaging.

Also, I poked around BEAM a fair bit and it seemed just fine to me. It's pretty clean and easy to understand for a fairly complex VM.


I don't quite get this; you are saying Erlang is fault tolerant? I'm confused here because fault tolerance is something that the application using a particular language needs to ensure, right? Sorry if I've mistaken your statement.


Erlang, and especially OTP, "forces" you to follow a standard that usually results in a highly reliable service powered by concise, easy-to-read, maintainable code.


Erlang comes with a philosophy/style on error handling, and its libraries help a lot as well.

Here's a reasonable summary: https://stackoverflow.com/questions/3760881/how-is-erlang-fa...


ferd did a great write-up on how Erlang approaches reliability:

https://ferd.ca/the-zen-of-erlang.html

It's more about assuming things will fail by default with good ways of handling it built-in at the language level. A lot of people also find its inventor's thesis enlightening. Here it is:

http://ftp.nsysu.edu.tw/FreeBSD/ports/distfiles/erlang/armst...

And for those concerned, there's also at least one project to write the native functions in a safer language to reduce their bugs a bit:

https://github.com/hansihe/Rustler


A lot of these concepts have been carried forward to some popular frameworks and languages. Akka (or Akka.Net & Akkling), is a great example. Railway Oriented Programming (https://fsharpforfunandprofit.com/rop/), has gained some traction in the .net sphere and, along with F#'s Result<'success, 'failure> type, emphasizes structured and exception-less error handling.

It may be a poor man's Erlang, but better than nothing ;)


It's the whole ecosystem with Erlang, OTP (Open Telecom Platform) and how they work with supervisors and monitors etc.

Everything is built so that you can operate a telecom switch and never drop that emergency call.

Erlang (the Ericsson language) is not necessary; you can also program for this in Elixir, although personally I prefer Erlang.


Look up the OTP Supervision trees


Handling concurrency and distributed processes is Erlang's forte. Whether ActorDB leverages Erlang's built-in Mnesia (an ACID-compliant distributed database), I am not sure. Erlang's Prolog-like syntax will help define the relational model.


They are Erlang's forte. But the common denominator is the current programmer assigned to the task. A 10x developer will invariably take any language and make it shine. Try that with Erlang on a team of Java developers and you are on for some crazy adventures.


The database itself is SQLite.


Slightly off topic, but I see Go as a good fit too. When I started learning Go I read the source of BuntDB:

https://github.com/tidwall/buntdb

It was very well written and easy to pick up as well.


Would love to learn more about Erlang. Does anyone have any recommendation on blogs about Erlang, perhaps something similar to http://antirez.com ?


Glad you asked. I haven’t refreshed this list for a while so I don’t know how valid the links are, but there should be plenty of useful content here: https://gist.github.com/macintux/6349828


How does this compare to CockroachDB?


Apples to oranges, really. Cockroach presents a single database view that happens to be sharded and replicated in a fault-tolerant, consistent way. ActorDB doesn't.

With ActorDB you have to design your data model to shard at a high level. For example, you could shard it by user; every user would have its own database that ActorDB will shard and replicate for you. That database is for the most part separate from everything else -- it has its own tables and indexes.

ActorDB provides tools to operate on multiple actors, but they're explicit. For example, you can query across multiple actors, but this requires a small SQL declaration at the beginning of the query to select which actors to query, a bit like a "for each <actors> do <some SQL>".
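That "for each" fan-out can be sketched like this. The helper names here are made up and the real ActorDB syntax differs; the point is just the shape of the operation, the same statement run against an explicit list of actors with the rows concatenated:

```python
import sqlite3

# Sketch of "for each <actors> do <SQL>": fan one statement out to an
# explicit list of actors and concatenate the result rows.
def make_actor(*bodies):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (body TEXT)")
    conn.executemany("INSERT INTO comments VALUES (?)", [(b,) for b in bodies])
    return conn

fleet = {
    "user.alice": make_actor("hi"),
    "user.bob": make_actor("yo", "sup"),
}

def for_each(actor_ids, sql):
    rows = []
    for actor_id in actor_ids:        # the caller names the actors explicitly
        rows.extend(fleet[actor_id].execute(sql).fetchall())
    return rows

result = for_each(["user.alice", "user.bob"], "SELECT body FROM comments")
```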

You can also do transactions across multiple actors, though this uses two-phase commit (which coincidentally is the strategy used by Google Spanner), and requires some locking.

So Cockroach pretends to be a classic RDBMS (databases have tables and indexes, but most apps just use a single database per app), allowing an existing app to be ported with little effort. It would be harder to port an app to ActorDB.


Or more like, how does this compare to any database that runs in production at non-trivial scale and load in a company? Does this database add anything to the data storage landscape that is worth noting? I am not sure. There are pretty good and reliable databases out there.


Another popular SQLite + Raft project is rqlite ( https://github.com/rqlite/rqlite ), but the goals seem slightly different (rqlite only uses distribution for fault-tolerance, not for sharding).


Any idea if the creators will offer this "as a service" in the near future? How does this compare to Google's Spanner?


Spanner seems to solve a different problem. It's effectively a multi-master database where the leader election algorithm can be tuned to ensure that your masters are geographically close to where writes/reads are actually happening.

Many databases are “partitioned” by user anyway so in this case the DB can be smarter if it doesn’t have to handle cross partition queries.

Seems like the idea of partitioning a MySQL database taken to the next level.


What part of CAP is being sacrificed here? I'm not seeing it.


It's raft; thus firmly CP.


That does not always follow. A db designer still has the flexibility to give a client less than serializable/linearizable consistency if the client so wishes.

For example, it is possible for the db to allow direct access to any replica, not just the leader, and if the replica is not part of the quorum, the client will see outdated information. This may be just fine for data that seldom changes, or where a certain amount of staleness is just fine.
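A minimal sketch of such a stale read, assuming a leader with a log and a replica that applies it lazily (hypothetical classes, not any real replication API):

```python
# Sketch: a leader and a lagging replica. Reads served by the replica
# can return stale data, which may be fine for rarely-changing values.
class Leader:
    def __init__(self):
        self.data = {}
        self.log = []

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0

    def catch_up(self, leader):
        # Apply only the log entries not yet seen.
        for key, value in leader.log[self.applied:]:
            self.data[key] = value
        self.applied = len(leader.log)

leader, replica = Leader(), Replica()
leader.write("config", "v1")
replica.catch_up(leader)
leader.write("config", "v2")     # not yet replicated

stale = replica.data["config"]   # a read served by the lagging replica
fresh = leader.data["config"]
```

Until the replica calls catch_up again, a client reading from it sees "v1" while the leader already has "v2"; whether that's acceptable is a per-workload decision.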


From that point of view, yes, you are right. There is a ton of nuance in CAP. Usually in these discussions the CAP question is whether the system is eventually consistent or not.


That's the problem... There's not a ton of nuance in CAP. CAP has a formal definition and a system is provably at most CP or AP. Other terms that people often associate with CAP is where the nuance comes in. The reality of the matter is that systems claim to be CP or AP, but most have assumptions or bugs that violate it, so in the end, most systems are just P.

Stale reads technically violate the definition of Consistent in CAP. Using raft means unavailability of some subset of nodes can make the entire system unavailable, violating the definition of Availability in CAP.

So the parent comment's suggestion of serving reads from any replica would result in only a partition tolerant system, without any formal availability or consistency guarantees.


An 'explicitly sharded' SQL database - effectively as many databases as you need with some special sauce to let you run transactions across multiple shards. Neat.


Spanner is a bit like this too if I recall correctly.


It’s like having your cake and eating it too.


I suspect that running transactions over multiple partitions would incur more coordination overhead and probably won’t perform well.


Probably, but what's the alternative?


I don't get the feeling they had an alternative (I don't either!); I suspect it was just commentary.


Yeah, it must be. I don't think there's a way to generically perform well on distributed queries, so this is probably as good as it gets.


Depends on the use case. There are lots of ways that databases can make trade-offs on CAP. Unfortunately there's not a single silver bullet.

You’ll make different choices for an application that’s processing clickstream data than one that’s managing highly relational data.


1. Run two beefy nodes and accept dataloss of unreplicated changes during failover.

2. Use eventual consistency (not suitable for every workload)


I think the data model is similar to virtual actors [0], which have persistent state. There are also CLR- and JVM-specific implementations of this model.

[0] https://www.microsoft.com/en-us/research/project/orleans-vir...


> Think of running a large mail service, dropbox, evernote, etc. They all require server side storage for user data, but the vast majority of queries is within a specific user.

I have this exact use case for a new project I’m working on so I’m fine with the 1-db-per-user approach. ActorDB definitely sounds very interesting, but are there any alternatives to compare it with?


Vitess (https://vitess.io/) is the closest alternative I know of. Vitess came from YouTube, and is used at some other sizable companies. They have a similar model of sharding, they both support the MySQL wire protocol and a binary RPC protocol (gRPC for Vitess, Thrift for ActorDB).


Anybody using this in production? What particular problem were you solving with actordb?


The readme sounds like it promises the end to all databases. I’d like to play with it.


CAP guarantees there isn't "one database to end them all"

You pick your poison.

Glossing over / ignoring / implying they solved CAP is very typical for database marketing.


Sure, go play with it and report when you are done. It's always a matter of time for something claiming to be the last one to become irrelevant and forgotten.


How do schema upgrades work? It looks like a specific actor type is tied to a specific schema, but I assume the schema upgrades must be eventually consistent, unless they are using two-phase commit across all actors of the type?


Schemas are per actor type, and you can have many types. Actors are active or inactive. When you run a query against an inactive actor, it becomes active, checks whether its schema has changed, and applies any changes before processing the query.
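That lazy, per-actor migration scheme can be sketched with SQLite's user_version pragma. The mechanism is my assumption for illustration; the docs don't say how ActorDB actually tracks schema versions:

```python
import sqlite3

# Sketch of lazy per-actor schema migration: each actor records the
# schema version it last applied (here via SQLite's user_version
# pragma) and, on activation, replays any newer migrations before
# serving queries.
MIGRATIONS = [
    "CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)",
    "ALTER TABLE notes ADD COLUMN created_at TEXT",
]

def activate(conn):
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    for migration in MIGRATIONS[version:]:   # only the migrations not yet applied
        conn.execute(migration)
    conn.execute("PRAGMA user_version = %d" % len(MIGRATIONS))
    return conn

shard = activate(sqlite3.connect(":memory:"))
columns = [row[1] for row in shard.execute("PRAGMA table_info(notes)")]
```

Since each actor migrates itself on activation, the fleet as a whole converges to the new schema only eventually, which matches the parent's guess.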


SQLite with Raft. Nice project.


ActorDB might be great for distributed systems. How does it compare to CockroachDB? The latter does transaction planning across shards?

None of these DBs seem to be byzantine fault tolerant, though. Any examples of BFT databases?


> None of these DBs seem to be byzantine fault tolerant, though. Any examples of BFT databases?

Not sure what you mean. Obviously we can't have blockchain level of byzantine fault tolerance against malicious nodes. Only against byzantine failures. But then this is pretty much what CAP answers, where AP systems can handle byzantine failures, because they don't need consensus, while CP can't, because they do. Quorum logic is in the middle here, where you can handle some byzantine failures and have some consistency.


Building on SQLite means no window functions, which is a deal-killer for me.


> with the scalability of a KV store, while keeping the query capabilities of a relational database.

So don't expect good query performance for OLAP/DS workloads. And the "core" DB is in Erlang.


The core database is SQLite.


I love SQLite. But it is not good for OLAP or TPC-DS kinds of workloads.


The actual data store is LMDB.


I wonder how hard would it be to replace SQLite with Postgres or Maria. It would add interesting in-actor query capabilities.


I wonder what are the benefits in comparison to using Akka Persistence with whatever database you want?


Then you're locked into Akka?


The author of this project is locked into Raft. What's the difference?


They are different layers of abstraction.

Storage and compute are two layers in a normal application. You are free to switch the compute layer as long as the storage layer speaks some standard protocol.

With Akka + Akka Persistence, how can you switch the compute layer?


Raft is an algorithm and Akka is a "framework"?


The connection is made over the MySQL protocol, and it is SQL, so you could migrate to MySQL/MariaDB if the need arises.


Akka includes a facility for distributed data based on CRDTs, but I can't recall what it uses for consensus.



Another option for scaling SQL is the Vitess OSS. Might be worth a look. It uses a similar pattern of an actor-like layer up front and then scaling MySQL on the back end.

https://vitess.io/overview/

Would like a similar solution but with Postgres on the back end instead.


Yet another distributed database. How many more do we need? Time to just hide in the cave again. ¯\_(ツ)_/¯



