ActorDB – Distributed SQL database (github.com/biokoda)
284 points by iamd3vil on June 17, 2018 | 85 comments



ActorDB is a neat project. Simplified, it's a distributed SQLite where consistency is guaranteed by the Raft protocol.

However, it has a unique and initially confusing data model where the database is divided into "actors", which are self-contained shards. For example, a database modeled on Hacker News would probably have an actor per user, an actor per story, and perhaps one per thread.

Every actor acts like a self-contained database with its own set of tables. When you want to query or update data, you first tell it which actor(s) to operate on. Unlike systems such as Cassandra, the sharding is explicit: the shards have identifiers, and there's no automatic sharding function. Indeed, you can have actors that act like a basic database; e.g., a global, shared list of lookup values could be a single actor.

You have ACID transactions within actors, you can do transactions across actors, and queries can span multiple actors. You can't do joins across actors, as far as I recall. Schema migrations also become interesting, since each actor has its own, entirely separate schema.
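The per-actor isolation can be sketched in a few lines. This is a toy model using plain in-memory SQLite with made-up actor names, not ActorDB's API:

```python
import sqlite3

# Toy sketch: each "actor" is its own self-contained SQLite database.
# Here actors live in memory; ActorDB would replicate each one via Raft.
actors = {}

def actor(actor_id):
    """Fetch or lazily create the named actor's private database."""
    if actor_id not in actors:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT)")
        actors[actor_id] = conn
    return actors[actor_id]

# Every statement names its actor explicitly -- there is no automatic
# shard function mapping a key to a shard.
actor("user.alice").execute("INSERT INTO comments (body) VALUES ('hello')")
actor("user.bob").execute("INSERT INTO comments (body) VALUES ('world')")

# Each actor's tables are invisible to every other actor.
rows = actor("user.alice").execute("SELECT body FROM comments").fetchall()
```

The point of the sketch is that "user.alice" and "user.bob" share a schema but no data; a query only ever sees the actor it named.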


> You have ACID transactions within actors, and you can also do transactions across actors

I gotta say... this project strikes me as very strange. If you're going to go through all the trouble of sharding your data across a set of HA actors... why not also shard behavior? Co-locate data and behavior! Isn't that a key driver of the actor model? Then you wouldn't communicate with your actors using a very limited language like SQL; you could send them real domain-specific messages. Most importantly, when it comes to transactions across actors... you don't need them. All you need is guaranteed message delivery to the various actor mailboxes, and all that complexity melts away. You need this anyway if you want your actors to be able to actually collaborate and send each other messages. This whole project seems designed to eliminate many of the benefits of the actor model...

(I don't like the all-too-common HN cynicism but I also worry when I see stuff like this. Either I'm crazy and missing something obvious... or everybody else is crazy. But then much of what comes out of the Erlang space is... strange to me. That culture seems to have a very unique idea of distributed computing.)


I think you have to think of the "actors" in ActorDB as shards, not as actors in the "actor model" sense. The nomenclature here is unfortunate, because actors themselves are sharded.

This is a database that permits any application to read and write data without coupling it to a specific language or platform, e.g. Erlang. You just write SQL. People know SQL. The difference is that your app must be shard-aware, which can be an acceptable compromise for apps that want to scale far. The sharding also gives you some measure of enforced encapsulation, since you can't have foreign keys (that I know of) across actors.

This project might be particularly suitable for SaaS-type multitenant systems that need to have a clear boundary between each tenant.


Admittedly, I read through all the docs in one rapid sitting, but I thought I came across mention of foreign keys across actors being supported. I’ll have to double-check that, though.


This model is pretty interesting and might make GDPR compliance easier, if I'm understanding it correctly


In terms of GDPR compliance the fundamental element is pseudonymization of internal dataflows and data management -- that'll work with any data store.

I mean, there is some possible advantage by having all users in their own actor-namespace, for bulk deletes and concrete mapping of data-usage, but IME you're always stuck with a need to preserve the operational state of the application and the quality of its legacy data.

Regardless of sharding strategy, somehow your datawarehouse needs to be able to pull out or confirm data from the previous quarter...


You have a great knack for explaining these concepts clearly for a beginner like myself. If you ever consider writing a technical book, I think it would be great!


Digging through the docs, I can't find any actual information on the consistency and isolation guarantees other than "consistent" and "ACID". That's an immediate red flag for me. What are the actual isolation characteristics of the database?

Does it use snapshot isolation? Is it serializable? Is it linearizable? With all the great work Kyle Kingsbury (aka Aphyr) has done on the Jepsen tests, it's pretty clear that claiming to be "ACID" with no additional info isn't sufficient for a modern database.


Most docs for technical projects are written that way these days. The author of an API, project, or solution assumes unquestioned adoption "because it's free" from a random audience, with no specifics on what they are solving or, more importantly, how it differs from other options. I don't know if this is laziness or writing procrastination, but assuming that your users know as much as you learned on your journey to assemble a solution has become a frustration when I discover new projects like this one.


Likely has a lot more to do with the fact that the intersection between the population that can create such projects and the population that can write clear effective documentation is exceedingly small.


I would also point to Paul Graham's insight around hackers and painters[0]: the population that wants to create such projects is fantasizing about coding, hacking, publishing, bug squashing, and launches. Documentation isn't necessarily pulling them to their PC after a heavy dinner.

-

[0] http://www.paulgraham.com/hp.html


I feel as if you have a bit of a tone against these authors.

Would you rather the software not exist at all?


SQLite uses serializable isolation by default, this is built on top of that.

I'm also curious how Raft is used and what the resulting guarantees on distributed table operations are, but I'm coming at it assuming good faith, since SQLite is... robust.


Explained in the docs [1]. They replaced the SQLite WAL with one that uses Raft.

Each actor has its own WAL, and so I suppose only operations within a single actor are consistent. Multi-actor transactions use two-phase commit.
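The cross-actor part can be sketched as a classic two-phase commit. The Actor class and its methods below are hypothetical; this only shows the coordinator's control flow, not ActorDB's actual wire protocol:

```python
# Toy two-phase commit across actors, assuming each actor can stage a
# write tentatively (prepare) and later commit or abort it. ActorDB
# layers this on per-actor Raft logs; this is only the control flow.
class Actor:
    def __init__(self):
        self.committed = {}
        self.staged = None

    def prepare(self, key, value):
        # A real actor would vote "no" here if it is locked or the
        # write conflicts with its local state.
        self.staged = (key, value)
        return True

    def commit(self):
        key, value = self.staged
        self.committed[key] = value
        self.staged = None

    def abort(self):
        self.staged = None

def run_transaction(participants, key, value):
    # Phase 1: every participant must vote yes.
    if all(a.prepare(key, value) for a in participants):
        # Phase 2: commit everywhere.
        for a in participants:
            a.commit()
        return True
    # Any "no" vote aborts the whole transaction.
    for a in participants:
        a.abort()
    return False
```

Note how the transaction is all-or-nothing across participants, which is exactly the atomicity guarantee the docs claim for multi-actor writes.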

[1] http://www.actordb.com/docs-howitworks.html


> SQLite uses serializable isolation by default, this is built on top of that.

That's sort of a meaningless statement, though. Everything in the world is built on top of RAM, which is serializable. The real complexity comes in how those foundational blocks are abstracted and combined.

We know that it's not performant at scale to represent an entire database log as a single Raft group. So how are the Raft groups organized, and how and when do they cross-communicate?


See the docs [1]. As I understand it, each actor is a Raft group. Each actor has its own independent SQLite WAL.

[1] http://www.actordb.com/docs-howitworks.html


ActorDB sounds like an interesting piece of software. It looks like it's good for simple applications that have large amounts of data. Some thoughts I have based on the docs:

ActorDB has no concurrency at the actor level. This makes it a poor fit for applications that have lots of concurrency around the same pieces of data. A long running read or write on a single actor will lock out any other reads or writes.

Likewise, distributed transactions lock all actors involved in the transaction until the transaction completes.
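A toy illustration of that per-actor serialization, assuming a single lock per actor (my simplification, not ActorDB's internals): a long-running write holds the actor's lock, so even a trivial read must wait for it to finish.

```python
import threading
import time

# Sketch of per-actor serialization: one lock per actor means a slow
# write blocks every other read and write on that same actor.
class LockedActor:
    def __init__(self):
        self.lock = threading.Lock()
        self.log = []

    def run(self, name, duration):
        with self.lock:
            self.log.append(name)
            time.sleep(duration)

shard = LockedActor()
slow = threading.Thread(target=shard.run, args=("slow-write", 0.2))
fast = threading.Thread(target=shard.run, args=("fast-read", 0.0))
slow.start()
time.sleep(0.05)   # ensure the slow write grabs the lock first
fast.start()
slow.join()
fast.join()
# The fast read could not even start until the slow write released the lock.
```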

It seems like reads have to go through a round of Raft. This increases latency for reads. It also decreases throughput, although I'm not sure how big of a deal that is given the lack of concurrency.

It's unclear to me how ActorDB guarantees serializability for multi-actor transactions. You need some way to guarantee two multi-actor transactions will execute in the same order on every actor. Based on the docs, Raft is performed at the actor level and not across multiple actors. ActorDB does use two-phase commit to guarantee atomicity across multiple actors, but there's no description of how it handles serializability.

Based on my reading, ActorDB is good if you have lots of data, your queries have low concurrency, and you don't require high throughput. If you have high concurrency or require high throughput, my guess is ActorDB will be a poor fit.


They have an introductory blog post that explains the database fairly well (summary: it was designed for a shard per user): http://blog.biokoda.com/post/112206754025/why-we-built-actor...

If you are looking at distributed SQLite solutions there is also rqlite/dqlite and bedrock.


I used to work with Pivotal's Greenplum, which is also a distributed DB, and I quite liked it. Basically a PostgreSQL with syntactic sugar for partitioning and distribution across servers. I had the pleasure of never needing any index in the database.

This sounds to be an interesting project as well.

The question to me is always: how will this give projects that use it a small learning curve for new recruits, easy-to-understand integration at the code level, and low maintenance in the long run?


Great to see more databases written in Erlang.


Why would Erlang be good for SQL/relational databases?


The core strengths of Erlang: built-in support for concurrency, distribution, and fault tolerance. There are many languages I enjoy; Erlang I've never learned to its full potential (the syntax, and nobody to pay me to work with it). I would also love to see a full-blown SQL relational database in Rust and D, and maybe a NoSQL one as well.

See:

http://erlang.org/faq/introduction.html


A long time ago, I took on a huge learning task to port the Erlang VM to another language I needed close bindings to. Discovering how the runtime is built, how it depends on other libraries, and how layered and spaghetti-like the implementation has become over the years gave me enough insight to decide that Erlang is not, in itself, the panacea of resilience and fault tolerance that most people value this runtime for.


Erlang has an amazing track record though. It's battle tested in telecom platforms and internet messaging.

Also, I poked around BEAM a fair bit and it seemed just fine to me. It's pretty clean and easy to understand for a fairly complex VM.


I don't quite get this; you are saying Erlang is fault tolerant? I'm confused here because fault tolerance is something that the application using a particular language needs to ensure, right? Sorry if I've mistaken your statement.


Erlang, and especially OTP, "forces" you to follow a standard that usually results in a highly reliable service powered by concise, easy-to-read, maintainable code.


Erlang comes with a philosophy/style on error handling, and its libraries help a lot as well.

Here's a reasonable summary: https://stackoverflow.com/questions/3760881/how-is-erlang-fa...


ferd did a great write-up on how Erlang approaches reliability:

https://ferd.ca/the-zen-of-erlang.html

It's more about assuming things will fail by default with good ways of handling it built-in at the language level. A lot of people also find its inventor's thesis enlightening. Here it is:

http://ftp.nsysu.edu.tw/FreeBSD/ports/distfiles/erlang/armst...

And for those concerned, there's also at least one project to write the native functions in a safer language to reduce their bugs a bit:

https://github.com/hansihe/Rustler


A lot of these concepts have been carried forward to some popular frameworks and languages. Akka (or Akka.Net & Akkling), is a great example. Railway Oriented Programming (https://fsharpforfunandprofit.com/rop/), has gained some traction in the .net sphere and, along with F#'s Result<'success, 'failure> type, emphasizes structured and exception-less error handling.

It may be a poor man's Erlang, but better than nothing ;)


It's the whole ecosystem with Erlang, OTP (Open Telecom Platform) and how they work with supervisors and monitors etc.

Everything is built so that you can operate a telecom switch and never drop that emergency call.

Erlang (the Ericsson language) is not necessary; you can also program for this in Elixir, although personally I prefer Erlang.


Look up the OTP Supervision trees


Handling concurrency and distributed processes is Erlang's forte. Whether ActorDB leverages Erlang's built-in Mnesia (an ACID-compliant distributed database), I am not sure. Erlang's Prolog-like syntax will help define the relational model.


They are Erlang's forte. But the common denominator is the current programmer assigned to the task. A 10x developer will invariably take any language and make it shine. Try that with Erlang on a team of Java developers and you are on for some crazy adventures.


The database itself is SQLite.


Slightly off topic, but I see Go as a good fit too. When I started learning Go I read the source of BuntDB:

https://github.com/tidwall/buntdb

It was very well written and easy to pick up as well.


Would love to learn more about Erlang. Does anyone have any recommendation on blogs about Erlang, perhaps something similar to http://antirez.com ?


Glad you asked. I haven’t refreshed this list for a while so I don’t know how valid the links are, but there should be plenty of useful content here: https://gist.github.com/macintux/6349828


How does this compare to CockroachDB?


Apples to oranges, really. Cockroach presents a single database view that happens to be sharded and replicated in a fault-tolerant, consistent way. ActorDB doesn't.

With ActorDB you have to design your data model to shard at a high level. For example, you could shard it by user; every user would have its own database that ActorDB will shard and replicate for you. That database is for the most part separate from everything else -- it has its own tables and indexes.

ActorDB provides tools to operate on multiple actors, but they're explicit. For example, you can query across multiple actors, but this requires a small SQL declaration at the beginning of the query to select which actors to query, a bit like a "for each <actors> do <some SQL>".
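That "for each" fan-out can be sketched like this. The helper names here are made up and the real ActorDB syntax differs; the point is just the shape of the operation, the same statement run against an explicit list of actors with the rows concatenated:

```python
import sqlite3

# Sketch of "for each <actors> do <SQL>": fan one statement out to an
# explicit list of actors and concatenate the result rows.
def make_actor(*bodies):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE comments (body TEXT)")
    conn.executemany("INSERT INTO comments VALUES (?)", [(b,) for b in bodies])
    return conn

fleet = {
    "user.alice": make_actor("hi"),
    "user.bob": make_actor("yo", "sup"),
}

def for_each(actor_ids, sql):
    rows = []
    for actor_id in actor_ids:        # the caller names the actors explicitly
        rows.extend(fleet[actor_id].execute(sql).fetchall())
    return rows

result = for_each(["user.alice", "user.bob"], "SELECT body FROM comments")
```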

You can also do transactions across multiple actors, though this uses two-phase commit (which coincidentally is the strategy used by Google Spanner), and requires some locking.

So Cockroach pretends to be a classic RDBMS (databases have tables and indexes, but most apps just use a single database per app), allowing an existing app to be ported with little effort. It would be harder to port an app to ActorDB.


Or more like, how does this compare to any database that runs in production at non-trivial scale and load in a company? Does this database add anything to the data storage landscape that is worth noting? I am not sure. There are pretty good and reliable databases out there.


Another popular SQLite + Raft project is rqlite ( https://github.com/rqlite/rqlite ), but the goals seem slightly different (rqlite only uses distribution for fault-tolerance, not for sharding).


Any idea if the creators will offer this "as a service" in the near future? How does this compare to Google's Spanner?


Spanner seems to solve a different problem. It's effectively a multi-master database where the leader election algorithm can be tuned to ensure that your masters are geographically close to where writes/reads are actually happening.

Many databases are “partitioned” by user anyway so in this case the DB can be smarter if it doesn’t have to handle cross partition queries.

Seems like the idea of partitioning a MySQL database taken to the next level.


What part of CAP is being sacrificed here? I'm not seeing it.


It's raft; thus firmly CP.


That does not always follow. A db designer still has the flexibility to give a client less than serializable/linearizable consistency if the client so wishes.

For example, it is possible for the db to allow direct access to any replica, not just the leader, and if the replica is not part of the quorum, the client will see outdated information. This may be just fine for data that seldom changes, or where a certain amount of staleness is just fine.
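A minimal sketch of such a stale read, assuming a leader with a log and a replica that applies it lazily (hypothetical classes, not any real replication API):

```python
# Sketch: a leader and a lagging replica. Reads served by the replica
# can return stale data, which may be fine for rarely-changing values.
class Leader:
    def __init__(self):
        self.data = {}
        self.log = []

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0

    def catch_up(self, leader):
        # Apply only the log entries not yet seen.
        for key, value in leader.log[self.applied:]:
            self.data[key] = value
        self.applied = len(leader.log)

leader, replica = Leader(), Replica()
leader.write("config", "v1")
replica.catch_up(leader)
leader.write("config", "v2")     # not yet replicated

stale = replica.data["config"]   # a read served by the lagging replica
fresh = leader.data["config"]
```

Until the replica calls catch_up again, a client reading from it sees "v1" while the leader already has "v2"; whether that's acceptable is a per-workload decision.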


From that point of view, yes, you are right. There is a ton of nuance in CAP. Usually in these discussions the CAP question is whether the system is eventually consistent or not.


That's the problem... There's not a ton of nuance in CAP. CAP has a formal definition and a system is provably at most CP or AP. Other terms that people often associate with CAP is where the nuance comes in. The reality of the matter is that systems claim to be CP or AP, but most have assumptions or bugs that violate it, so in the end, most systems are just P.

Stale reads technically violate the definition of Consistent in CAP. Using raft means unavailability of some subset of nodes can make the entire system unavailable, violating the definition of Availability in CAP.

So the parent comment's suggestion of serving reads from any replica would result in only a partition tolerant system, without any formal availability or consistency guarantees.


An 'explicitly sharded' SQL database - effectively as many databases as you need with some special sauce to let you run transactions across multiple shards. Neat.


Spanner is a bit like this too if I recall correctly.


It’s like having your cake and eating it too.


I suspect that running transactions over multiple partitions would incur more coordination overhead and probably won’t perform well.


Probably, but what's the alternative?


I don't get the feeling they had an alternative (I don't either!); I suspect it was just commentary.


Yeah, it must be. I don't think there's a way to generically perform well on distributed queries, so this is probably as good as it gets.


Depends on the use case. There are lots of ways that databases can make trade-offs on CAP. Unfortunately there's not a single silver bullet.

You’ll make different choices for an application that’s processing clickstream data than one that’s managing highly relational data.


1. Run two beefy nodes and accept dataloss of unreplicated changes during failover.

2. Use eventual consistency (not suitable for every workload)


I think the data model is similar to virtual actors [0], which have persistent state. There are also CLR- and JVM-specific implementations of this model.

[0] https://www.microsoft.com/en-us/research/project/orleans-vir...


> Think of running a large mail service, dropbox, evernote, etc. They all require server side storage for user data, but the vast majority of queries is within a specific user.

I have this exact use case for a new project I’m working on so I’m fine with the 1-db-per-user approach. ActorDB definitely sounds very interesting, but are there any alternatives to compare it with?


Vitess (https://vitess.io/) is the closest alternative I know of. Vitess came from YouTube, and is used at some other sizable companies. They have a similar model of sharding, they both support the MySQL wire protocol and a binary RPC protocol (gRPC for Vitess, Thrift for ActorDB).


Anybody using this in production? What particular problem were you solving with actordb?


The readme sounds like it promises the end to all databases. I’d like to play with it.


CAP guarantees there isn't "one database to end them all"

You pick your poison.

Glossing over / ignoring / implying they solved CAP is very typical for database marketing.


Sure, go play with it and report when you are done. It's always a matter of time for something claiming to be the last one to become irrelevant and forgotten.


How do schema upgrades work? It looks like a specific actor type is tied to a specific schema, but I assume the schema upgrades must be eventually consistent, unless they are using two-phase commit across all actors of the type?


Schemas are per actor type, and you can have many types. Actors are active or inactive. When you run a query against an inactive actor, it becomes active, checks whether its schema has changed, and applies any changes before processing the query.
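That lazy, per-actor migration scheme can be sketched with SQLite's user_version pragma. The mechanism is my assumption for illustration; the docs don't say how ActorDB actually tracks schema versions:

```python
import sqlite3

# Sketch of lazy per-actor schema migration: each actor records the
# schema version it last applied (here via SQLite's user_version
# pragma) and, on activation, replays any newer migrations before
# serving queries.
MIGRATIONS = [
    "CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)",
    "ALTER TABLE notes ADD COLUMN created_at TEXT",
]

def activate(conn):
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    for migration in MIGRATIONS[version:]:   # only the migrations not yet applied
        conn.execute(migration)
    conn.execute("PRAGMA user_version = %d" % len(MIGRATIONS))
    return conn

shard = activate(sqlite3.connect(":memory:"))
columns = [row[1] for row in shard.execute("PRAGMA table_info(notes)")]
```

Since each actor migrates itself on activation, the fleet as a whole converges to the new schema only eventually, which matches the parent's guess.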


SQLite with Raft. Nice project.


ActorDB might be great for distributed systems. How does it compare to CockroachDB? The latter does transaction planning across shards?

None of these DBs seem to be byzantine fault tolerant, though. Any examples of BFT databases?


> None of these DBs seem to be byzantine fault tolerant, though. Any examples of BFT databases?

Not sure what you mean. Obviously we can't have blockchain level of byzantine fault tolerance against malicious nodes. Only against byzantine failures. But then this is pretty much what CAP answers, where AP systems can handle byzantine failures, because they don't need consensus, while CP can't, because they do. Quorum logic is in the middle here, where you can handle some byzantine failures and have some consistency.


Building on SQLite means no window functions, which is a deal-killer for me.


> with the scalability of a KV store, while keeping the query capabilities of a relational database.

So don't expect good query performance for OLAP/DS workloads. And the "core" DB is in Erlang.


The core database is SQLite.


I love SQLite. But it is not good for OLAP or TPC-DS kinds of workloads.


The actual data store is LMDB.


I wonder how hard would it be to replace SQLite with Postgres or Maria. It would add interesting in-actor query capabilities.


I wonder what are the benefits in comparison to using Akka Persistence with whatever database you want?


Then you're locked into Akka?


The author of this project is locked into Raft. What's the difference?


They are different layers of abstraction.

Storage and compute are two layers in a normal application. You are free to switch the compute layer as long as the storage layer speaks some standard protocol.

With Akka + Akka Persistence, how can you switch the compute layer?


Raft is an algorithm and Akka is a "framework"?


The connection is made over the MySQL protocol, and it is SQL, so you could migrate to MySQL/MariaDB if the need arises.


Akka includes a facility for distributed data based on CRDTs, but I can't recall what it uses for consensus.



Another option for scaling SQL is the Vitess OSS. Might be worth a look. It uses a similar pattern of an actor-like layer up front and then scaling MySQL on the back end.

https://vitess.io/overview/

Would like a similar solution but with Postgres on the back end instead.


Yet another distributed database. How many more do we need? Time to just hide in the cave again. ¯\_(ツ)_/¯



