Nice, but when you're scaling to such a large size, you probably want to go NoSQL, unless you use MySQL as a key-value store (which I've heard some people do).
Another thing: many applications can get away with flushing to disk much less often, say every 30 seconds or even longer if appropriate. When a machine fails, you then lose the 30 seconds' worth of updates stored in that machine's RAM, which may be fine. For some applications you could probably stand losing even a few minutes of updates on a failure, while gaining a huge decrease in load on the disk.
Yes, you could buffer writes. Tokyo Cabinet does this, for instance, as does Redis. There was an article on the MySQL Performance Blog about Tokyo Cabinet and how fsync()ing at 1 Hz didn't adversely affect performance; there are other issues with TC, but that's not my point.
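Just to make the idea concrete, here's a minimal sketch of delayed durability: accept writes into RAM and only touch the disk (write + fsync) once per interval. This is not how Redis or Tokyo Cabinet actually implement it; the file path, record format, and flush interval are all made up for illustration.

```python
import os
import threading
import time

class BufferedWriter:
    """Sketch only: batch updates in RAM and write + fsync once per
    flush interval. Anything buffered since the last flush is lost if
    the process dies -- that's the tradeoff being discussed above."""

    def __init__(self, path, flush_interval=1.0):   # 1 Hz, or 30.0 for 30s
        self.buffer = []
        self.lock = threading.Lock()
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
        self.flush_interval = flush_interval
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def write(self, record):
        # Cheap: just append to an in-memory list, no disk I/O here.
        with self.lock:
            self.buffer.append(record)

    def _flush_loop(self):
        while True:
            time.sleep(self.flush_interval)
            with self.lock:
                pending, self.buffer = self.buffer, []
            if pending:
                os.write(self.fd, ("\n".join(pending) + "\n").encode())
                os.fsync(self.fd)  # one fsync per interval, not per write
```

The knob is just the interval: 1 Hz fsyncs like the TC article, or 30 seconds if you can tolerate losing that much on a crash.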
I suppose a system that buffers data in RAM for up to 30s (and could lose it) could replicate to other hosts in the cluster, on the (perhaps naïve) intuition that the probability of more than one host going down within the same 30s window is low -- "someone's got the data somewhere!". Of course, if they're all on the same source of power... :)
One could use a consistent hashing scheme, as employed in Riak and others, to route read operations in such a setup and thus be a bit more fault-tolerant (hand-waving a bit).
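For the hand-waving part, here's a toy version of that consistent-hashing idea: each key hashes onto a ring and the first few distinct nodes clockwise from it form a preference list you could read from. This is the general technique, not Riak's actual implementation; node names and virtual-node count are invented.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring: a key maps to the first node clockwise
    from its hash; the next distinct nodes act as read replicas."""

    def __init__(self, nodes, vnodes=64):
        self.ring = []  # sorted list of (hash, node), with virtual nodes
        for node in nodes:
            for i in range(vnodes):
                bisect.insort(self.ring, (self._hash(f"{node}-{i}"), node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def preference_list(self, key, n=3):
        """Return the first n distinct nodes clockwise from the key."""
        idx = bisect.bisect(self.ring, (self._hash(key), ""))
        nodes = []
        for i in range(len(self.ring)):
            _, node = self.ring[(idx + i) % len(self.ring)]
            if node not in nodes:
                nodes.append(node)
            if len(nodes) == n:
                break
        return nodes

ring = HashRing(["hostA", "hostB", "hostC", "hostD"])
print(ring.preference_list("node1page1"))  # any of these can serve the read
```

If the first host in the list is down, a read can fall through to the next one, which is where the extra fault tolerance comes from.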
All this without any hard-earned experience other than reading... Thoughts?
I have no hard-earned experience with this myself. I've just been thinking about it a lot because I have a potentially write-intensive application in the works, and I'm anticipating performance issues with writes (but I should probably find out whether it actually is an issue). My goal is to squeeze as much performance as I can from as few machines as possible - I'm poor.
To explain my use case: I'm basically storing a graph in MongoDB, which is a document/object based database. Reads would consist of enumerating a node's edges. Writes would be adding nodes/edges and updating edge weights. It would perform best on reads if I just had one contiguous object for each node, holding information about all the nodes connected to it, as well as the edge weights. But since some nodes may have too many connected nodes, they are divided in the database by the page they would show up on in the application, as ranked by edge weight. So an object with id 'node1page1' holds information about the top 10 ranked connecting nodes.
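Roughly, I picture the paged documents looking something like this (field names are made up for illustration, not my actual schema):

```python
# Hypothetical shape of one paged node document.
node1_page1 = {
    "_id": "node1page1",   # node id + page number, so reads are one lookup
    "node": "node1",
    "page": 1,
    "edges": [             # top 10 neighbours, ranked by edge weight
        {"to": "node7", "weight": 42},
        {"to": "node3", "weight": 31},
        # ... up to 10 entries
    ],
}
# A read is then just:  db.nodes.find_one({"_id": "node1page1"})
```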
Now, if that made any sense, the problem: edge weights change according to votes from users. So when a weight changes, ranks change, and I've structured things in such a way that the computation of ranks is basically pushed to the write rather than the read. So when someone checks 'node1page2' and votes a node on page 2 up, the corresponding edge weight changes and may require that node's information be shifted to 'node1page1', while the lowest edge on 'node1page1' is shifted to 'node1page2'.
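The write path then looks roughly like the sketch below (pymongo-style, but collection and field names are assumptions, and it ignores concurrency and atomicity entirely):

```python
PAGE_SIZE = 10

def upvote(db, node, neighbour, page):
    """Sketch of the write described above: bump an edge's weight, and if
    it now outranks the weakest edge on the previous page, swap the two
    entries between pages. Not atomic; illustration only."""
    doc = db.nodes.find_one({"_id": f"{node}page{page}"})
    voted = None
    for edge in doc["edges"]:
        if edge["to"] == neighbour:
            edge["weight"] += 1
            voted = edge
    if voted is None:
        return
    doc["edges"].sort(key=lambda e: e["weight"], reverse=True)

    if page > 1:
        prev = db.nodes.find_one({"_id": f"{node}page{page - 1}"})
        weakest = min(prev["edges"], key=lambda e: e["weight"])
        if voted["weight"] > weakest["weight"]:
            # Promote the voted edge to the previous page, demote the weakest.
            prev["edges"].remove(weakest)
            prev["edges"].append(voted)
            doc["edges"].remove(voted)
            doc["edges"].append(weakest)
            prev["edges"].sort(key=lambda e: e["weight"], reverse=True)
            doc["edges"].sort(key=lambda e: e["weight"], reverse=True)
            db.nodes.replace_one({"_id": prev["_id"]}, prev)

    db.nodes.replace_one({"_id": doc["_id"]}, doc)
```

So a single vote can touch two documents, which is exactly why the write side gets complicated.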
So what's happened here is that while I've made reads fast (just a lookup), writes have become complicated. And here's where the buffering of writes comes in. Perhaps I'm overthinking all this, but like I said, I'm poor.
In-memory replication would definitely be useful, especially for a larger cluster with some spare capacity. You'd probably have good power redundancy when running a large cluster, so single-machine failures would be much more common than a total cluster failure, I'd think (no hard data here, but it seems reasonable enough). So I don't think it's all that naïve to assume that more than one host would rarely go down within the same 30s.
Even if the whole cluster does go down, you should only be doing this with low-value data. In my application's case, losing 30 seconds of votes is no big deal at all, I'd be much more concerned with getting the machine(s) back up.
I'm in the same low-cost-seeking mode as you, self-funding a teeny "startup" for iPhone services (along with 1000s of other developers), including multiplayer gaming (along with 10s of other developers).
While I can get good dedicated server deals, they are still too expensive for me. So I'm diving in deep to learn how to create an almost purely scale-out solution based on a larger (than dedicated) number of VPS instances.
I really want to be able to have a script that watches load and dynamically adds more capacity. I'm not a fan of calling this "cloud"-based, but the ideas are pretty similar.
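In spirit, it's not much more than a watch loop like this; the threshold and the provision_new_vps() hook are placeholders for whatever provider API you'd actually call:

```python
import os
import time

LOAD_THRESHOLD = 4.0  # made-up number; tune to your instance size

def provision_new_vps():
    # Hypothetical hook: this is where the VPS provider's API call would go.
    print("would spin up another instance here")

while True:
    one_min_load, _, _ = os.getloadavg()
    if one_min_load > LOAD_THRESHOLD:
        provision_new_vps()
    time.sleep(60)
```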
Good luck with your project! BTW, have you looked at Neo4j (which is a graph database that seems to get high marks)?
Good luck to you too! You've got an interesting approach to scaling, hope it works out (cheaply).
Yes, I've looked at Neo4j, and I would've tried that if it weren't for this (from their home page):
Neo4j is released under a dual free software/commercial license model (which basically means that it's "open source", but if you're interested in using it commercially, then you must buy a commercial license).
Thus MongoDB. I just heard of Riak the other day, and might check that out too since it has the concept of links built in. I'm also still early enough in the work that I can afford to switch.