Redis Sharding at Craigslist

Smerity · on Feb 27, 2011

I love Redis a great deal but I can't wait for the persistence issues to be worked out.

The fact that they're restricting the working memory set to 24GB on a 32GB machine purely to prevent death by SAVE or BGSAVE is quite concerning. For my personal use cases the speed of the SAVE can become drastically long as well - this becomes even worse if you need to factor in a hosting solution with temperamental disk I/O such as Amazon EC2.

Redis Diskstore is something I'm eagerly waiting for but for now the issues with the Redis persistence layer meant I can't use Redis in my recent project. Here's to a speedy release of Redis 2.4! =]

davidhollander · on Feb 27, 2011

>I can't wait for the persistence issues to be worked out.

Why use an in memory database for persistence instead of for cacheing? Considering that RAM is the scarcest and thus most valuable resource on nearly every server, in memory DBs like Redis are better used for storing already rendered data ready for output rather than every bit of raw data that always needs to be duplicated or logged on disk at the end of the day anyway. Persistence==disk!=memory.

>Redis Diskstore is something I'm eagerly waiting for

If your data can be represented using hashmaps (unordered) and b+trees (ordered), check out TokyoCabinet for on-disk persistance. It has the fastest on-disk hashmaps and b+trees (of non-fixed size) and has been around for a while.

Smerity · on Feb 27, 2011

The point is Redis is beginning to move away from being just an in-memory database. See my other comment for a more detailed answer as to the use case.

Persistent caches are an important commodity for many websites. For some services trying to handle standard traffic patterns with an empty cache is suicide and can cause cascading failures across the board. Without a strong persistence solution Redis can't be used here. antirez mentioned he considered it a strange move for Reddit to use to Cassandra and not Redis as a persistent cache[1] and I think the issues with Redis persistence may well have caused that.

Additionally I still think there's reasonable ground for a database that's primarily in-memory but drops least used data off to disk. The vast majority of the web follows a Zipfian/long tail distribution so although your "working dataset" can be far larger than RAM your actual "active working dataset" can fit in there. Why trade away the advantages of an in-memory data structure driven database when almost all your queries can be satisfied in this manner?

[1] http://www.reddit.com/r/programming/comments/bcqhi/reddits_n...

davidhollander · on Feb 27, 2011

>for a database that's primarily in-memory but drops least used data off to disk

>so although your "working dataset" can be far larger than RAM your actual "active working dataset" can fit in there

To me that's the exact same pattern as cacheing, merely stated in reverse. If you want full persistence, everything's going to have to hit the disk anyway no matter how you phrase it and how and when you store it. In regards to your user logs problem stated below, that sounds easy to parallelize. I would hash+modulo the username to a data server number, on the selected data server traverse a B+ tree of dates\entries in order, and buffer or stream until 20 entries matching the usernames have been retrieved in response. High performance and fully persistent.

Waiting for a new memory+disk database pattern that does everything more intelligently and faster than every other disk+memory pattern seems like procrastination of parallelization, which is the unavoidable long term answer to these problems.

simonw · on Feb 27, 2011

Just out of interest, what would you have been using Redis for in your recent project?

I'm still using MySQL as my primary store, with a Redis denormalised layer on the side for set intersections, random item selection and the like.

Smerity · on Feb 27, 2011

It's similar to a per user log - a user has an entry from time X, time Y and so on and you're likely only interested in the last N entries. Let's say for this case that N = 20.

This is a near perfect use case for Redis lists. I say near perfect as you still have the problem that if you have an incredibly active user (which we have many of) with thousands of log entries (of which they'd only ever likely be looking at the last few hundred) as you still need to keep that entire chunk in memory. It'd be great if you could archive past the last 1000 onto disk - you can do that with client side logic but it's still a bit of a pain and then you need to hope that Redis knows that those "historical" lists don't need to clog up the cache.

MySQL isn't up to the task in that case as if those entries aren't cached (which they likely aren't) then it needs to retrieve 20 rows from disk. Those rows on disk aren't sequential and with a single access on disk being 10ms that's 0.2 seconds per query or a total of 5 queries per second. The project additionally runs on EC2 so disk IO (especially random reads) are temperamental at best. If the data was grouped or contiguous (as in the case of Redis Diskstore, BigTable style DBs etc) then you may be able to get upwards of 100 queries per second if a random read is 10 ms.

Even if we were using Redis or memcache in front of MySQL and requests came in only as we could handle them we'd only be able to serve ~430,000 requests per day. We have more users than that, the queries don't come in consistently (i.e. we still have to worry about peaks) and that's additionally not worrying about cache invalidation. Due to how slow this is it'd be nice to take snapshots to make Redis a persistent cache but then you hit the same initial problems I mentioned before.

The issue with using Redis as the primary datastore is related to the persistence issues I mentioned previously. Saves can end up taking a long time and chew up a lot more memory than I'm comfortable with. This is complicated by the fact that it'd be preferable to have some chunk of the cold dataset stored on disk but the Virtual Memory persistence option is not really suited and additionally makes saves even worse (as now you're reading from a slow disk IO device [the VM files] to save to a different slow disk IO device [the database backup] whilst trying to serve the occasional cold query from the VM).

Diskstore would solve the problems of persistent storage and slow disk IO (as the SAVE wouldn't thrash the VM disk whilst Redis was trying to use the VM disk to serve cold queries) but unfortunately we can't wait for it. I truly am looking forward to it being released however and will look to it for other future projects =]

ichilton · on Feb 27, 2011

> I love Redis a great deal but I can't wait for the persistence issues to be worked out.

Please could you provide links detailing these issues further?

Thanks.

cagenut · on Feb 27, 2011

For extra context, here's a podcast where Jeremy describes the rest of the craigslist cluster architecture:

http://perseus.franklins.net/hanselminutes_0199.mp3

Thank you for taking the effort to share this stuff, I've learned so much from you.

apu · on Feb 27, 2011

Great post! I had a few questions:

1. Do the slaves write to disk? I assume that would incur no performance penalty, as I believe slave writes don't slow down the master's performance -- only the rate of replication.

2. Async operations seem like a great idea, and in fact I was toying around with implementing something like that myself. How far along is your implementation, and how does it seem to behave so far?

3. About your proposed 'automatic slave read' idea: The way I understand what you wrote, it seems like you would end up incurring the same cost on the master and the slave, i.e., you make the request on the server, and if it takes longer than X msecs, you then issue the request on the slave. Instead, is there any merit to having a redis command that returns how much time is likely going to be required for a given command? It wouldn't have to be very precise -- I imagine even the rough order of magnitude would be helpful for deciding whether to get the data from a given master or its slave?

jzawodn · on Feb 27, 2011

#1: currently no, but I'm about to change that and have them do a SAVE every few hours or so.

#2: my implementation is pretty far along in my testing but I do need to polish it up, update the docs, and test some edge cases

#3: the idea with the slave reads for stuff that is CPU expensive or could slow down other requests (doing large multi-set operations, for example)

slay2k · on Feb 27, 2011

No disrespect towards you, Jeremy, you're pretty awesome in my book (literally.. my High Perf MySQL book).

But with Craigslist being the most closed-minded and developer-hating organization I've ever come across, I don't particularly give a rat's ass what it's built on.

I wish it wasn't so, because I generally love posts like this, but if a dictator's employees start giving tours of the mansion, you certainly won't find me dazzled by the motion-sensor water fountains..

jzawodn · on Feb 27, 2011

Uh. What happened to evoke this response?

You can email jzawodn@craigslist if you don't want to say so in public, but we're not anti-developer. We are anti-abuse (which we see A LOT of) and anti-TOS violations.

So I'm curious to understand your position.

slay2k · on Feb 27, 2011

Thanks Jeremy. I'll choose to post here as I didn't intend for my original comment to come off hostile, so I'll try and clarify what I meant.

The comment stemmed from frustration at the fact that, for a very long time, CL has been killing all creative efforts to improve or build on it. Mature sites used by millions shut down after years of operation, despite loud user protests[1][2], and those killed in utero[3] have been the norm. I realize that it's easy to point to the TOS and call it a day, but the TOS is neither consistent[4] nor clear.

Friends of mine are running startups that get some of their data from CL, and try to stay under the radar because of fears and uncertainties. They aren't doing anything remotely shady. To them, CL's decision to kill a site comes on a whim, and nobody really knows what's considered okay WRT the TOS and what isn't. I mean, if Craig himself can't provide a definitive answer[5], then surely you can see how this could become frustrating for developers.

I'm not used to any kind of openness coming from CL on the dev front, and your post has been the first one I've seen in that category, so I want to apologize if you caught the brunt of my frustration.

[1] http://blog.claz.org/#post-94

[2] http://downloadsquad.switched.com/2007/06/08/jim-buckmeister...

[3] http://techcrunch.com/2009/12/01/craigslist-yahoo-pipes-flip...

[4] http://www.housingmaps.com/ -- one of several that has been up since 2005

[5] http://romy.posterous.com/dont-be-evil-craigslist

apu · on Feb 27, 2011

WTF is this? I can't decide if this is some personal beef, trolling, or something else.

lug123 · on Feb 27, 2011

Looks to me like a pretty heartfelt opinion on the state of affairs at CL, likely by a developer scorned. Wouldn't be the first time this happened.

icey · on Feb 27, 2011

Ugh, I accidentally upvoted this. At least it made me remember how to find the flag button on comments.