Roaring bitmaps: what they are and how they work (vikramoberoi.com)
226 points by voberoi on Sept 3, 2022 | 60 comments



I see this common theme among very fast practical algorithms (like timsort or pdqsort) where there is no secret math or novel algorithm; rather, they are just a bunch of special cases based on heuristics. Often they involve specific knowledge of the hardware instead of treating software as abstract. To me this is the big difference between “computer science” and “computer engineering”.


Much of this is just threshold effects for algorithm performance, not special cases per se.

You don't even need to have specific knowledge of the hardware as long as you can identify cross-over points where one algorithm starts significantly outperforming others. An old HPC trick is to write software that thoroughly measures several algorithm strategies on your specific hardware environment and then code-gens an algorithm that has an optimal set of strategies and strategy-switching thresholds. The meta-algorithm is mostly focused on cheaply detecting these thresholds at runtime.
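A toy version of that idea, assuming two hypothetical strategies (insertion sort vs. libc qsort) and a single noisy run per size; a real autotuner would repeat the measurements, de-jitter them, and then code-gen a dispatcher with the observed crossover baked in:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Two hypothetical strategies for the same job: insertion sort vs. qsort. */
    static void insertion_sort(uint32_t *a, size_t n) {
        for (size_t i = 1; i < n; i++) {
            uint32_t v = a[i];
            size_t j = i;
            while (j > 0 && a[j - 1] > v) { a[j] = a[j - 1]; j--; }
            a[j] = v;
        }
    }

    static int cmp_u32(const void *x, const void *y) {
        uint32_t a = *(const uint32_t *)x, b = *(const uint32_t *)y;
        return (a > b) - (a < b);
    }

    static void qsort_strategy(uint32_t *a, size_t n) { qsort(a, n, sizeof *a, cmp_u32); }

    /* One timed run of a strategy on a fresh copy of the input. */
    static double time_one(void (*strategy)(uint32_t *, size_t),
                           uint32_t *buf, const uint32_t *src, size_t n) {
        struct timespec t0, t1;
        memcpy(buf, src, n * sizeof *buf);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        strategy(buf, n);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main(void) {
        srand(42);
        for (size_t n = 8; n <= 16384; n *= 2) {
            uint32_t *src = malloc(n * sizeof *src), *buf = malloc(n * sizeof *buf);
            for (size_t i = 0; i < n; i++) src[i] = (uint32_t)rand();
            double ti = time_one(insertion_sort, buf, src, n);
            double tq = time_one(qsort_strategy, buf, src, n);
            printf("n=%6zu  insertion=%9.2fus  qsort=%9.2fus  winner=%s\n",
                   n, ti * 1e6, tq * 1e6, ti <= tq ? "insertion" : "qsort");
            free(src);
            free(buf);
        }
        return 0;
    }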


For a lot of such algorithms there are also pretty cheap ways to remove jitter from the measurements (in case you're operating on top of a general-purpose OS or something). For many algorithms you expect that, in the absence of external factors, the runtime is convex and monotonic as a multivariate function of the input sizes, and that external confounders don't speed things up. Under that assumption you can measure the runtime for many different input sizes and de-jitter each data point by pushing down that convexity constraint, even if for each input size you only gather a single noisy data point.
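A sketch of the one-dimensional case (one noisy measurement per size), leaving out the multivariate and monotonicity parts: the greatest convex function lying at or below the points is their lower convex hull, interpolated back onto the measured sizes.

    #include <stddef.h>
    #include <stdlib.h>

    /* De-jitter runtimes y[] measured at strictly increasing sizes x[]:
       out[] is the greatest convex minorant of the points, evaluated at x[]. */
    static void convex_minorant(const double *x, const double *y, size_t n, double *out) {
        size_t *hull = malloc(n * sizeof *hull);
        size_t h = 0;
        if (!hull) return;
        for (size_t i = 0; i < n; i++) {               /* lower hull, monotone chain */
            while (h >= 2) {
                size_t a = hull[h - 2], b = hull[h - 1];
                double cross = (x[b] - x[a]) * (y[i] - y[a])
                             - (y[b] - y[a]) * (x[i] - x[a]);
                if (cross <= 0) h--; else break;       /* drop b: it lies above a--i */
            }
            hull[h++] = i;
        }
        size_t seg = 0;                                /* interpolate hull back onto x[] */
        for (size_t i = 0; i < n; i++) {
            while (seg + 1 < h && hull[seg + 1] < i) seg++;
            size_t a = hull[seg];
            size_t b = (seg + 1 < h) ? hull[seg + 1] : a;
            out[i] = (b == a) ? y[a]
                   : y[a] + (y[b] - y[a]) * (x[i] - x[a]) / (x[b] - x[a]);
        }
        free(hull);
    }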

Where possible, I always liked arrays of function pointers for each next power of 2 in the input sizes. A few popcounts and a jump later, and you're executing the right routine. It's not great if you're throwing a general-purpose tool against tiny problems, but it's not a huge amount of overhead either, and it's dead simple to code.
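A minimal sketch of that dispatch pattern, with hypothetical size-specialized kernels and a clz-based bucket index (clz rather than popcount, but the same idea: a couple of bit ops and an indirect jump):

    #include <stddef.h>
    #include <stdint.h>

    typedef void (*kernel_fn)(uint32_t *data, size_t n);

    /* Hypothetical routines specialized for different size ranges. */
    static void kernel_small(uint32_t *d, size_t n)  { (void)d; (void)n; /* ... */ }
    static void kernel_medium(uint32_t *d, size_t n) { (void)d; (void)n; /* ... */ }
    static void kernel_large(uint32_t *d, size_t n)  { (void)d; (void)n; /* ... */ }

    /* One slot per power of two: table[k] handles 2^(k-1) < n <= 2^k. */
    static kernel_fn table[33];

    static void init_table(void) {
        for (int k = 0; k <= 32; k++)
            table[k] = (k <= 10) ? kernel_small      /* n <= 1024  */
                     : (k <= 20) ? kernel_medium     /* n <= ~1M   */
                     :             kernel_large;
    }

    /* ceil(log2(n)) for n >= 1, via GCC/Clang's count-leading-zeros builtin. */
    static int bucket(uint32_t n) {
        return n <= 1 ? 0 : 32 - __builtin_clz(n - 1);
    }

    static void run(uint32_t *data, uint32_t n) {
        table[bucket(n)](data, n);
    }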


Can you give an example of this convex assumption push down? It doesn’t seem right to me. For instance, due to cache line sizes and other blocking/buffering effects throughout hardware and the OS, I actually would expect “staircase” functions which aren’t convex.

Maybe that’s convex if you smooth enough. But depends on the buffer sizes.


In 2000 I came across a binary space partitioning algo that did this, but it used a small Scheme interpreter to codegen the C++ BSP code. Would love to find out the library name... I have now forgotten it.


I had the same takeaway when I read these papers. It felt like a game of whack-a-mole.

There's a massive performance benefit to doing this at the cost of implementation complexity. I haven't studied the implementations or tried my hand at one, but I get the impression that these are tough to implement correctly in a way that takes full advantage of the hardware.

(In that sense, it's awesome that the researchers also did the legwork to implement and maintain a library!)


>I see this common theme among very fast practical algorithms (like timsort or pdqsort) where there is no secret math or novel algorithm; rather, they are just a bunch of special cases based on heuristics

The core here (bitmap storage and the basic optimization) is simple, but solid and general. Mathematically it takes less space to store data as positions on a bitmap rather than as fully spelled-out associations, and it takes even less space to segregate the bitmaps into chunks.

This will hold, and be smaller and faster, on any computer. So it's not a special case based on heuristics, either related to the specific frequencies or sample characteristics of some particular set of data, or to specific CPU peculiarities or whatever.
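To put rough numbers on the space claim (my own illustration, assuming 1,000 random 32-bit values and Roaring-style 2^16-sized chunks):

    as plain 32-bit integers:               1,000 * 4 B  =  4,000 B
    as one flat bitmap over [0, 2^32):      2^32 bits    =  512 MB, however few values are set
    chunked by upper 16 bits, low 16 bits
    kept as sorted arrays per chunk:        ~1,000 * 2 B + per-chunk overhead  =  a few KB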

More exotic optimizations piled on top, sure. But "compressed bitmaps" themselves as a concept, not.


No, this is still comp sci. Comp sci doesn’t mean that practical considerations in algorithm design are ignored. Computer engineering has to do with the actual design and construction of computing machinery. It’s an accredited engineering field and the undergraduate coursework includes electronics, semiconductors, computer architecture, and generally a lot of overlap with electrical engineering. I had to take one dumb programming course - all the data structures and algorithms stuff I had to learn elsewhere.


timsort and pdqsort seem to be significantly slower than radix sort for random inputs, but presumably radix sort could also benefit from a pile of pattern-finding heuristics.


timsort and pdqsort (two comparison sort methods) and radix sort (a non-comparative sorting algorithm) have different albeit overlapping domains of applicability.

Given a large number of 32-bit integers, radix sort is indeed significantly faster.

While I would reach for a comparison sort method if I had a large number of arbitrary-length Unicode strings, which I wanted to sort in a case-ignoring order.

Also, I found timsort faster than radix sort when there was a small number (as I recall, <100 or so) of elements.


> timsort and pdqsort (two comparison sort methods) and radix sort (a non-comparative sorting algorithm) have different albeit overlapping domains of applicability.

Most comparators are of the form "compare by this, then if tied, compare by that, then if tied, compare by the other thing" which is pretty well suited to radix sort. You are correct though.

> While I would reach for a comparison sort method if I had a large number of arbitrary-length Unicode strings, which I wanted to sort in a case-ignoring order.

It seems like the radix sort is likely to be a lot faster for this too, mainly due to cache effects. If you have a dataset in mind I'll be happy to give it a shot.

> Also, I found timsort faster than radix sort when there was a small number (as I recall, <100 or so) of elements.

For sure. 100 isn't so far from the threshold where a radix sort should fall back to something else anyway.


> which is pretty well suited to radix sort

Yes, the general approach is to convert the input data into a fixed-length bit-string with the same sort order as the input.
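For fixed-width numeric types that conversion is the classic bit trick; a sketch (my own, not from any particular library):

    #include <stdint.h>
    #include <string.h>

    /* Signed 32-bit int -> unsigned key with the same order:
       flipping the sign bit moves negative values below positive ones. */
    static inline uint32_t int32_key(int32_t x) {
        return (uint32_t)x ^ 0x80000000u;
    }

    /* IEEE-754 float -> order-preserving unsigned key: flip every bit of a
       negative value, flip only the sign bit of a non-negative one.
       (NaNs end up at the extremes rather than being handled specially.) */
    static inline uint32_t float_key(float f) {
        uint32_t u;
        memcpy(&u, &f, sizeof u);
        return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
    }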

Your example assumes that the key-construction overhead is low. If tie-breaking is rare, and breaking the tie requires an expensive operation, then the trade-off point for radix might be much higher than 100 elements.

The fixed-length requirement works well for small items with relatively equal-length fields. Ragged items, like Wikipedia titles, cause a problem. There is one title which is 253 bytes long. Now, Wikipedia titles are limited to 255 bytes, so radix is certainly directly applicable, but 1) it changes the trade-off point, and 2) it reduces the cache benefits.

Finally, it requires a sort-order-preserving transformation. I mentioned case-insensitive collation of Unicode strings as a well-known difficult problem. I do not believe there is a mapping to an order-preserving representation which can be done bitwise. At the very least, it will be difficult to support all of the collation styles that currently exist (e.g., French collation is different from Dutch).


This is basically all incorrect. Maybe you are writing about LSB radix sort? I am writing about MSB radix sort. You can get some off-the-shelf MSB radix sort for sorting strings of arbitrary lengths here[0]. While this repository is mainly about parallel string sorts, I've found a version of this MSB radix sort that does not perform a bunch of unnecessary copies and caches more bytes at a time[1] to be competitive with the parallel sorts in the repo.

In general MSB radix sort will have to look at the same parts of the input elements as multi-key quick sort, but one hopes that it gets to make fewer passes over the array. A comparator-based sort would look at about the same parts of the input elements as well, but it would look at them many times more than necessary.

> Finally, it requires a sort-order-preserving transformation. I mentioned case-insensitive collation of Unicode strings as a well-known difficult problem. I do not believe there is a mapping to an order-preserving representation which can be done bitwise. At the very least, it will be difficult to support all of the collation styles that currently exist (e.g., French collation is different from Dutch).

The requirements are a bit underspecified, but I think these can be solved by unicode normalization + tolower + codepoint-wise comparison, which is probably what you'd do in your comparator for a comparison-based sort as well.

[0]: https://github.com/bingmann/parallel-string-sorting/blob/mas...

[1]: https://github.com/dendibakh/perf-challenge6/blob/Solution_R...


> for sorting strings of arbitrary lengths here

Ahh, I see. It only loads CACHED_BYTES at a time. This could be used to delay the tie-breaking loads until needed, with some sort of on-the-fly encoding, and potentially skip those loads completely.

I did not consider that. Thanks!

You are right - my experience is with LSB radix sorts, most specifically the one in NumPy. I see it references https://github.com/eloj/radix-sorting#-key-derivation which says "an LSB implemention is not well suited to sort character strings of any significant length. For those, a MSB radix sort is better" and in turn links to an implementation which includes benchmarks https://github.com/eloj/radix-sorting#cpp-benchmark . It in turn cites Bingmann.

Another thanks for pointing out a missing part of my understanding.

The radix approach still requires a mapping to a byte representation which is order-preserving.

> but I think these can be solved by unicode normalization + tolower + codepoint-wise comparison, which is probably what you'd do in your comparator for a comparison-based sort as well

tolower() doesn't work for all collations. Quoting https://unicode.org/reports/tr10/ "Some dictionaries and authors collate uppercase before lowercase while others use the reverse ... Sometimes the case ordering is mandated by the government, as in Denmark."

codepoint-wise comparison doesn't work for all collations. Quoting the same source, "There are additional complications in certain languages, where the comparison is context sensitive and depends on more than just single characters compared directly against one another" as well as "In some French dictionary ordering traditions, however, it is the last accent difference that determines the order".


This is an excellent write-up on Roaring bitmaps. The concrete examples really help to illustrate the algorithm well.

While Roaring bitmaps save space with good performance, my gut feeling is that there are better ways to achieve the same goals, especially since it involves sorting on insertion and binary search on lookup, both on the sorted top-level list and on the sorted lower-bit integers.

Also, it works well on 32-bit integers and probably 64-bit integers, but it's not clear it can scale well to bigger integers.

Nevertheless it's an excellent algorithm that works well in practice.


Not .bmp bitmaps used to store photographic images, as I would have guessed:

> Bitmaps are arrays of bits used to store sets of integers. They work by setting the Nth bit when an integer N is in the set
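In code, the uncompressed version of that idea is just this (a minimal sketch):

    #include <stdint.h>

    #define UNIVERSE 1000000u  /* exclusive upper bound on the stored integers */

    static uint64_t bits[(UNIVERSE + 63) / 64];

    static void add(uint32_t n)      { bits[n >> 6] |=  (UINT64_C(1) << (n & 63)); }
    static void del(uint32_t n)      { bits[n >> 6] &= ~(UINT64_C(1) << (n & 63)); }
    static int  contains(uint32_t n) { return (bits[n >> 6] >> (n & 63)) & 1; }

The problem the article gets into is what happens when the universe is all of 2^32 and only a relative handful of bits are ever set.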


I was confused until I clicked on the provided link in the article to see they’re talking about bit maps. For me that space has always been critical.


> When a set is sparse, traditional bitmaps compress poorly.

They waste space. But if you were to compress them, they'd compress very well. As in, for example, if you were to compress them using roaring bitmaps.


If you compressed them using some naive scheme, they would be smaller but they would lose their fast set-wise operations like lookup and union.


Maybe the author means they compress poorly under generic, common compression algorithms. This is an exotic compression.


Yeah pretty much. General purpose compression algorithms do a good job but pale in comparison to what Roaring achieves.

I wouldn't use the word exotic but rather specialized and/or optimized.

In fact, later versions of Roaring use run-length-encoding (RLE) containers in addition to array and bitmap containers, which is a technique shared with more traditional/general compression algorithms.
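A run container is conceptually just this transform on a sorted list of 16-bit values (my own sketch; the format spec stores the run length minus one, if I'm reading it right, and the exact on-disk details live there):

    #include <stddef.h>
    #include <stdint.h>

    struct run { uint16_t start; uint16_t length; };  /* length = run size - 1 */

    /* Collapse a sorted, duplicate-free array of 16-bit values into runs. */
    static size_t encode_runs(const uint16_t *vals, size_t n, struct run *out) {
        size_t r = 0;
        for (size_t i = 0; i < n; ) {
            size_t j = i + 1;
            while (j < n && vals[j] == (uint16_t)(vals[j - 1] + 1)) j++;
            out[r].start = vals[i];
            out[r].length = (uint16_t)(j - i - 1);
            r++;
            i = j;
        }
        return r;
    }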

In effect rather than achieving "very good" compression, Roaring is very close to "optimal" compression for sparse bitmaps.


I thought the trick was going to be indirection.

sorted-list/bitmap/runs, in a two level tree. cool.

it's technically missing a sorted complement-list, i.e. only the items that are missing, although worst case runs only use twice as much space, so more complexity without much savings, esp. considering real workloads

performs better with sequential ids than random uuids because you use fewer pages

further research

investigating the effect of adding more layers to the tree

using additional set representations e.g. balanced tree


Lucene's adaptation of Roaring uses the complement idea on a block-wise basis:

https://github.com/apache/lucene/blob/84cae4f27cfd3feb3bb42d...


Those who do not know judy1 are doomed to reinvent it.

http://judy.sourceforge.net/


which possibly is a good thing

the Judy project seems inactive to me:

https://sourceforge.net/p/judy/bugs/

this short HN thread resonates with the intuition of judy1 being stalled; its last optimistic comment points to an orphaned comment in the big list above

https://news.ycombinator.com/item?id=32188204


It's certainly not active but perhaps could be revived now that the patents have expired (filed in 2001 so expired last year).

On the surface it's just a very optimized 256-ary radix trie. I think it would take some software archeology to determine if there was more to it than that and if its assumptions still hold on today's processors.


Does it need to be active? It's done and it works.

It's not written in a highly unstable language. The safety of every bridge you drive over was probably partially validated using Fortran numerical code from the 1980s.


https://github.com/Reference-LAPACK/lapack/commits/master

said fortran code happily gets minor and major updates


I wasn't actually referring to LAPACK but other FEM code. :)


Point being: a project needs updates to remain useful in contemporary contexts. Code that happily built with -Wall, pedantic, error... in April 2002 may fail miserably in today's toolchain.

plus it may not benefit from contemporary advances in silicon architecture, etc.


Judy arrays have never been particularly compelling, but for some reason gained huge mindshare among programmers on Slashdot.

Today it'd make much more sense to implement ART, which is dramatically simpler.


As someone with a degree in contemporary arts who switched to programming I humbly disagree with this statement.

Jokes aside, it took me quite a bit of trial and error to figure out that you are (probably) referring to Adaptive Radix Trees. Because "art tree" doesn't exactly give useful results.



Not applicable to this subject. :) A comparison between judy1 and roaring would be interesting.


Bitmaps are amazing and we believe that they will replace all other data formats for analytics.

Today we open sourced about $25m in R&D (more to come)! They take a bit of harnessing, but once you do the results are amazing. We have released an OLAP database called FeatureBase, built entirely on bitmaps (well bitmaps in a b-tree)...

Here is a bit (pun intended) more about our big bet on bitmaps: https://www.featurebase.com/blog/bitmaps-making-real-time-an...


Can someone explain how this is laid out in memory and what it would look like serialized? It's easy to have an array of size N that fits all bits and just write it to disk. You can also mmap a file as a large array of bytes, mutate it, and let the OS handle the disk syncs, if you're okay with losing some recent state changes if the machine crashes.

What would an array of pointers to X containers, each of size N, look like on disk?


The containers are not in a contiguous memory block. To serialize it, you could write out each container as a length-prefixed block tagged with its container id, which is the upper 16 bits shared by the container's values. The top level sorted list can be ignored, as it can be reconstructed.

To de-serialize, read each container's length prefix and container id, then read the rest of the container using that length. The top level list is sorted by container id and can be reconstructed by inserting each container id, along with a pointer to the in-memory container, into the sorted list.


Just to add a bit: the top level list can also be appended to the end of the file as a directory of (container id, file offset) pairs. That way the directory at the end of the file can be read into memory to form a mapping of container id to file offset, which provides random access to the container blocks in the file.
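A sketch of the layout described in this subthread (my own illustration of the idea, not the official RoaringFormatSpec layout, which differs):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* A container as described above: 16-bit id (the shared upper bits)
       plus an opaque payload whose meaning depends on its type. */
    struct container {
        uint16_t id;
        uint32_t payload_len;
        const uint8_t *payload;
    };

    /* Write each container as a length-prefixed block, then append a
       directory of (id, offset) pairs and its entry count at the end.
       Error handling of fwrite is omitted to keep the sketch short. */
    static int serialize(FILE *f, const struct container *cs, uint32_t n) {
        uint64_t *offsets = malloc(n * sizeof *offsets);
        if (!offsets) return -1;
        for (uint32_t i = 0; i < n; i++) {
            offsets[i] = (uint64_t)ftell(f);
            fwrite(&cs[i].id, sizeof cs[i].id, 1, f);
            fwrite(&cs[i].payload_len, sizeof cs[i].payload_len, 1, f);
            fwrite(cs[i].payload, 1, cs[i].payload_len, f);
        }
        for (uint32_t i = 0; i < n; i++) {          /* trailing directory */
            fwrite(&cs[i].id, sizeof cs[i].id, 1, f);
            fwrite(&offsets[i], sizeof offsets[i], 1, f);
        }
        fwrite(&n, sizeof n, 1, f);                 /* directory entry count */
        free(offsets);
        return 0;
    }

A reader seeks to the end, reads the entry count and the directory, and can then jump straight to any container by id.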


This article has a section titled "how Roaring bitmaps are represented in memory". I suggest you actually read the article; it will likely answer all your questions.


I was actually asking how they were serialized and stored on disk.


The on-disk format is documented in the spec: https://github.com/RoaringBitmap/RoaringFormatSpec

It's very simple actually. I worked on adding run container support, which involved adding support for the newer serialization format in roaring-rs, and it proved quite elegant.


I enjoyed this walkthrough. I'm interested in RB because it's used in Lucene but never dug in and assumed (without much thought) that it had to do with consecutive runs of 1s and 0s. Wrong!


Compressing consecutive 1's and 0's was the first idea that popped into my head as well. Then when they started to talk about inserting into the 800,000th position my brain went "What if they just reversed the array and stored some metadata as the type of array they're dealing with?"

It's funny that in the end Roaring bitmaps are essentially just a modified data structure and process.

The way things are going, I imagine someone will at some point take all the different tricks we have for inserting, sorting, finding, and deleting data, take millions of datasets, and run some machine-learning-type process to create libraries that can perform these operations optimally.


I was thinking a similar thought: that 4096 seemed quite magic (although it seems to be chosen based on research) and that RB perf could probably be tuned based on the workload.


I don't think 4096 is arbitrary. It's an array of 16-bit integers, and 4096 * 16 = 65536 bits, exactly the size of the full bitmap. So 4096 represents the boundary point below which an array of integers uses less memory than a traditional bitmap.


Oops, thanks for pointing that out


It looks a bit too simplistic: blocks with up to 4k items are stored as sorted lists. Having to insert into a sorted list of up to 4k items seems very expensive to me.


CPUs have a strong performance bias toward sequential memory access and there are large threshold effects at work here. The block size used is not arbitrary. Improvements in prefetching and cache line utilization can have such large performance benefits that it more than justifies any apparent increase in computational cost because of how the code is organized to obtain those improvements.

Most developers do not have a good intuition for the efficiency and scalability of sequential brute-force on modern architectures. There is a cross-over threshold but I think many people would be surprised at how high it is in practice.


I would have expected 2048 items to be the optimal cutover point, because with 16-bit entries that would result in 4KB arrays. Those fit exactly into a typical CPU memory page, whereas 8KB requires two. In the worst case that might double the page table overheads, e.g. for random point reads.

I wonder if it would be worth the trouble to code-gen a bunch of variants with things like 8-bit entries, and benchmark to death to determine the optimal cutover points...


I think if you go down this road, you'll start to see differences depending on the brand and model of CPU you use. Essentially, you're going into the territory of FFTW's "planner" scheme.


The threshold was chosen empirically. Modern processors are very fast when operating on memory sequentially, as the prefetcher achieves the best case scenario. The cost of a cache line miss is so substantial this ends up dominating things until "n" is in the 1000s. A lot of programmers' intuition is off the mark on this point.


My intuition was indeed trained a long time ago. In terms of memory use, the current choice is likely optimal. But I have a hard time believing that doing an OR operation on two sets of sorted numbers will be as fast as the same operation on two 8 KB bitmaps. The latter can be trivially SIMD-accelerated, and could be done in place.

Having said that, I applaud that this type of datastructure is being investigated. An efficient bitset should be present in every collections library.


The full arrays do get expensive, although not too bad. I work at FeatureBase and we have a whole analytics DB built on a roaring variant... for perf reasons it's usually worth it to bias toward the bitmap representation when you get past about 2k set bits, though it does take a bit more space.


The 4k items are only 512 bytes though, so expensive yes but not overly so.


No. The person you are replying to is talking about the sparse leaf case where the contained integers are actually being stored (to be more precise the low bits are being stored) in a sorted structure. The switch to the bitmap occurs when the sparse structure takes the same number of bits to directly encode the sparsely contained items. In this case they use a 2^16 bitmap and encode sparse elements using 16-bits, so it occurs at ((2^16 / 16) == 2^12) elements.


They are two bytes per item; it's the bitmaps (>4k entries) that use 1 bit per item.


I love Roaring bitmaps and maybe even overuse this library. On the other hand, it made a lot of things possible for legacy.uniprot.org over the years that would have been unaffordable otherwise. Quick faceting using permanent filters over 500+ million documents would have been really hard without it.


I’m really curious about this. Are you able to share a bit about your use case and how you use roaring bitmaps outright (vs. through search infra like Solr or ElasticSearch)?


Well, the replacement of this code base uses Solr. But when legacy started we were on Lucene 1.x, which over the years was upgraded to Lucene 9.x. So one of the things we used our bitmaps for was a number of extensions to pure Lucene. So like facets, but also inter-index joins. And then preserving results for fast faceting that remained part of the query.

E.g. a thing that legacy can do but the new www (Solr) currently can't is allow downloads of more than 5 million documents in a streaming fashion. Maintaining a Roaring bitmap costs very little memory but depends on a docid-to-key/value-store mapping. Our extensions allowed this and gave us a very easy-to-use REST API.


I'm reading this on a runway with terrible internet and the figures slowly load to reveal... text :/



