
Does anyone find that they are able to employ data structures like this in their work, and if so, what do you work on? I've almost always had to delegate all my state to a database using default indexes, etc., which is productive, yet a little disappointing, because my brain power ends up going toward more mundane tasks instead.



When working on a high-volume web application I found data sketches (specifically the count-min sketch) very useful for finding and logging unique query patterns (check whether we've seen it before and how often, and log it if it's new). Using a sketch bounded how much memory we used and eliminated database trips. In theory a hashmap might have worked as well, but its size is unbounded given too many unique queries, while all we needed was an estimate.
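For illustration, a minimal count-min sketch in pure Python along the lines of that check-then-log flow might look like the sketch below; the width/depth values and the query string are made up, and production code would use a tuned library implementation.

    import hashlib

    class CountMinSketch:
        def __init__(self, width=2048, depth=4):
            self.width = width
            self.depth = depth
            self.rows = [[0] * width for _ in range(depth)]

        def _columns(self, item):
            # One column index per row, derived from seeded hashes.
            for seed in range(self.depth):
                digest = hashlib.sha256(seed.to_bytes(4, "little") + item.encode()).digest()
                yield int.from_bytes(digest[:8], "little") % self.width

        def add(self, item):
            for row, col in enumerate(self._columns(item)):
                self.rows[row][col] += 1

        def estimate(self, item):
            # Take the minimum across rows; the estimate can only over-count.
            return min(self.rows[row][col] for row, col in enumerate(self._columns(item)))

    sketch = CountMinSketch()
    query_pattern = "SELECT * FROM users WHERE id = ?"   # hypothetical normalized query
    if sketch.estimate(query_pattern) == 0:
        print("new pattern, log it")
    sketch.add(query_pattern)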

Probabilistic data structures like Bloom filters and sketches are really useful for gathering statistics on data sets that are too large to manage, or where an extra database trip would hurt performance. So something like internal application diagnostics/debugging/logging is a great low-risk place to use these sorts of algorithms, even when it makes sense to keep all business logic in the database.
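In the same spirit, a toy Bloom filter is only a few lines; the sizes and hash choices below are illustrative, and the guarantee is the usual one: no false negatives, occasional false positives.

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=1 << 16, num_hashes=4):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)

        def _positions(self, item):
            for seed in range(self.num_hashes):
                digest = hashlib.sha256(seed.to_bytes(4, "little") + item.encode()).digest()
                yield int.from_bytes(digest[:8], "little") % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            # False means definitely never added; True means probably added.
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    bf = BloomFilter()
    bf.add("GET /api/users")
    print(bf.might_contain("GET /api/users"), bf.might_contain("GET /api/orders"))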


There's a general class of problems to be solved with data structures, which is that you have a collection of records of data, and you're trying to find [1] a record, or collection of records, given some criteria. In essence, you have a database of some kind, so it should not surprise you to learn that databases (and things that are databases but not normally thought of as such--filesystems, for example) rely very heavily on these more "exotic" data structures internally. Indeed, a database is basically a heavily optimized data structure, usually tuned for the bursty nature of disk access [2]. And given that implementing these data structures tends to be a rich source of bugs, if you can offload the work to a database implementation, you probably should.

I will also point out that when you view the problem of data structures as an implementation that solves various queries, you can start building tools that automate the implementation of complex data structures given the queries you want to ask the data structure. I fully expect to see these tools start to become available over the next decade or so, given the current progress of program synthesis I see in the academic literature.

[1] Or insert, delete, or update. But usually you already have the record at that point, and you just want to update the data structure's representation of that record.

[2] You can apply the same techniques to cache access and even vector register usage for purely in-memory data structures. Same problem, just with different sizes of disparity.


Stratos Idreos's team has been working on exactly this: automatic design of data structures based on workload.

Their work has been on HN before: https://news.ycombinator.com/item?id=18314555


I was specifically thinking of https://homes.cs.washington.edu/~mernst/pubs/collection-synt..., but mostly because that's the paper I actually read.


>I fully expect to see these tools start to become available over the next decade or so, given the current progress of program synthesis I see in the academic literature.

Interesting. I did know about program verification efforts (trying to prove that programs are correct), but had never heard of program synthesis (assuming it is somewhat different from code generation). Do you know of any examples of the kinds of areas it can be applied to?


The best overview of the topic I know of is https://www.microsoft.com/en-us/research/wp-content/uploads/..., which describes the state of the field ~18 months ago.

The major field where it's already being applied is in automating the conversion of data--building the program that can take "John Doe" and turn it into "<name>John</name> <surname>Doe</surname>" just by example. You can also do similar generalization to automate things like constructing highly repetitive graphics (e.g., draw a few teeth of a gear and the program does the rest). Superoptimization is the other main prong, where you really combine the synthesis and the verification to get correct faster code.


Those applications sound cool. Will check out that paper. Thanks.


I do plasma simulations and recently had the problem of finding the distance to the nearest neighbor for each of the particles in the simulation. Doing that naively is O(n^2) and took hours even for small test problems. Building an R-tree once and using it for nearest-neighbor look-ups brought that down to 5 minutes.

libspatialindex lacks documentation, but worked really nicely. The rtree interface in Python is much friendlier.
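A sketch of that lookup with the Python rtree package; the random 2D points below stand in for the real 3D particle positions (a 3D index needs dimension=3 set on an index.Property).

    import random
    from rtree import index

    # Random 2D stand-ins for particle positions (the real problem is 3D and larger).
    particles = [(random.random(), random.random()) for _ in range(10_000)]

    idx = index.Index()
    for i, (x, y) in enumerate(particles):
        idx.insert(i, (x, y, x, y))          # a point is a degenerate bounding box

    # Nearest-neighbor distance per particle: ask for 2 hits, since the closest
    # hit is the particle itself.
    nn_dist = []
    for i, (x, y) in enumerate(particles):
        j = next(k for k in idx.nearest((x, y, x, y), 2) if k != i)
        nx, ny = particles[j]
        nn_dist.append(((x - nx) ** 2 + (y - ny) ** 2) ** 0.5)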


I’m taking Advanced Data Structures at UCSD right now and our first assignment was making a K-D tree and an efficient KNN classifier. It was surprisingly simple, and the efficiency difference between the K-D tree and the brute-force implementation was quite drastic.

If you only build the tree once and do no insertions, what is the benefit of an R-tree vs. a K-D tree?
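For the build-once case, an off-the-shelf k-d tree does the all-points nearest-neighbor query in a few lines with SciPy (a sketch with made-up points, not the course's own implementation):

    import numpy as np
    from scipy.spatial import cKDTree

    points = np.random.rand(10_000, 2)       # made-up 2D points
    tree = cKDTree(points)

    # k=2 because the nearest hit for each query point is the point itself.
    dists, idxs = tree.query(points, k=2)
    nearest_neighbor_dist = dists[:, 1]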


I actually do plan to update the tree as I insert additional particles in locations where the distance to the nearest neighbor is large.


There is a library called FLANN that does approximate nearest neighbors using a k-d tree and a best-bin-first method. It is typically used not just for ordinary nearest-neighbor search, but for finding neighbors in higher dimensions.


Plasma simulation sounds really damn cool.


It's actually usually pretty hot. </pun>


I'm curious, how many particles do your simulations typically contain?


Typical full sized simulation runs contain between 100 million (1e8) and 100 billion (1e11) particles. Small tests while debugging might be as small as a few thousand particles (1e4). And tests of the code might only use one or two particles.

The thing I was working on when I switched from naive O(n^2) to R-trees had half a million (5e5) particles.


Gamedev, but even then actually implementing the data structures is rare, and nothing outside of the geometric/parallel data structures listed towards the end.

I've had to use a concurrent priority queue once (locking a regular priority queue was too slow, dropped in a concurrent implementation), implemented kd trees a few times, and I've used half-edge once (a cousin of quadedges). Mainly doing graphics work.

Often the naive data structures work better because there's less indirection and it's easier to vectorize. In my case, depending on how much GPU budget you have available, you can also just scrap the CPU implementation and throw it in a compute shader.


I think the reality is that >80% of SWEs will never need any of this, because most SWEs are high-level integrators and consumers of these algorithms via libraries. For a lot of people, their time is better spent learning how to architect and use available tools to quickly solve problems while writing bug-free code.

But then life changes and you may find yourself in need of this knowledge. I used to laugh at graph algorithm questions in interviews because I never directly used a graph in 20 years of SWE. Then I got a job where I am on a team maintaining a graph-based API. Joke's on me, now those 'silly' algorithms are very relevant.


That means the graph questions were only really relevant for the position where you maintain a graph-based API. They were completely irrelevant for the other positions.

My guess is, even after working with graph algorithms for awhile, you still needed more than 20-30 minutes to make implementation changes and still needed to have a reference to look at while doing so.

I know for a fact graphs are quite useful. I also know that in most cases complex algorithmic questions during an interview are not useful. The fact that you obtained the position, adapted, and can still manage graph structures shows that those questions didn't gauge your fundamental abilities. Maybe you didn't hit the ground running on day one, but after a little learning/practice you were fine under reasonable delivery conditions.


I managed to use DAWGs (Directed Acyclic Word Graphs) in a search engine, and later HyperLogLog+ and a Bloom Filter for a scalable distributed Multi-Armed Bandit implementation.

Otherwise I find that the built-in Clojure data structures are so good for all-round use that it's difficult to justify the time to use anything else, except in extreme cases.


>Otherwise I find that the built-in Clojure data structures are so good for all-round use that it's difficult to justify the time to use anything else, except in extreme cases.

I don't know Clojure, so I can't compare its data structures with those of Python, but I've read a few times that Python's built-in data structures are so useful that for many applications it suffices to just use them as is, or in combinations (e.g. lists of dicts, dicts of lists, ...), instead of defining your own custom ones using classes. More so when used in combination with other Python features like {list,dict,set} comprehensions and libraries like itertools.

And I've found that to be somewhat true in my own Python work.
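A small, made-up illustration of that point: grouping and summarizing records with nothing but built-ins, comprehensions, and itertools, no custom classes.

    from collections import defaultdict
    from itertools import groupby
    from operator import itemgetter

    rows = [
        {"user": "alice", "ms": 120},
        {"user": "bob", "ms": 340},
        {"user": "alice", "ms": 95},
    ]

    # dict of lists, built by hand
    by_user = defaultdict(list)
    for row in rows:
        by_user[row["user"]].append(row["ms"])

    # the same grouping via itertools.groupby (input must be sorted by the key)
    rows.sort(key=itemgetter("user"))
    grouped = {user: [r["ms"] for r in grp]
               for user, grp in groupby(rows, key=itemgetter("user"))}

    # dict comprehension for a summary
    avg_ms = {user: sum(ms) / len(ms) for user, ms in by_user.items()}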


I recently had to store large numbers of geospatial points (lat/lng pairs with data) in RAM for querying. I used a basic quad tree, which is similar to R-trees, but much simpler. Getting a subset of the points based on a bounding box is very fast and simple.
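A minimal point quadtree along those lines fits on a page; the capacity, coordinate bounds, and the usage at the bottom are illustrative, not the actual code.

    class QuadTree:
        """Minimal point quadtree: insert (x, y, payload), query by bounding box."""

        def __init__(self, x0, y0, x1, y1, capacity=16):
            self.bounds = (x0, y0, x1, y1)
            self.capacity = capacity
            self.points = []      # [(x, y, payload), ...]
            self.children = None  # four sub-trees once this node splits

        def insert(self, x, y, payload=None):
            x0, y0, x1, y1 = self.bounds
            if not (x0 <= x <= x1 and y0 <= y <= y1):
                return False                      # outside this node
            if self.children is None:
                if len(self.points) < self.capacity:
                    self.points.append((x, y, payload))
                    return True
                self._split()
            return any(child.insert(x, y, payload) for child in self.children)

        def _split(self):
            x0, y0, x1, y1 = self.bounds
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            self.children = [
                QuadTree(x0, y0, mx, my, self.capacity),
                QuadTree(mx, y0, x1, my, self.capacity),
                QuadTree(x0, my, mx, y1, self.capacity),
                QuadTree(mx, my, x1, y1, self.capacity),
            ]
            for px, py, payload in self.points:   # push existing points down one level
                any(child.insert(px, py, payload) for child in self.children)
            self.points = []

        def query(self, qx0, qy0, qx1, qy1):
            x0, y0, x1, y1 = self.bounds
            if qx1 < x0 or qx0 > x1 or qy1 < y0 or qy0 > y1:
                return                            # query box misses this node
            for px, py, payload in self.points:
                if qx0 <= px <= qx1 and qy0 <= py <= qy1:
                    yield px, py, payload
            if self.children:
                for child in self.children:
                    yield from child.query(qx0, qy0, qx1, qy1)

    # Usage with made-up lng/lat points (treated as plain x/y here).
    qt = QuadTree(-180, -90, 180, 90)
    qt.insert(-122.33, 47.61, {"name": "seattle"})
    qt.insert(2.35, 48.86, {"name": "paris"})
    hits = list(qt.query(-130, 40, -110, 50))     # everything in a Pacific-NW box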


You've answered your own question - the person writing the database you're using, for example.


Computing is so cheap that most people can delegate to a database. At some point, though, if money is a problem and you don't have the computing resources, you can squeeze more out of that hardware by using more specialized DBs (graph, doc, column), something like Redis, etc. But at some point those might not map very nicely to your problem, in which case custom data structures will get you far. Hence the reason FAANGs are big on algorithm & DS questions during interviews.


I'm a Bioinformatics student, so it's not for work, but I have been looking into Suffix Arrays, BWTs and FM indexes for my project.

I've seen them applied in some real-world next-gen genomics tools; for example, https://github.com/vgteam/odgi implements a multi-string BWT. Technically they're in academia and not making money AFAIK, but it's interesting stuff.
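As a toy illustration of the suffix-array part (not the BWT/FM-index that odgi uses): build the array by sorting suffixes, then answer substring queries with binary search. Real tools use linear-time construction and compressed indexes; the key argument to bisect needs Python 3.10+.

    from bisect import bisect_left, bisect_right

    text = "banana$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # O(n^2 log n), toy only

    def occurrences(pattern):
        # Suffixes that start with `pattern` form one contiguous block in sa.
        prefix = lambda i: text[i:i + len(pattern)]
        lo = bisect_left(sa, pattern, key=prefix)
        hi = bisect_right(sa, pattern, key=prefix)
        return sorted(sa[lo:hi])

    print(occurrences("ana"))   # [1, 3]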


Yes, but not often!

Recently I used nested R-trees for in-memory geospatial querying for a game (millions of atomic updates a second, which would be a pain to manage with sharding as items move around the world).

For example, as a tree gets too big I split it at the hot spot and run the job processing for each tree on separate threads.

A few years ago I implemented a simple tree for a survey product I built. The pathing in the survey is stored in the tree, and I use the tree to find loops (which are invalid) in the survey.
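That kind of loop check is a short depth-first search; the adjacency mapping below (question -> possible next questions) is made up for illustration.

    def has_cycle(graph):
        WHITE, GREY, BLACK = 0, 1, 2            # unvisited / on current path / finished
        color = {}

        def visit(node):
            color[node] = GREY
            for nxt in graph.get(node, ()):
                state = color.get(nxt, WHITE)
                if state == GREY:               # back-edge: we found a loop
                    return True
                if state == WHITE and visit(nxt):
                    return True
            color[node] = BLACK
            return False

        return any(color.get(node, WHITE) == WHITE and visit(node) for node in graph)

    survey = {"q1": ["q2"], "q2": ["q3"], "q3": ["q1"]}   # invalid: q1 -> q2 -> q3 -> q1
    print(has_cycle(survey))   # True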


Pro multimedia software here. We deal with fundamentally concurrent architectures with a variety of not-so-soft realtime requirements, so thread safety, lock-/wait-free guarantees, and cache performance are my key concerns.

Most off-the-shelf data structures are poorly suited for that environment so I need to roll my own or modify existing structures. I'm no data structure expert, so this kind of stuff is really helpful.


Though my use of stranger data structures is relatively rare in my line of work, I enjoy learning them because they keep revealing new ways to approach problems. It's less "I use CRDTs" and more "CRDTs have changed what I think is feasible", similarly for stuff like bloom filters (no-false-negatives can be very powerful) and locality-sensitive hashing (don't search for all X every time, pre-compute a simpler version so you only check promising items). Concurrent data structures have also been pretty fruitful, since they're also often "how can we guarantee X in our distributed system, and what primitives do we need" strategies / solutions.

(obviously some of this is "I learned a tangential thing while learning about X", but that's kinda the point. you may never use a rope, but you might be introduced to a new idea in the process of learning about it)


It's hard to beat an R-Tree for doing in-memory geospatial queries.


I work in speech recognition; from this list we use lots of advanced automata algorithms, Bloomier filters, interesting types of hashing, and fancy priority queues.


It's not as interesting as some of the structures in the PDF, but we are working on a problem where we have lots of timestamped data and need to quickly and frequently access the data point nearest to a given input timestamp. We ended up using a tree-like structure, though in practice it was just whatever tree-like structure was available in the language.
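When the timestamps can be kept sorted, the standard-library bisect module gets that nearest-point lookup down to a binary search (a sketch with made-up values, not necessarily the structure used above).

    from bisect import bisect_left

    timestamps = [1000, 1010, 1025, 1100, 1180]     # sorted, illustrative values

    def nearest(ts):
        i = bisect_left(timestamps, ts)
        candidates = timestamps[max(i - 1, 0):i + 1] # the neighbor on each side
        return min(candidates, key=lambda t: abs(t - ts))

    print(nearest(1050))   # 1025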


I work on a mapping team. We implemented our own R-tree for spatial search, as existing libraries had some unoptimized hot paths for our use case.


FM-index is super useful for large string collections.



