Managing Hierarchical Data in MySQL (or really any database)

znbailey · on July 17, 2009

The author fails to mention another very common approach that does not suffer (as badly) from the nested set approach. This strategy is called "materialized path".

In the materialized path strategy, each node/row has its full path in the tree stored in a column and that path is "materialized" or "derived" from a depth-first traversal to that node. In conjunction with the adjacency list approach this could yield much better results in most situations than a nested set model.

Taking the tree used in the article, the paths would look like (using '/' for path delimiter but you can use whatever you like):

  electronics
  electronics/televisions
  electronics/televisions/tube
  electronics/televisions/lcd
  electronics/televisions/plasma
  electronics/portable electronics
  electronics/portable electronics/mp3 players
  electronics/portable electronics/mp3 players/flash
  electronics/portable electronics/cd players
  electronics/portable electronics/2 way radios

Alternatively it would make more sense to base the materialized path off of an immutable field such as the PKey of the record (in the same order as the paths above, from the article):

  1
  1/2
  1/2/4
  1/2/5
  1/2/6
  1/3
  1/3/7
  1/3/7/10
  1/3/8
  1/3/9

This has the following advantages:

  * Node insertion/deletion overhead is low / does not require updating other nodes unlike nested set model
  * Query for subnodes is easy: "select * from tree where path like 'electronics/televisions'". To get only direct subnodes add a parent_id specifier.
  * Getting the full path for a node is easy: it's built into the node itself
  * Node depth is easy: count the number of path "segments"

Downsides:

  * Moving sub-tree requires updating all sub-node paths

Hope this helps someone!

Cheers, -Zach

NyxWulf · on July 17, 2009

Materialized path is a good tool for trees that won't be deeply nested, and for tables that won't grow particularly large. However, because the path is stored in a string, read operations are not particularly efficient. It's a good tool to know, but it doesn't scale as well as Nested Set to very large datasets. That being set, it's much less complicated to implement than Nested Sets and is more efficient than adjacency list for many types of data.

crux_ · on July 17, 2009

I'm not so sure about inefficiency; in particular, most databases that I've seen can use an index for string-prefix queries, which means its performance ought to remain acceptable even for large datasets. (Assumption: subtree queries are the ones you care about making fast.)

Also: inserts are practically free compared to the linked article, where they require updating the left and right numbering for every following node!

Another nice property: if you make entries in the materialized tree column constant-width (e.g. by zero-padding), an alphabetical sort by that column will give you a depth-first dump of the tree -- the exact order you'd like for, say, a comment thread or a table of contents.

When I've implemented materialized paths in the past, I have run into issues with the maximum allowed length of indexable string types (which limits the tree depth), but this was in the long ago late 90s. :) I think it's a very nice albeit imperfect way of storing certain types of trees, especially ones that are mostly insert+query-only.

snprbob86 · on July 17, 2009

How are reads not particularly efficient? If you use rooted paths in your lookup, then even a "LIKE '/a/b/c%'" query for all decedents will effectively utilize the index. I think that this would be good for deeply nested trees also. As Zach implied, the down side of this approach is moving subtrees. Unless you have a very volatile tree structure, this should be perfect for most uses.

NyxWulf · on July 17, 2009

Because B-Tree indexes perform orders of magnitude better on smaller lookup values, like say an integer, than they do on large (and even worse variable length) strings. There are a number of factors that contribute to this, but two big ones are the raw computation time it takes to compare two strings is much larger than the comparison of an integer that matches the register size of the machine. Second, the depth of the B-Tree is dependent on the key size used for the lookup.

As I said above, if you are using the materialized path for the type of problem it's best at solving, the speed differences won't matter so much. But that's primarily because the tree's aren't particularly deeply nested and/or that tree table itself isn't overly large. So in essence you are trading computational complexity for ease of use on smaller sets of data. In many cases that's exactly what's needed. On the other hand, if you will be modeling very large trees, or will have a huge number of them, nested sets are more efficient in terms of encoding and storage, as well as lookups and retrievals. The down side is that nested sets are more complex to setup and work with, and make understanding the structure of your trees more difficult.

IMHO, it's important not to fall into the trap that one technology/tool/solution/data structure will solve all of your problems. It's good to know the pro's and con's of different solutions and which problems they are most efficient at solving.

znbailey · on July 17, 2009

Great point. I meant to mention this in my list of cons but it slipped my mind.

johnrob · on July 17, 2009

I'm not the only person who uses this! I wish I knew this had a name when I proposed it; everybody I was working with thought it was strange. I think it's a good mix of performance and query-ability: you can do quite a bit here using LIKE matches. Of course, the main assumptions are that the dataset isn't huge, and the hierarchy isn't changing often.

turtle4 · on July 17, 2009

Helpful, yes. Thank you.

TweedHeads · on July 17, 2009

Interesting.

For short trees and in-memory management another option would be:

  electronics
  /televisions
  //tube
  //lcd
  //plasma
  /portable electronics
  //mp3 players
  ///flash
  //cd players
  //2 way radios

Where depth levels are represented by slashes only.

Replace with indexes and you have a shorter version.

Let's see what else we can come up with just for fun...

jerf · on July 17, 2009

You seem to be assuming a record order; SQL doesn't really have that. (You can hack stuff together but you're still building on an essentially orderless foundation.) That's a fine text document format, though tabs or spaces would be more traditional in that role.

TweedHeads · on July 17, 2009

record: n int, val varchar2

Manage and sort in-memory and store already ordered when done (no matter if it is physically unordered).

That's why I said, for small trees and in-memory management.

rjurney · on July 17, 2009

Do yourself a HUGE favor, and just use L-Tree for PostgreSQL instead: http://www.postgresql.org/docs/current/static/ltree.html

tom_b · on July 17, 2009

Knowing more than one approach for dealing with hierarchies in relational dbs is a good idea. I've used both nested sets and proprietary approaches (things like CONNECT BY in Oracle).

Thanks for the link to the PostgreSQL ltree, I wasn't aware of it.

Also the article mentions Joe Celko's books, which I recommend to everyone who wants to build up their SQL skills.

rjurney · on July 17, 2009

I agree. I implemented a pre-calculated nested tree, as in the article, based on an existing PHP implementation but in Perl. In benchmarking it, I found that it was faster than I thought to update the entire, large tree - but that the L-Tree was going to be a much better solution overall. L-Tree has that low-level C, Russian hacker edge going for it :) But your mileage may vary.

The nice thing about the pre-calc stuff is that it is very approachable, and you can build it yourself. I found doing so very educational and encourage anyone who is curious to play around.

As someone else mentioned - we also looked at rolling our own materialized path trees - and did an implementation of this as well, but in the end L-Tree 'just works' out of the box, and is wicked fast compared to a higher level implementation... so we went with it. Its great. I'm not a Postgres fanatic or anything - I like MySQL and Postgres, and lots of NoSQL databases too, but Postgres is great at this in particular.

dabeeeenster · on July 17, 2009

This is quite a neat solution, but it has a few detracting points, most notably mass updates when you want to modify the tree.

It also seems slightly irrelevant; surely it's better just to read the structure into memory when the application starts and just work on the data structure in memory?

derobert · on July 17, 2009

Reading the entire structure into memory brings all kinds of problems with it, such as knowing when to throw out your in-memory copy because the database changed. It is also problematic if the tree is large.

Reading the entire tree into memory can be effective if it never changes and is reasonably sized. But if that's the case, why are you storing it in a database? Why not just a data file?

rjurney · on July 17, 2009

In practice... in many applications you can cache parts of the tree in a session, because it won't change in a single login. If its not too big.

jessep · on July 17, 2009

Here's a comparison of two django apps implementing various ways of doing hierarchies: http://www.qompr.com/charts/63%3Bdjango-hierarchical-tree-da...

nradov · on July 18, 2009

What a mess. IBM DB2, Oracle, and MS SQL Server all support native storage of hierachical data (as XML). MySQL needs to catch up.

leej · on July 18, 2009

good article http://www.sitepoint.com/article/hierarchical-data-database/

smithjchris · on July 17, 2009

I've written an implementation of nested sets for at least 3 DB engines over the years (SQL Server, MySQL and PostgreSQL).

Some of that article is plainly dangerous and I wouldn't recommend following it without some extensive research elsewhere. SQL for Smarties is the usual reference.

Firstly they seem to insist on MyISAM which is just crap. I don't need to cite any references on that.

Secondarily, they use LOCK TABLE whilst modifying the data. That basically blocks other writes yet doesn't give any transactional capabilities. Due to the denormalised nature of the nested set implementation, you NEED transactions to ensure operations are ACID compliant or you risk tree corruption.

Thirdly, they do not discuss in detail the performance constraints of nested sets i.e. it is purely a read-optimised. When nested sets start growing in node count, the modification operations do not scale as a significant chunk of the tree needs to be renumbered inside a transaction.

NyxWulf · on July 17, 2009

I've done the same, and I agree with your post entirely. There are a few tricks to making nested sets perform reasonably well on large data sets.

I've tested this model on trees that contain up to 100 million items within the scope of a single tree, and I have a pretty good feel for how well it would perform beyond that.

- The first and most important consideration is what defines the scope of a tree. For example, if you are building a discussion board, and using nested sets (I've done this) you would want each post/reply set to be a separately scoped tree. That decision alone vastly reduces much of the write contention.

- Next, if the tree is going to be heavily used, store the tree in a separate table with each row fixed width and the primary key of the item it refers to. Databases are much more efficient at reads and writes when rows are fixed width and do not contain nullable or variable length data.

- Additionally it's important to maintain the data for the adjacency list within your structure, so that if you ever have errors or corruptions with your tree, you can rebuild it correctly.

- Next if you need to insert some type of sub-tree into a larger tree. You can compute the sub-tree and then insert it into the larger tree in a single operation, instead of doing multiple inserts that lock the entire tree. This is a relatively difficult optimization and something I don't typically build until it becomes necessary.

- Spend some time working through the math of how Nested Sets work. It's relatively simple, and really understanding it will help you implement efficient bulk operations on your trees.

- It's important to get a write lock on the tree before you get your left and right values for the tree. If you get your left and write values before you get exclusive access to the data you are working with, someone else could have done an insert while you were building your tree, and when you do your insert, you will have corrupted the entire tree.

- Finally if you are interested in this topic, I cannot recommend Joe Celko's books highly enough. SQL For Smarties has a thorough discussion of the different models. Additionally he has a separate book that only deals with Hierarchical data models, and he delves further into the various issues and implementation tricks with these data structures.

smithjchris · on July 17, 2009

Some excellent suggestions there. Many thanks :-)