Hierarchies in SQL, Part II, the Sequel

Question

Hierarchies in SQL, Part II, the Sequel

GSquared

SSC Guru

Points: 260824
More actions
August 20, 2012 at 9:55 pm

#260317

Comments posted to this topic are about the item Hierarchies in SQL, Part II, the Sequel
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply

Shawn Dube SSC Journeyman Points: 98 More actions · Answer 1

Thanks for this article and the original! One question though... have you tried a very deep hierarchy on any of your methods? Like a million node binary tree (not necessarily balanced)? Something where one "leg" of the tree is a few thousand nodes deep?

Just wondering what limitations may creep up.

Greatly appreciate your work!

Michael Meierruth SSChampion Points: 10051 More actions · Answer 2

Shawn,

In the real business world, I have encountered hierarchies around 15 nodes deep. They're fun. Where in the real world (business or not) have you encountered a hierarchy "a few thousand nodes deep" that needs to be dealt with using SQL? I'm very curious.

Shawn Dube SSC Journeyman Points: 98 More actions · Answer 3

Hey Michael,

I get that question a lot! 🙂

It definitely is not a common need but one of the things my company supports are compensation plans for direct marketing and multilevel marketing companies. The specific plan I am referring to is called a Binary plan ( http://en.wikipedia.org/wiki/Binary_plan ).

In these plans, it is common for one leg to grow much faster than the others and I did have one with 1042 nodes in a single leg. rest of tree was approx 400,000 nodes

I imagine a Bill Of Materials for a space shuttle might get large too 😉 (though probably not 1000 parts deep)

I am hoping to have a single solution that works nicely over our common needs (few thousand nodes mostly balanced) as well as these "exception" cases to allow our codebase to be common across clients. This isn't a hard-fast need, would just be nice to not have to switch out a client as they grow past possible thresholds.

Dwain Camps SSC Guru Points: 86978 More actions · Answer 4

Great article Gus!

I haven't had a chance to play around with HiearchyIDs myself yet but I must put it on my agenda to try it out. And your examples are a great place to start an analysis.

I'm also interested to see if there are any variations to the hiearchy traversal methods of adjacency lists that might yield better performance, although I doubt I'm smart enough to figure them out. 🙂

I did have one recent case that I was able to improve though. So it may be possible.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Michael Meierruth SSChampion Points: 10051 More actions · Answer 5

Shawn,

I read up on this MLM stuff.

Very wicked indeed (gets even compared to Ponzi schemes).

I find it difficult to see how a sales person at node level 1042 appears to a sales person at node level 1 especially when it comes to the concept of node level N getting a commission for sales done at node level N+1.

I can see the essence of a bill of materials that makes it necessary to be organized hierarchically. But even in a space shuttle I can't see past 50 levels.

Interesting stuff though...

GSquared SSC Guru Points: 260824 More actions · Answer 6

Shawn Dube (8/20/2012)
Thanks for this article and the original! One question though... have you tried a very deep hierarchy on any of your methods? Like a million node binary tree (not necessarily balanced)? Something where one "leg" of the tree is a few thousand nodes deep?
Just wondering what limitations may creep up.
Greatly appreciate your work!

The deepest real-world hierarchies I'm used to seeing are certain multi-level-marketing companies. Those can get pretty deep.

Adjacency queries turn into nightmares on that kind of data. After all, they have to do a scan for every level, and deep hierarchies are (by their very nature) large tables.

Nested Sets are still nearly instantaneous to query. If it's a lot of rows of data, returning the dataset to the client and rendering it will be your performance issue on those. No matter how deep a Nested Set hierarchy is, querying one is simply a single scan. If you haven't indexed the necessary columns, that might be a problem, but that's not likely to come up since the hierarchy data is likely to be the clustered index on a hierarchy table.

Hierarchy ID works like nested sets. On deep queries, it's still very fast. If you look into the way they get stored and indexed, which is as a binary string, that makes sense. After all, 0x1 and all it's children nodes, 0x102, 0x1F3, and so on, are all in the same range. It's like looking up all the names that start with "A" in the phone book.

Updates are the same as in the examples in the article. Adjacency, you just update one row and you're done. Should be so fast, even on a million-rows deep, that you can't measure the time taken. Nested Sets, you rebuild the table, and that takes lots of time. Hierarchy ID, you have to update every child, so you're doing a million-row update, and that won't be fast and might take a fair chunk of disk space in both tempdb and your log file. It'll be massively faster than a Nested Sets update, and a lot easier to do, but it will be slow compared to Adjacency.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

GSquared SSC Guru Points: 260824 More actions · Answer 7

Shawn Dube (8/20/2012)
Hey Michael,
I get that question a lot! 🙂
It definitely is not a common need but one of the things my company supports are compensation plans for direct marketing and multilevel marketing companies. The specific plan I am referring to is called a Binary plan ( http://en.wikipedia.org/wiki/Binary_plan ).
In these plans, it is common for one leg to grow much faster than the others and I did have one with 1042 nodes in a single leg. rest of tree was approx 400,000 nodes
I imagine a Bill Of Materials for a space shuttle might get large too 😉 (though probably not 1000 parts deep)
I am hoping to have a single solution that works nicely over our common needs (few thousand nodes mostly balanced) as well as these "exception" cases to allow our codebase to be common across clients. This isn't a hard-fast need, would just be nice to not have to switch out a client as they grow past possible thresholds.

In the MLM schemes I'm accustomed to reading about (I've not worked for one myself, so can't speak from direct experience), additions and queries are much more common than updates.

In other words, Bob may recruit a bunch more salespeople under him, but the ones already under him aren't likely to suddenly need to be moved under Sam instead. If Joe works under Bob, you're not likely to need to update Joe's position in the hierarchy, and that of everyone beneath him, or at least not frequently.

Does that match your usual business need?

If so, then Hierarchy ID is probably your best bet. Easier to add to than Nested Sets, faster to query than Adjacency, by large margins in both cases.

If you are using Adjacency, then recursion limits might do interesting things to you data/performance as well, but querying Hierarchy ID data isn't recursive. So 1,000+ levels deep is still just a single index range-scan, instead of a scan for each level.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

HLogic SSC-Addicted Points: 465 More actions · Answer 8

For hierarchies of < 50 levels depth, Nested Intervals (Hazel) may be of interest.

GSquared SSC Guru Points: 260824 More actions · Answer 9

HLogic (8/21/2012)
For hierarchies of < 50 levels depth, Nested Intervals (Hazel) may be of interest.

I consider those just a flavor of Nested Sets. It has the one advantage that it doesn't require expanding the range at the top level when you insert at lower levels, while trading that off for complexity of implementation, difficulty in inserting nodes between other nodes at lower levels (requires recomputation of all sub-nodes that are shifted to the right; both version have this issue), difficulty for human data-verification (requires fairly complex math to visually validate the tree composition), and problems with rounding errors towards the bottom of the hierarchy if depth is greater than a certain level (depending on the data type used, this can happen shallower or deeper in the hierarchy).

It retains the query speed of Nested Sets, since it's a single index scan (usually a range scan rather than a full scan) to resolve the most common hierarchy queries. Per my tests (admittedly limited), the HierarchyID datatype has query speed that's in the same range, can handle deeper hierarchies (it can only go 892 bytes deep, but that's a lot of hex values), and is faster and easier to edit.

For hierarchies too deep for HierarchyID, I'd probably go breadcrumb/path, if query speed matters more than storage size, or Nested Sets if the hierarchy is relatively static.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

GSquared SSC Guru Points: 260824 More actions · Answer 10

To test maximum depth for HierarchyID, I just ran this test (same table definition as in the article):

SET NOCOUNT ON;

GO

TRUNCATE TABLE dbo.HierarchyTest;

GO

INSERT INTO dbo.HierarchyTest

(NodeName,

ParentID,

RangeStart,

RangeEnd,

HID

)

SELECT

'Person' + RIGHT('0000' + CAST(Number AS VARCHAR(4)), 4),

NULL, NULL, NULL, NULL

FROM dbo.Numbers; -- 10,001 rows of data

GO

DECLARE Cur CURSOR

FOR

SELECT ID

FROM dbo.HierarchyTest

ORDER BY ID FOR UPDATE;

OPEN Cur;

DECLARE @ID INT,

@PID INT,

@HID VARCHAR(MAX);

FETCH NEXT FROM Cur INTO @ID;

BEGIN TRY;

WHILE @@fetch_status = 0

BEGIN

SET @HID = COALESCE(@HID + CAST(@ID AS VARCHAR(MAX)) + '/', '/1/');

UPDATE dbo.HierarchyTest

SET ParentID = @PID,

HID = CAST(@HID AS HIERARCHYID)

WHERE CURRENT OF Cur;

SET @PID = @ID;

FETCH NEXT FROM Cur INTO @ID;

END;

CLOSE Cur;

DEALLOCATE Cur;

END TRY

BEGIN CATCH;

PRINT @ID;

END CATCH;

It'll fail when it reaches the greatest depth it can manage. Got successfully to 427 in this test.

If the node values were compressed, instead of ascending base-10 numbers, you could obviously squeeze more depth out of it, but I don't have time to try that right now. Convert the base-10 to base-36, for example, and you'd get a lot more depth, since it's dependent on the length of the string that's being converted. I haven't tried that yet, so it could just crash and burn. Probably worth a test, though.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Robert.Sterbal SSCrazy Points: 2855 More actions · Answer 11

Is anyone familiar with the application of HeirarchyID to genetic sequencing?

GSquared SSC Guru Points: 260824 More actions · Answer 12

Robert.Sterbal (8/27/2012)
Is anyone familiar with the application of HeirarchyID to genetic sequencing?

I have to admit I don't know enough about genetic sequencing to even be aware there's a relationship between the two things. HierarchyID could easily be used for a family tree (to a certain depth), but is genetic sequencing a hierarchical data structure? If so, it probably goes past the depth HierarchyID can readily do, but that's just a WAG on my part.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Robert.Sterbal SSCrazy Points: 2855 More actions · Answer 13

It seems like gene sequences often go in the thousands in lengths, and any change could be a mutation. 90% of all are genes are um... not used in encoding.

The other part of this is that the genes (think of them as developer code) then spawn protein building (think of that as running code in production)

GSquared SSC Guru Points: 260824 More actions · Answer 14

Robert.Sterbal (8/27/2012)
It seems like gene sequences often go in the thousands in lengths, and any change could be a mutation. 90% of all are genes are um... not used in encoding.
The other part of this is that the genes (think of them as developer code) then spawn protein building (think of that as running code in production)

I have a basic understanding of DNA/RNA, expression, and protein synthesis. Know the difference between myosis and mitosis (though I'm not actually sure I spelled either one correctly, I know what they are). But that's not enough to have any advice on coding and database options with regard to them.

I'd probably use a breadcrumb hierarchy of one sort or another, just based on the idea of what-spawns-what. HierarchyID is essentially a breadcrumb, but encoded for use by some specific methods. If its depth limitations will work for the kind of data you're talking about, then it should be fine. But sequencing thousands of items, assumed to be AT/GC bonds, would mean a hierarchy thousands of levels deep, and the limits on the SQL Server HierarchyID datatype wouldn't allow that. Doesn't pair-1 of human chromosomes have hundreds of millions of pairs, all by itself?

But, again, I'm not even sure that mapping AT/GC sequences is what "sequencing" means. I assume so, and that it would take 1 hierachy node per bond-pair, and those are assumptions I'm not qualified to make.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon