Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets

Question

Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets

Viewing 15 posts - 16 through 30 (of 99 total)

You must be logged in to reply to this topic. Login to reply

cs_troyk SSCertifiable Points: 5383 More actions · Answer 1

Excellent article, Jeff! I look forward to Thursday's installment.

I've played around with mixing these two representations of hierarchies before, but had never gotten past the performance issues, so it's great to have a new approach to try. It seems that I keep inheriting db implementations that contain at least one hierarchy in them, so having more choices for dealing with them is a very good thing!

For those that are willing to give up a little in terms of understandability (which can be dealt with via abstractions, anyway), and who are working with fairly constrained hierarchy sizes, take some time to investigate the matrix path encoding technique that Tropashko describes. You can think of it as a "best of both worlds" approach to the problem that doesn't have the heavy recalculation requirements. I was also able to obtain favorable performance results for many of the descendant- and ancestor-type queries. If nothing else, it will expand your db math skillset:-D

And finally, for those with some disk space to burn, you might consider materializing the transitive closure of the hierarchy. This can really net you some performance if you have a large hierarchy that is used primarily for reads (the materialization is a rather expensive process).

-TroyK

Igor Micev SSC-Dedicated Points: 33110 More actions · Answer 2

Hi Jeff,

Simply brilliant!

Thanks

IgorMi

Igor Micev,My blog: www.igormicev.com

Michael Meierruth SSChampion Points: 10051 More actions · Answer 3

Jeff,

In the code for generating the tally table, you use this syntax:

row_number() over (order by (select null))

I have never see this before. But it works. What is this '(select null)' trick all about?

PS

Very, very neat article!

jimbobmcgee Ten Centuries Points: 1108 More actions · Answer 4

Michael Meierruth (11/13/2012)
What is this '(select null)' trick all about?

It's a trick to 'order' by constant value, i.e. to not order/sort at all; in this case it adds a row number number based on the natural order that the processor hit the rows, rather than pre-sorting by a given column(set). You can't just do OVER(ORDER BY 1), because it interprets it as attempting to use the first column in the select, which OVER() won't let you do, but it does seem to let you get away with OVER(ORDER BY (SELECT 1)) [or, in this case, OVER(ORDER BY (SELECT NULL))].

I can't see in the BOL syntax tree where a select is possible (bar some vague reference to an 'expression') but it's been stable enough for me in the past. Maybe someone with a keener understanding of the nuances of the BOL can spot the bit that makes it a 'documented feature'.

In any case, I have often wondered if it might be equivalent to `OVER (ORDER BY the_clustered_index_col)` (at least for a plain table), or whether that would incur more reads. I have never bothered to try because OVER(ORDER BY (SELECT NULL)) always worked well enough for me.

SQLRNNR SSC Guru Points: 281344 More actions · Answer 5

Nice job Jeff.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

Cyclone Mr or Mrs. 500 Points: 590 More actions · Answer 6

Hi ...

This may be a silly question, but how does one use the dbo.Hierarchy table ?

Thanks.

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 7

CELKO (11/13/2012)
This was really good!

Thanks, Joe. I appreciate you stopping by especially with the code to convert Nested Sets to an Adjaceny List.

I would drop the needlessly proprietary code (ISNULL? We have COALESCE today, CAST not CONVERT, AS not = for aliases, etc).

I appreciate your thoughts on that but as soon as you use a variable (or just about anything else of particular good use), you lose portability no matter which database engine you're writing code for. Besides true portability being a myth, I've never believed in trying to make code truly portable because you can miss out on a lot of the horsepower of some of the extensions to SQL. For example, I used the proprietary ISNULL because it's a bit faster than COALESCE for several reasons. I also intentionally used the proprietary UPDATE to avoid doing multiple table scans during multiple updates to do what could be done in a single scan to create the Left and Right Bowers. Most other engines won't allow you to update a variable and a column at the same time and that's a real shame but I'm still going to use the technique in SQL Server because it can be used to easily avoid multiple and, therefor, unnecessary table scans.

I also use the alias = expression form of assignment for several reasons... it's the same form as the UPDATE statement in virtually any SQL engine and it makes the code a whole lot easier to read because the all-important column alias will always start at character 9 in my formatted code. No need to look for it at the ragged-right end of multiple length expressions to find the target.

To wit, even you write non-portable code. For example, the very first piece of code you offer in your fine book titled "Trees and Hierarchies in SQL for Smarties" (1st edition) has the following code on page 6.

DECLARE [font="Arial Black"]ARRAY [/font]GraphList

[font="Arial Black"]OF RECORD [/font][edge CHAR(1), in_node INTEGER, out_node INTEGER];

ANSI or not, that's not truly portable code because it doesn't port to SQL Server. ANSI means nothing when it comes to portability except maybe for simple CRUD procs (which are normally taken care of by various ORMs now-a-days) because no database engine is 100% ANSI compliant even on such simple things as how to declare a variable. I rue the day that they ever are because there would be no competition to improve. 😉 In fact, ANSI has adopted many things as “ANSI Standard” that were once proprietary extensions to various dialects of SQL.

Heh… and let’s see you port a trigger between Oracle and SQL Server with no changes. Now THAT’s some serious “fun”!

The same is true on the Oracle side of the fence (for example). If someone were to try something similar to this article in Oracle, I'd tell them not to bother because it's nearly as (or maybe even more so) efficient to use CONNECT BY for such things. If it turned out that I could get even more speed using something proprietary to Oracle, I'd use that instead because we're not playing around with 10 or even 10 thousand rows. When you work with millions of rows, every microsecond you can save per "function" per row adds up in a hurry.

Like I said, I appreciate your thoughts on the subject but will continue to use and advertise the power of the SQL extensions for whatever database engine I happen to be working in to get the best performance possible. Besides, I've only had to port code once since 1995 and it just isn't that difficult to do. It's just not worth crippling the performance of code by avoiding the proprietary extensions just on the very remote chance that you might have to port code someday.

BWAAA-HAAAA!!! Imagine it if someone told you you couldn't use the code from the previously cited page 6 because it wouldn't port to SQL Server. That would be a fun moment to watch! :w00t:

Got features? Use them! 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 8

cs_troyk (11/13/2012)
Excellent article, Jeff! I look forward to Thursday's installment.
I've played around with mixing these two representations of hierarchies before, but had never gotten past the performance issues, so it's great to have a new approach to try. It seems that I keep inheriting db implementations that contain at least one hierarchy in them, so having more choices for dealing with them is a very good thing!
For those that are willing to give up a little in terms of understandability (which can be dealt with via abstractions, anyway), and who are working with fairly constrained hierarchy sizes, take some time to investigate the matrix path encoding technique that Tropashko describes. You can think of it as a "best of both worlds" approach to the problem that doesn't have the heavy recalculation requirements. I was also able to obtain favorable performance results for many of the descendant- and ancestor-type queries. If nothing else, it will expand your db math skillset:-D
And finally, for those with some disk space to burn, you might consider materializing the transitive closure of the hierarchy. This can really net you some performance if you have a large hierarchy that is used primarily for reads (the materialization is a rather expensive process).
-TroyK

Thanks for the great feedback, Troy. I'll take a look at the link you so kindly provided.

Shifting gears, I hope Thursday's article will help solve some of the problems for large hierarchies that are read heavy.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 9

Theo Ekelmans (11/13/2012)
LOL, gimme thor class hammer!!!
The project i'm working on is a roof of concept SNMP MIB database, that is so fast it can do (near)realtime SNMP data packet to SQL data conversion
Just in case your new to the extensible SNMP MIB tree:
If E-mail addresses were OIDs... user@ordina.org would have been something like: user@ordina.enterprises.private.internet.dod.org.iso
or in non-translated notation: user@99999.1.4.1.6.3.1
except that we write the top-most part at the left: 1.3.6.1.4.1.99999.117.115.101.114
An OID is just a unique key (within one managed device) for one piece of information Ensures vendors don't have conflicting OIDs
OID corresponds to a label .1.3.6.1.2.1.1.5 => sysName (in windows this is the servername)
The complete (translated) path: .iso.org.dod.internet.mgmt.mib-2.system.sysName
The info needed to get from one to an other is "mib (database) files", that are supplied with each new device / software version.
The provided MIB files all have a part of the ginormous jigsaw MIB tree, and that needs to be searched between 500 to 20.000 times a minute.
Fun project indeed 🙂

Sounds VERY interesting and fun. Thank you very much for the overview on your project. Unfortunately and now that I see what you're doing, I don't believe this coming Thursday's article will help you much. If level based counts and sums are important, it might.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Michael Meierruth SSChampion Points: 10051 More actions · Answer 10

Cyclone (11/14/2012)
Hi ...
This may be a silly question, but how does one use the dbo.Hierarchy table ?
Thanks.

If you know absolutely nothing about HierarchyID this is a good place to start: http://www.sqlservercentral.com/articles/SQL+Server+2008/62204/

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 11

IgorMi (11/13/2012)
Hi Jeff,
Simply brilliant!
Thanks
IgorMi

Thanks, Igor. I really appreciate the compliment. :blush:

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 12

Michael Meierruth (11/13/2012)
Jeff,
In the code for generating the tally table, you use this syntax:
row_number() over (order by (select null))
I have never see this before. But it works. What is this '(select null)' trick all about?
PS
Very, very neat article!

Thanks, Michael. And if I haven't said it recently, thanks again for the help way back on the "reducing spaces" discussion from a couple of years ago.

jimbobmcgee nailed the explanation of the "SELECT null". The ORDER BY doesn't like naked constants even if they're not numeric. Let's hope that MS doesn't deprecate that because it'll tick off a whole lot of people that are using it in a whole lot of places.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 13

jimbobmcgee (11/13/2012)
Michael Meierruth (11/13/2012)
What is this '(select null)' trick all about?
It's a trick to 'order' by constant value, i.e. to not order/sort at all; in this case it adds a row number number based on the natural order that the processor hit the rows, rather than pre-sorting by a given column(set). You can't just do OVER(ORDER BY 1), because it interprets it as attempting to use the first column in the select, which OVER() won't let you do, but it does seem to let you get away with OVER(ORDER BY (SELECT 1)) [or, in this case, OVER(ORDER BY (SELECT NULL))].
I can't see in the BOL syntax tree where a select is possible (bar some vague reference to an 'expression') but it's been stable enough for me in the past. Maybe someone with a keener understanding of the nuances of the BOL can spot the bit that makes it a 'documented feature'.
In any case, I have often wondered if it might be equivalent to `OVER (ORDER BY the_clustered_index_col)` (at least for a plain table), or whether that would incur more reads. I have never bothered to try because OVER(ORDER BY (SELECT NULL)) always worked well enough for me.

Thank you for the fine cover on this question. I had a heck of a time at work today because of some problems with some 3rd party code. Didn't get a fix in until just a couple of hours ago.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Michael Meierruth SSChampion Points: 10051 More actions · Answer 14

jimbobmcgee (11/13/2012)
Michael Meierruth (11/13/2012)
What is this '(select null)' trick all about?
It's a trick to 'order' by constant value, i.e. to not order/sort at all; in this case it adds a row number number based on the natural order that the processor hit the rows, rather than pre-sorting by a given column(set). You can't just do OVER(ORDER BY 1), because it interprets it as attempting to use the first column in the select, which OVER() won't let you do, but it does seem to let you get away with OVER(ORDER BY (SELECT 1)) [or, in this case, OVER(ORDER BY (SELECT NULL))].
I can't see in the BOL syntax tree where a select is possible (bar some vague reference to an 'expression') but it's been stable enough for me in the past. Maybe someone with a keener understanding of the nuances of the BOL can spot the bit that makes it a 'documented feature'.
In any case, I have often wondered if it might be equivalent to `OVER (ORDER BY the_clustered_index_col)` (at least for a plain table), or whether that would incur more reads. I have never bothered to try because OVER(ORDER BY (SELECT NULL)) always worked well enough for me.

I'll keep this in my drawer of tricks.

In fact, in place of '(select 1)' you can use anything that's not a constant, e.g. @@servername, getdate(), etc. (but for some strange reason it doesn't like @@datefirst)

In fact a table containing a single int column which is also an identity column cannot be filled using 'insert into'.

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 15

Michael Meierruth (11/14/2012)
Cyclone (11/14/2012)
Hi ...
This may be a silly question, but how does one use the dbo.Hierarchy table ?
Thanks.
If you know absolutely nothing about HierarchyID this is a good place to start: http://www.sqlservercentral.com/articles/SQL+Server+2008/62204/

I believe the question was actually about how to use the table from the article... not the HierarchyID.

@Cyclone,

To answer the question that I think your asking, that's some of the "hundreds of chapters and thousands of articles" thing that I talked about in the article. The final dbo.Hierarchy table ends up having 3 different types of hierarical structures (Adjacency List, Materialized Path, and Nested Sets) in it. Each structure has it's own advantages that are quite different between the 3 structures. The article summarized that the Adjacenncy List is very easy to maintain and that the Nested Sets screams for reporting performance. Since the subjects are so large, I recommend you get a copy of Joe Celko's book that both he and I have already mentioned. Joe Celko has also posted a wad of information about the Nested Sets contain in the dbo.Hierarchy table and you should be able to find that pretty easily by doing a search for "Celko Nested Sets".

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)