Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets

  • Excellent article, Jeff! I look forward to Thursday's installment.

    I've played around with mixing these two representations of hierarchies before, but had never gotten past the performance issues, so it's great to have a new approach to try. It seems that I keep inheriting db implementations that contain at least one hierarchy in them, so having more choices for dealing with them is a very good thing!

    For those that are willing to give up a little in terms of understandability (which can be dealt with via abstractions, anyway), and who are working with fairly constrained hierarchy sizes, take some time to investigate the matrix path encoding technique that Tropashko describes. You can think of it as a "best of both worlds" approach to the problem that doesn't have the heavy recalculation requirements. I was also able to obtain favorable performance results for many of the descendant- and ancestor-type queries. If nothing else, it will expand your db math skillset:-D

    And finally, for those with some disk space to burn, you might consider materializing the transitive closure of the hierarchy. This can really net you some performance if you have a large hierarchy that is used primarily for reads (the materialization is a rather expensive process).

    -TroyK

  • Hi Jeff,

    Simply brilliant!

    Thanks

    IgorMi

    Igor Micev,My blog: www.igormicev.com

  • Jeff,

    In the code for generating the tally table, you use this syntax:

    row_number() over (order by (select null))

    I have never see this before. But it works. What is this '(select null)' trick all about?

    PS

    Very, very neat article!

  • Michael Meierruth (11/13/2012)


    What is this '(select null)' trick all about?

    It's a trick to 'order' by constant value, i.e. to not order/sort at all; in this case it adds a row number number based on the natural order that the processor hit the rows, rather than pre-sorting by a given column(set). You can't just do OVER(ORDER BY 1), because it interprets it as attempting to use the first column in the select, which OVER() won't let you do, but it does seem to let you get away with OVER(ORDER BY (SELECT 1)) [or, in this case, OVER(ORDER BY (SELECT NULL))].

    I can't see in the BOL syntax tree where a select is possible (bar some vague reference to an 'expression') but it's been stable enough for me in the past. Maybe someone with a keener understanding of the nuances of the BOL can spot the bit that makes it a 'documented feature'.

    In any case, I have often wondered if it might be equivalent to `OVER (ORDER BY the_clustered_index_col)` (at least for a plain table), or whether that would incur more reads. I have never bothered to try because OVER(ORDER BY (SELECT NULL)) always worked well enough for me.

  • Nice job Jeff.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Hi ...

    This may be a silly question, but how does one use the dbo.Hierarchy table ?

    Thanks.

  • CELKO (11/13/2012)


    This was really good!

    Thanks, Joe. I appreciate you stopping by especially with the code to convert Nested Sets to an Adjaceny List.

    I would drop the needlessly proprietary code (ISNULL? We have COALESCE today, CAST not CONVERT, AS not = for aliases, etc).

    I appreciate your thoughts on that but as soon as you use a variable (or just about anything else of particular good use), you lose portability no matter which database engine you're writing code for. Besides true portability being a myth, I've never believed in trying to make code truly portable because you can miss out on a lot of the horsepower of some of the extensions to SQL. For example, I used the proprietary ISNULL because it's a bit faster than COALESCE for several reasons. I also intentionally used the proprietary UPDATE to avoid doing multiple table scans during multiple updates to do what could be done in a single scan to create the Left and Right Bowers. Most other engines won't allow you to update a variable and a column at the same time and that's a real shame but I'm still going to use the technique in SQL Server because it can be used to easily avoid multiple and, therefor, unnecessary table scans.

    I also use the alias = expression form of assignment for several reasons... it's the same form as the UPDATE statement in virtually any SQL engine and it makes the code a whole lot easier to read because the all-important column alias will always start at character 9 in my formatted code. No need to look for it at the ragged-right end of multiple length expressions to find the target.

    To wit, even you write non-portable code. For example, the very first piece of code you offer in your fine book titled "Trees and Hierarchies in SQL for Smarties" (1st edition) has the following code on page 6.

    DECLARE [font="Arial Black"]ARRAY [/font]GraphList

    [font="Arial Black"]OF RECORD [/font][edge CHAR(1), in_node INTEGER, out_node INTEGER];

    ANSI or not, that's not truly portable code because it doesn't port to SQL Server. ANSI means nothing when it comes to portability except maybe for simple CRUD procs (which are normally taken care of by various ORMs now-a-days) because no database engine is 100% ANSI compliant even on such simple things as how to declare a variable. I rue the day that they ever are because there would be no competition to improve. 😉 In fact, ANSI has adopted many things as “ANSI Standard” that were once proprietary extensions to various dialects of SQL.

    Heh… and let’s see you port a trigger between Oracle and SQL Server with no changes. Now THAT’s some serious “fun”!

    The same is true on the Oracle side of the fence (for example). If someone were to try something similar to this article in Oracle, I'd tell them not to bother because it's nearly as (or maybe even more so) efficient to use CONNECT BY for such things. If it turned out that I could get even more speed using something proprietary to Oracle, I'd use that instead because we're not playing around with 10 or even 10 thousand rows. When you work with millions of rows, every microsecond you can save per "function" per row adds up in a hurry.

    Like I said, I appreciate your thoughts on the subject but will continue to use and advertise the power of the SQL extensions for whatever database engine I happen to be working in to get the best performance possible. Besides, I've only had to port code once since 1995 and it just isn't that difficult to do. It's just not worth crippling the performance of code by avoiding the proprietary extensions just on the very remote chance that you might have to port code someday.

    BWAAA-HAAAA!!! Imagine it if someone told you you couldn't use the code from the previously cited page 6 because it wouldn't port to SQL Server. That would be a fun moment to watch! :w00t:

    Got features? Use them! 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • cs_troyk (11/13/2012)


    Excellent article, Jeff! I look forward to Thursday's installment.

    I've played around with mixing these two representations of hierarchies before, but had never gotten past the performance issues, so it's great to have a new approach to try. It seems that I keep inheriting db implementations that contain at least one hierarchy in them, so having more choices for dealing with them is a very good thing!

    For those that are willing to give up a little in terms of understandability (which can be dealt with via abstractions, anyway), and who are working with fairly constrained hierarchy sizes, take some time to investigate the matrix path encoding technique that Tropashko describes. You can think of it as a "best of both worlds" approach to the problem that doesn't have the heavy recalculation requirements. I was also able to obtain favorable performance results for many of the descendant- and ancestor-type queries. If nothing else, it will expand your db math skillset:-D

    And finally, for those with some disk space to burn, you might consider materializing the transitive closure of the hierarchy. This can really net you some performance if you have a large hierarchy that is used primarily for reads (the materialization is a rather expensive process).

    -TroyK

    Thanks for the great feedback, Troy. I'll take a look at the link you so kindly provided.

    Shifting gears, I hope Thursday's article will help solve some of the problems for large hierarchies that are read heavy.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Theo Ekelmans (11/13/2012)


    LOL, gimme thor class hammer!!!

    The project i'm working on is a roof of concept SNMP MIB database, that is so fast it can do (near)realtime SNMP data packet to SQL data conversion

    Just in case your new to the extensible SNMP MIB tree:

    If E-mail addresses were OIDs... user@ordina.org would have been something like: user@ordina.enterprises.private.internet.dod.org.iso

    or in non-translated notation: user@99999.1.4.1.6.3.1

    except that we write the top-most part at the left: 1.3.6.1.4.1.99999.117.115.101.114

    An OID is just a unique key (within one managed device) for one piece of information Ensures vendors don't have conflicting OIDs

    OID corresponds to a label .1.3.6.1.2.1.1.5 => sysName (in windows this is the servername)

    The complete (translated) path: .iso.org.dod.internet.mgmt.mib-2.system.sysName

    The info needed to get from one to an other is "mib (database) files", that are supplied with each new device / software version.

    The provided MIB files all have a part of the ginormous jigsaw MIB tree, and that needs to be searched between 500 to 20.000 times a minute.

    Fun project indeed 🙂

    Sounds VERY interesting and fun. Thank you very much for the overview on your project. Unfortunately and now that I see what you're doing, I don't believe this coming Thursday's article will help you much. If level based counts and sums are important, it might.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Cyclone (11/14/2012)


    Hi ...

    This may be a silly question, but how does one use the dbo.Hierarchy table ?

    Thanks.

    If you know absolutely nothing about HierarchyID this is a good place to start: http://www.sqlservercentral.com/articles/SQL+Server+2008/62204/

  • IgorMi (11/13/2012)


    Hi Jeff,

    Simply brilliant!

    Thanks

    IgorMi

    Thanks, Igor. I really appreciate the compliment. :blush:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Michael Meierruth (11/13/2012)


    Jeff,

    In the code for generating the tally table, you use this syntax:

    row_number() over (order by (select null))

    I have never see this before. But it works. What is this '(select null)' trick all about?

    PS

    Very, very neat article!

    Thanks, Michael. And if I haven't said it recently, thanks again for the help way back on the "reducing spaces" discussion from a couple of years ago.

    jimbobmcgee nailed the explanation of the "SELECT null". The ORDER BY doesn't like naked constants even if they're not numeric. Let's hope that MS doesn't deprecate that because it'll tick off a whole lot of people that are using it in a whole lot of places.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • jimbobmcgee (11/13/2012)


    Michael Meierruth (11/13/2012)


    What is this '(select null)' trick all about?

    It's a trick to 'order' by constant value, i.e. to not order/sort at all; in this case it adds a row number number based on the natural order that the processor hit the rows, rather than pre-sorting by a given column(set). You can't just do OVER(ORDER BY 1), because it interprets it as attempting to use the first column in the select, which OVER() won't let you do, but it does seem to let you get away with OVER(ORDER BY (SELECT 1)) [or, in this case, OVER(ORDER BY (SELECT NULL))].

    I can't see in the BOL syntax tree where a select is possible (bar some vague reference to an 'expression') but it's been stable enough for me in the past. Maybe someone with a keener understanding of the nuances of the BOL can spot the bit that makes it a 'documented feature'.

    In any case, I have often wondered if it might be equivalent to `OVER (ORDER BY the_clustered_index_col)` (at least for a plain table), or whether that would incur more reads. I have never bothered to try because OVER(ORDER BY (SELECT NULL)) always worked well enough for me.

    Thank you for the fine cover on this question. I had a heck of a time at work today because of some problems with some 3rd party code. Didn't get a fix in until just a couple of hours ago.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • jimbobmcgee (11/13/2012)


    Michael Meierruth (11/13/2012)


    What is this '(select null)' trick all about?

    It's a trick to 'order' by constant value, i.e. to not order/sort at all; in this case it adds a row number number based on the natural order that the processor hit the rows, rather than pre-sorting by a given column(set). You can't just do OVER(ORDER BY 1), because it interprets it as attempting to use the first column in the select, which OVER() won't let you do, but it does seem to let you get away with OVER(ORDER BY (SELECT 1)) [or, in this case, OVER(ORDER BY (SELECT NULL))].

    I can't see in the BOL syntax tree where a select is possible (bar some vague reference to an 'expression') but it's been stable enough for me in the past. Maybe someone with a keener understanding of the nuances of the BOL can spot the bit that makes it a 'documented feature'.

    In any case, I have often wondered if it might be equivalent to `OVER (ORDER BY the_clustered_index_col)` (at least for a plain table), or whether that would incur more reads. I have never bothered to try because OVER(ORDER BY (SELECT NULL)) always worked well enough for me.

    I'll keep this in my drawer of tricks.

    In fact, in place of '(select 1)' you can use anything that's not a constant, e.g. @@servername, getdate(), etc. (but for some strange reason it doesn't like @@datefirst)

    In fact a table containing a single int column which is also an identity column cannot be filled using 'insert into'.

  • Michael Meierruth (11/14/2012)


    Cyclone (11/14/2012)


    Hi ...

    This may be a silly question, but how does one use the dbo.Hierarchy table ?

    Thanks.

    If you know absolutely nothing about HierarchyID this is a good place to start: http://www.sqlservercentral.com/articles/SQL+Server+2008/62204/

    I believe the question was actually about how to use the table from the article... not the HierarchyID.

    @Cyclone,

    To answer the question that I think your asking, that's some of the "hundreds of chapters and thousands of articles" thing that I talked about in the article. The final dbo.Hierarchy table ends up having 3 different types of hierarical structures (Adjacency List, Materialized Path, and Nested Sets) in it. Each structure has it's own advantages that are quite different between the 3 structures. The article summarized that the Adjacenncy List is very easy to maintain and that the Nested Sets screams for reporting performance. Since the subjects are so large, I recommend you get a copy of Joe Celko's book that both he and I have already mentioned. Joe Celko has also posted a wad of information about the Nested Sets contain in the dbo.Hierarchy table and you should be able to find that pretty easily by doing a search for "Celko Nested Sets".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 16 through 30 (of 99 total)

You must be logged in to reply to this topic. Login to reply