Displaying Hierarchical Data

  • Comments posted to this topic are about the item Displaying Hierarchical Data

  • There is no need to make another scan on the table in the outer SELECT just to get things like the ManagerName.

    Also, if the CEO want's to be listed at the top, she should be included as the root node at the beginning of the return.  I suspect that she should also have her own department but that's another story.

    Here's the code that covers both of the points from above.


       WITH HierarchyList_CTE AS
    (
     SELECT  StaffID, StaffName, ManagerID, Department
            ,ManagerName = StaffName
            ,StaffLevel = 1
       FROM Reference.Staff
      WHERE ManagerID IS NULL
      UNION ALL
     SELECT  tbl.StaffID, tbl.StaffName, tbl.ManagerID, tbl.Department
            ,ManagerName = cte.StaffName
            ,StaffLevel = cte.StaffLevel + 1
       FROM Reference.Staff tbl
       JOIN HierarchyList_CTE cte
         ON tbl.ManagerID = cte.StaffID
    )
     SELECT Department, StaffName, ManagerName, StaffLevel
       FROM HierarchyList_CTE
    ;

    I could be wrong about the requirements but I also suspect that she wants to see the hierarchical order of direct reports in a ,manner similar to what you have in Figure 1.  The addition of a "SortPath" using a bit of concatenation will handle that nicely.  The SortPath in the following is based on the StaffID.


       WITH HierarchyList_CTE AS
    (
     SELECT  StaffID, StaffName, ManagerID, Department
            ,ManagerName = StaffName
            ,StaffLevel = 1
            ,SortPath = CONVERT(VARBINARY(4000),CONVERT(BINARY(4),StaffID))
       FROM Reference.Staff
      WHERE ManagerID IS NULL
      UNION ALL
     SELECT  tbl.StaffID, tbl.StaffName, tbl.ManagerID, tbl.Department
            ,ManagerName = cte.StaffName
            ,StaffLevel = cte.StaffLevel + 1
            ,SortPath = CONVERT(VARBINARY(4000),cte.SortPath + CONVERT(BINARY(4),tbl.StaffID))
       FROM Reference.Staff tbl
       JOIN HierarchyList_CTE cte
         ON tbl.ManagerID = cte.StaffID
    )
     SELECT Department, StaffName, ManagerName, StaffLevel
       FROM HierarchyList_CTE
      ORDER BY SortPath
    ;

    If the sort needs to be by name within each set of names in the hierarchy and still maintain the hierarchical structure, it will be less efficient but the following will do such a thing...


       WITH HierarchyList_CTE AS
    (
     SELECT  StaffID, StaffName, ManagerID, Department
            ,ManagerName = StaffName
            ,StaffLevel = 1
            ,SortPath = CONVERT(VARCHAR(8000),StaffName)
       FROM Reference.Staff
      WHERE ManagerID IS NULL
      UNION ALL
     SELECT  tbl.StaffID, tbl.StaffName, tbl.ManagerID, tbl.Department
            ,ManagerName = cte.StaffName
            ,StaffLevel = cte.StaffLevel + 1
            ,SortPath = CONVERT(VARCHAR(8000),cte.SortPath + tbl.StaffName)
       FROM Reference.Staff tbl
       JOIN HierarchyList_CTE cte
         ON tbl.ManagerID = cte.StaffID
    )
     SELECT Department, StaffName, ManagerName, StaffLevel
       FROM HierarchyList_CTE
      ORDER BY SortPath
    ;

    Here are the results from that code.
    

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • CONVERT(VARBINARY(4000),CONVERT(BINARY(4),StaffID))

    The code you provided looks very useful and thorough.  But why is the varbinary set so high?  Why binary at all?

  • Where's the correct staff table? The one listed has no StaffID or ManagerID.

  • RonKyle - Thursday, August 30, 2018 7:29 AM

    CONVERT(VARBINARY(4000),CONVERT(BINARY(4),StaffID))

    The code you provided looks very useful and thorough.  But why is the varbinary set so high?  Why binary at all?

    If you're talking to me, the answer is two fold...

    1. Storing a character based representation of the IDs is wasteful and slow, especially for what I use the "sortpath" for.
    2. Storing as varbinary allows each integer for the IDs to be stored/concatenated as a very predictable concatenation of BINARY(4) elements which makes for a very fast split of the hierarchical path to very quickly calculate the number of nodes in the "downline" of any given node for the very high performance conversion to nested sets and other calculations.

    As for why VARBINARY(4000), make it anything you want.   I used 4,000 (4000/4 = 1,000 levels) to make sure that most people would never run out room for levels between the given node and the root.  Considering that the hierarchical BOM for a 747 is only about 18 levels deep, 1,000 levels is overkill for most applications except MLM companies. 

    Please see the following articles for more information on the on how I handle DAGs (Directed Acyclic Graphs {Hierarchies}).  The two articles include one of the fastest methods ever developed for conversion from Adjacency Lists to Nested Sets and a very high performance method for doing most of the things people would ask of a hierarchy and storing it all in a pre-aggregated reference table.  To whet your appetite, both methods resolve a million node hierarchy in about 54 seconds, allow for the continued ease of maintenance through an Adjacency List, calculates the hierarchical path in the VARBINARY "sort path" column (for whatever you may need such a thing for), and creates the Nested Sets "bowers", all in the same table.

    Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets
    Hierarchies on Steroids #2: A Replacement for Nested Sets Calculations

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • RonKyle - Thursday, August 30, 2018 7:29 AM

    CONVERT(VARBINARY(4000),CONVERT(BINARY(4),StaffID))

    The code you provided looks very useful and thorough.  But why is the varbinary set so high?  Why binary at all?

    Each level of recursion will concatenate additional data to the SortPath, so the length of SortPath will grow depending on the number of levels in the hierarchical data.  This is why the query needs to anticipate long values for SortPath ("why varbinary set so high").

    Assuming StaffID is an int, it is 4 bytes long.  This needs to be concatenated to form SortPath.  So the query either needs to treat StaffID as binary, or to convert it to a string representation.  4 bytes is much shorter than the 11 string characters it would take to represent the longest possible int value (-2147483648)  This is "why binary at all".

    Another approach that I tend to prefer would be to use ROW_NUMBER() to generate a sequential value, and then cast that to as string and concatenate that with a delimiter.  I call this a HUID (hierarchical unique identifier).  It ends up being something like '3.2.1.2' or whatever:  3rd row in the first level of the hierarchy, 2nd row in the second level, 1st row in the third level, and so on.  Padding those sequential values allows for a Sort HUID, such as '0003.0002.0001.0002'.

    Whether it is better to cast a column that is part of the native data or to generate a synthetic sequence number is probably a matter of personal taste and the nature of the native data.

  • Jeff Moden - Thursday, August 30, 2018 8:04 AM

    RonKyle - Thursday, August 30, 2018 7:29 AM

    CONVERT(VARBINARY(4000),CONVERT(BINARY(4),StaffID))

    The code you provided looks very useful and thorough.  But why is the varbinary set so high?  Why binary at all?

    If you're talking to me, the answer is two fold...

    1. Storing a character based representation of the IDs is wasteful and slow, especially for what I use the "sortpath" for.
    2. Storing as varbinary allows each integer for the IDs to be stored/concatenated as a very predictable concatenation of BINARY(4) elements which makes for a very fast split of the hierarchical path to very quickly calculate the number of nodes in the "downline" of any given node for the very high performance conversion to nested sets and other calculations.

    As for why VARBINARY(4000), make it anything you want.   I used 4,000 (4000/4 = 1,000 levels) to make sure that most people would never run out room for levels between the given node and the root.  Considering that the hierarchical BOM for a 747 is only about 18 levels deep, 1,000 levels is overkill for most applications except MLM companies. 

    Please see the following articles for more information on the on how I handle DAGs (Directed Acyclic Graphs {Hierarchies}).  The two articles include one of the fastest methods ever developed for conversion from Adjacency Lists to Nested Sets and a very high performance method for doing most of the things people would ask of a hierarchy and storing it all in a pre-aggregated reference table.  To whet your appetite, both methods resolve a million node hierarchy in about 54 seconds, allow for the continued ease of maintenance through an Adjacency List, calculates the hierarchical path in the VARBINARY "sort path" column (for whatever you may need such a thing for), and creates the Nested Sets "bowers", all in the same table.

    Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets
    Hierarchies on Steroids #2: A Replacement for Nested Sets Calculations

    I was talking to you, although the subsequent poster also provided useful information.  This is all very interesting, and you have indeed whet my appetite for trying some of this on some recursions that we do here.  Thanks!

  • Larry Gilbert-474089 - Thursday, August 30, 2018 7:38 AM

    Where's the correct staff table? The one listed has no StaffID or ManagerID.

    If you go to the bottom of the article, you'll find a "Resources" section with a ZIP file in it that Adam was kind enough to provide for all of his articles.  That ZIP file contains all of the DDL to create the tables that his articles reference and the DML to populate them.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • And why hardcode the 'Finance'? 

    Just pass in a parameter @Dept and change the WHERE clause:

    WHERE   STF.Department = @Dept or @Dept = ''.

    I am creating a CTE for a Personnel/Staffing model and appreciate their flexibility.

  • Just to spice things up, check out this blog post I wrote back in 2010 (sheesh, was it THAT long ago??) that will display a hierarchy in a graphical way:
    http://bradsruminations.blogspot.com/2010/04/t-sql-tuesday-005-reporting.html

  • If this is the data you have to work with then it is a nice elegant solution 

    But when dealing with hierarchical data I prefer the Nested Set Model
    Put forth by Joe Celko. 
    I used this model in a previous life to relate countries -> manufacturing plants -> Assembly Lines -> Products built.
    I think it worked really well especially when it came to reporting.  It is flexible enough to report on any level.

    https://en.wikipedia.org/wiki/Nested_set_model

  • Mike Is Here - Friday, August 31, 2018 7:26 AM

    If this is the data you have to work with then it is a nice elegant solution 

    But when dealing with hierarchical data I prefer the Nested Set Model
    Put forth by Joe Celko. 
    I used this model in a previous life to relate countries -> manufacturing plants -> Assembly Lines -> Products built.
    I think it worked really well especially when it came to reporting.  It is flexible enough to report on any level.

    https://en.wikipedia.org/wiki/Nested_set_model

    This method seems like it has a lot of drawbacks.  Even the article states the use case is rare.

  • Brad Schulz - Thursday, August 30, 2018 4:48 PM

    Just to spice things up, check out this blog post I wrote back in 2010 (sheesh, was it THAT long ago??) that will display a hierarchy in a graphical way:
    http://bradsruminations.blogspot.com/2010/04/t-sql-tuesday-005-reporting.html

    Gosh.  So simple but really useful!  Thanks.

  • RonKyle - Friday, August 31, 2018 7:37 AM

    This method seems like it has a lot of drawbacks.  Even the article states the use case is rare.

    Well that is wikipedia for you.  I can't link to the book but this blog argues the opposite that the adjacency model has more limitations.
    I don't agree with the part of referential integrity stated in the wiki.  I just don't find that to be the case, if a node does not fit a particular data set then it would have no rows in the corresponding table.

    The issue with multiple inheritance is rare and technically could be handle in the nested model as long as the parent of the two types are not queried.  So in the example you can put Oak under trees and wood types and that would work as long as you don't query Objects which have Trees and Wood-Types, that would double count Oaks.
    But at that point I would create two hierarchical tables at that point since they represent two different subjects 
    Trees would be in a 'Source Type' hierarchical table and Wood-Types in a 'Material Type' hierarchical table

    I actually worked through this problem now that I am typing this out.  We had parts that were pre-assembled in the UK and assembled into a bigger machine in the USA.  This was not a problem until we try to report on the entire company (rarely done) that duplication of that part was an issue but even then you would aggregate the data on a part level then take distinct.

  • Mike Is Here - Friday, August 31, 2018 9:10 AM

    RonKyle - Friday, August 31, 2018 7:37 AM

    This method seems like it has a lot of drawbacks.  Even the article states the use case is rare.

    Well that is wikipedia for you.  I can't link to the book but this blog argues the opposite that the adjacency model has more limitations.
    I don't agree with the part of referential integrity stated in the wiki.  I just don't find that to be the case, if a node does not fit a particular data set then it would have no rows in the corresponding table.

    The issue with multiple inheritance is rare and technically could be handle in the nested model as long as the parent of the two types are not queried.  So in the example you can put Oak under trees and wood types and that would work as long as you don't query Objects which have Trees and Wood-Types, that would double count Oaks.
    But at that point I would create two hierarchical tables at that point since they represent two different subjects 
    Trees would be in a 'Source Type' hierarchical table and Wood-Types in a 'Material Type' hierarchical table

    I actually worked through this problem now that I am typing this out.  We had parts that were pre-assembled in the UK and assembled into a bigger machine in the USA.  This was not a problem until we try to report on the entire company (rarely done) that duplication of that part was an issue but even then you would aggregate the data on a part level then take distinct.

    The whole problem is the misunderstanding that people have about how to handle hierarchical data when Hierarchical data needs to be handled in an RDBMS...  There are 3 basic types of Hierarchical structures for "DAGs" (Directed Acyclic Graphs) where each node has one and only one parent.  Those are the Adjacency List, the Hierarchical Path, and Nested Sets.  Each has advantages and each has some serious disadvantages.

    1.  Adjacency Lists
    == Easy for humans to maintain and troubleshoot. Change just one manager ID to move a whole sub-tree, for example.
    == Not so easy to aggregate or do other things with.
    == Fairly slow because of it's near RBAR nature (not totally RBAR because it works with one iteration per level returned, if done correctly)
    == Easy and lightweight to index.

    2.  Hierarchical Path  (this is what the HierarchyID datatype is based on and comes in two flavors... positional concatenationand content concatenation).
    == Because each node is "aware" of it's entire upline, very difficult for humans to maintain and troubleshoot.
    == Requires special code to interpret the hierarchical path to return ancestors and descendants and can be horribly slow, which is why MS wrote CLRs for HierarchyID usage.  Without such special code, most humans would struggle with this type of hierarchical structure.
    == Can be difficult to aggregate or do other things with.
    == Requires some not so easy and not so lightweight indexes to get good performance with.
    == Easy and fastest to calculate uplines from because the hierarchical path contains the IDs of every node in the upline.

    3.  Nested Sets
    == Because each node is "aware" of ALL the other nodes in the entire hierarchy, this type of hierarchical structure is very difficult to maintain and humans mostly don't stand a prayer of troubleshooting it if something goes out of whack).
    == Incredibly easy and nasty fast to do a myriad of things including aggregates, etc.  Almost everything is nasty fast set based code with the only exception being the upline.
    == Unless you know the "key" for how to do it in a very high performance manner, Nested Sets can take literally days to rebuild from an Adjacency List (about 2 days for a lousy million nodes for BOMs, MLMs, etc.  Worse yet, the typical "push stack" method for doing such a thing is a terrible resource hog.  Because it takes so long to resolve the hierarchy, additional special code has to be written to add, move, or delete nodes.  One slip an kiss it all goodbye.

    And that's the problem.... none of the 3 structures are a panacea.  So, you actually need all 3.and that's what the two articles I previously provided the links for provide.  They also provide a method for created both the Hierarchical Path and the Nested Sets for a million node hierarchy in 54 seconds.

    Then, there's the "pre-aggregated" hierarch structure that I introduced in the second of the two links I provided and it also does a full rebuild with most everything you could ask of a hierarchy already answered in the table (think hierarchical data warehouse) in only 54 seconds.

    Don't shortcut yourself by settling on just one hierarchical structure.   Use all 3 or 4.  It takes remarkably little time now after we got rid of the highly RBAR push-stack method of creating Nested Sets.

    Maintain just the Adjacency List and let the code do the rest for you.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply