Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets

  • Thomas S. Trias (11/15/2012)


    Excellent article; I have used all three types of tree representations before. In one project, we used a materialized path, and I kind of wish I'd thought to use varbinary instead of a hex string. However, I just tested using an index on a materialized path as a hex string using 100000 rows of sample data generated with BuildLargeEmployeeTable, and the results were perhaps not so surprising.

    SELECT

    ISNULL(EmployeeID, 0) AS EmployeeID,

    ISNULL(ManagerID, 0) AS ManagerID,

    ISNULL(CONVERT(varchar(8000), sortpath, 2), '') as SortPath

    INTO Hierarchy2

    FROM Hierarchy

    ALTER TABLE Hierarchy2 ADD CONSTRAINT PK_Hierarchy2 PRIMARY KEY CLUSTERED (EmployeeID)

    CREATE INDEX SortPath ON Hierarchy2 (SortPath)

    SQL Server will use a string index on LIKE operations anchored at the begining, so this:

    SELECT * FROM Hierarchy2 WHERE SortPath LIKE '0000000100000002000000130000001F0000003F00000058000000650000009600000935%'

    ends up being more than three times more efficient than this:

    SELECT * FROM Hierarchy2 WHERE SUBSTRING(SortPath, 65, 8) = '00000935'

    Unfortunately, the index only supports the first 900 bytes of the SortPath. Also, since the Nested Sets provide the same benefits (fast ancestor / descendant access and sorting), and since the update to calculate them using the SortPath is going to touch every row anyway, going with the approach you've got looks best. 🙂

    I guess since the rebuild is so fast, you also don't need to consider using double-precision floating-point numbers as the bowers (which makes updating the bowers in place a bit more manageable, since you can keep subdividing the ranges up to a certain point).

    Keep up the great work!

    Thanks for the feedback, Thomas.

    On the index thing, it turns out that adding even the final indexes before the data is complete will actually slow the code down. As you point out, since we're updating the whole table with each pass, indexes serve no real benefit to performance until all is said and done.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Theo Ekelmans (11/13/2012)


    LOL, gimme thor class hammer!!!

    The project i'm working on is a roof of concept SNMP MIB database, that is so fast it can do (near)realtime SNMP data packet to SQL data conversion

    Just in case your new to the extensible SNMP MIB tree:

    If E-mail addresses were OIDs... user@ordina.org would have been something like: user@ordina.enterprises.private.internet.dod.org.iso

    or in non-translated notation: user@99999.1.4.1.6.3.1

    except that we write the top-most part at the left: 1.3.6.1.4.1.99999.117.115.101.114

    An OID is just a unique key (within one managed device) for one piece of information Ensures vendors don't have conflicting OIDs

    OID corresponds to a label .1.3.6.1.2.1.1.5 => sysName (in windows this is the servername)

    The complete (translated) path: .iso.org.dod.internet.mgmt.mib-2.system.sysName

    The info needed to get from one to an other is "mib (database) files", that are supplied with each new device / software version.

    The provided MIB files all have a part of the ginormous jigsaw MIB tree, and that needs to be searched between 500 to 20.000 times a minute.

    Fun project indeed 🙂

    I tried this once, so I feel your pain.

    The best I got was to store the OIDs flat in an indexed VARCHAR(1000) column and my metadata sparsely around that. Given that OIDs are essentially a materialized-path hierarchy, anyway, I was able to do parent>child navigation by `WHERE oid LIKE @parent_oid + '%'` and child>parent by `WHERE oid = LEFT(@child_oid, LEN(@child_oid) - CHARINDEX('.', REVERSE(@child_oid)))` (plus/minus a 1, here or there!).

    Maybe you could get some benefit from the binary approaches Jeff is using instead of the string-based approach I once took. I can't remember if there is an upper limit to the size of each 'part' of an OID but, if I were to guess, it would be a 0-64K 'word', which would convert to SMALLINT and on to BINARY(2) well enough (if it is actually a 0-4G dword, you would need to convert to INT and on to BINARY(4)).

    As part of your MIB-loading routine, split the OIDs by '.' [Jeff's earlier article on string-splitting (and the great discussion around it) at http://www.sqlservercentral.com/articles/Tally+Table/72993/ might be of help here], then convert each word to SMALLINT, then BINARY(2) and store the materialized path as concatenations of BINARY(2) into a indexed VARBINARY(2000).

    With a materialized path of fixed-size binary elements instead of variable-length string elements, you could get 'level' metadata from `DATALENGTH(oid) / 2`, child>parent from `SUBSTRING(@oid, 1, DATALENGTH(@oid) - 2)` and parent>child from `WHERE SUBSTRING(oid, 1, DATALENGTH(@parent_oid)) = @parent_oid`.

    If binary sorts are that good (and people here give me every reason to think that they are), you may be performant enough at this point.

  • Combined some of the code from your code in this article with your delimited split routine to solve a problem at work. Inspiration comes from many places.

    Have to give it to you Jeff, you are the Mentor of the Year in my book.

  • Lynn Pettis (11/28/2012)


    Combined some of the code from your code in this article with your delimited split routine to solve a problem at work. Inspiration comes from many places.

    Have to give it to you Jeff, you are the Mentor of the Year in my book.

    :blush: Thanks, Lynn. I really appreciate that. What did you end up doing?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (11/29/2012)


    Lynn Pettis (11/28/2012)


    Combined some of the code from your code in this article with your delimited split routine to solve a problem at work. Inspiration comes from many places.

    Have to give it to you Jeff, you are the Mentor of the Year in my book.

    :blush: Thanks, Lynn. I really appreciate that. What did you end up doing?

    Needed to sort DD Version Keys (x.x.x, like 2.1.10). The original code had no problems as it didn't have to work with a 2 digit value. When we went to 2.1.10 instead of 2.3.1, it was a problem. The original solution by another DBA used the OLE Automation Procedures to create a regex_replace function. Only problem is that there is a STIG requiring that these procedures be disabled.

    I'm sure there may be another solution, but combining the DelimitedSplit8K and the varbinary concatenation from this article I was able to generate a solution that correctly sorted the version keys, including a very unlikely set of numerical elements (x(n).x(n).x(n).x(n).x(n).x(n).x(n).x(n).x(n).x(n).x(n)) where n represents 1 or more digits.

    I wouldn't have thought of concatenating varbinary values had I not read this article.

  • Lynn,

    I see you too have run into this 'x.x.x...' problem.

    I have run into serveral end user applications where this was used to manage hierarchies. Of course, all 'order by' clauses became useless when one of the x values reached 10.

    I solved it by writing a function which converts 'x.x.x...' to two digit values starting at a certain level.

  • Lynn Pettis (11/29/2012)


    Jeff Moden (11/29/2012)


    Lynn Pettis (11/28/2012)


    Combined some of the code from your code in this article with your delimited split routine to solve a problem at work. Inspiration comes from many places.

    Have to give it to you Jeff, you are the Mentor of the Year in my book.

    :blush: Thanks, Lynn. I really appreciate that. What did you end up doing?

    Needed to sort DD Version Keys (x.x.x, like 2.1.10). The original code had no problems as it didn't have to work with a 2 digit value. When we went to 2.1.10 instead of 2.3.1, it was a problem. The original solution by another DBA used the OLE Automation Procedures to create a regex_replace function. Only problem is that there is a STIG requiring that these procedures be disabled.

    I'm sure there may be another solution, but combining the DelimitedSplit8K and the varbinary concatenation from this article I was able to generate a solution that correctly sorted the version keys, including a very unlikely set of numerical elements (x(n).x(n).x(n).x(n).x(n).x(n).x(n).x(n).x(n).x(n).x(n)) where n represents 1 or more digits.

    I wouldn't have thought of concatenating varbinary values had I not read this article.

    VERY cool. Thanks for posting the usage, Lynn.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff, I have been playing with your RebuildNestedSets_MM stored procedure and I have a question. Is it possible to sort the nodes so that the Left and Right bowers are sorted on another field in the employee table? For example on Employee Name. My issue is I have a large hierarchy and I need the nodes in a specific order and looking through your proc it appears to always build/sort on employeeId. So in your article Lynne comes before Bob because her employeeId is 2 and Bob's is 3. However, I would like to change that sort on name so Bob would appear before Lynne. It looks like part of the beauty of your method is the nodes are arranged by employeeId.

    If I am missing something obvious, please don't beat me up to much.

    -Matthew Erdmann

  • LordHz (3/13/2013)


    Jeff, I have been playing with your RebuildNestedSets_MM stored procedure and I have a question. Is it possible to sort the nodes so that the Left and Right bowers are sorted on another field in the employee table? For example on Employee Name. My issue is I have a large hierarchy and I need the nodes in a specific order and looking through your proc it appears to always build/sort on employeeId. So in your article Lynne comes before Bob because her employeeId is 2 and Bob's is 3. However, I would like to change that sort on name so Bob would appear before Lynne. It looks like part of the beauty of your method is the nodes are arranged by employeeId.

    If I am missing something obvious, please don't beat me up to much.

    -Matthew Erdmann

    Heh... that's the problem with articles like this. They're not as big as a book. 😀

    Someone smarter than me will probably figure out a better way but, yes, you can sort from left to right by level but not after you've built the nested sets. You have to do it before you build the nested sets and you can't use the EmployeeID for the sort path anymore. Instead, you have to sort by alphabetical position and assign new child/parent values to maintain each nodes position in the alphabetic hierarchy while still maintaining the parent/child hierarchy. That sounds real tough to do but it's actually not. You just need to assign a new sequential value to each employee in alphabetical order and then find the match of that new number as the new value for each manager. It does require a couple of extra steps (sorting is always expensive) but the code runs so fast, I don't see what an extra minute or two on a million rows would be much of an inconvenience. And, no... I wouldn't use a self joining CTE for this because that'll get calculated more than once. You need a table to store it in, anyway.

    I'll post the code in a couple of minutes.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Here's one way. I apologize for not taking the time to make a more optimal solution but it's been a long day and I'm too pooped to pop.

    Basically, you need to add an interim table to replace the original Employee table as the source of the recursive CTE to make the sort path. This new table has a SortedChild column to replace the EmployeeID in the sort path calculation and a Parent column to use in the join criteria along with SortedChild.

    SELECT SortedChild = ROW_NUMBER() OVER (ORDER BY EmployeeName)

    , Parent = NULL

    , EmployeeID

    , ManagerID

    , EmployeeName

    INTO #SortedEmployee

    FROM dbo.Employee

    ;

    GO

    CREATE UNIQUE CLUSTERED INDEX IX_SortedEmployee_Composite01

    ON #SortedEmployee (EmployeeID);

    UPDATE child

    SET child.Parent = parent.SortedChild

    FROM #SortedEmployee child

    JOIN #SortedEmployee parent

    ON parent.EmployeeID = child.ManagerID

    ;

    --===== Display what we end up with.

    -- This is not a part of the solution

    SELECT * FROM #SortedEmployee ORDER BY SortedChild

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    I had considered that and was working on something along those lines before I left work. For pure performance reasons, I would create a new sorted node mapping table that would maintain the sorted Id/ParentId pairs and a reference to the original Id/ParentId's. Hadn't gotten past that, so if you have a complete solution I would be grateful. Using your method for updating my nested sets has reduced my rebuild time from 18 seconds to 400 milliseconds (8000 nodes, ~10 levels of depth). The only thing missing is solving this sort issue. It would be simpler if the application I was building this for pulled only one level at a time, since I could sort the nodes however I like. But the spec is for a flat table that shows the entire hierarchy and is sorted on a different field.

    Thanks again,

    Matthew

  • Jeff,

    I implemented your sort solution and it is perfect. Only change I need to make is to update the hierarchy table to support the sortedId/sortedParentId and the original ID/parentID fields.

    Thanks for you assistance and knowledge,

    Matthew E.

  • Thanks for the feedback. Basically, great minds think alike because all I did was make a "mapping" table. You could certainly incorporate that into your original employee table but I didn't want to lock the whole table with the update/table scan that would occur. I didn't know how big your table would be.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    Not putting that back in the original table (mines not employee), I'm just maintaining the mapping in your hierarchy table. My thought is to make your stored procedure more generic, so I can simply pass in the table name, Id/ParentId and the sort field(s) and use dynamic SQL to create a the sorted mapping table. That would then update the rgt/lft extends (your bowers) back into the original table. That way the less knowledgeable developers could implement without having to modify the stored procedure. At least that's my thought, haven't implemented it yet. Most of our hierarchies are in the under 100k row range, most under 10k. We do a lot of hierarchies, product, geo, object, person, etc. This update has provided some great brain food.:-D

    Thanks again,

    Matthew

  • Hi Jeff, great article.

    I have one quibble: terminology; even an old relic like me knows that "best bower" and "small bower" have been superseded by "starboard bower" and "port bower" but I've never before heard or seen "right bower" and "left bower". 😀

    Tom

Viewing 15 posts - 46 through 60 (of 99 total)

You must be logged in to reply to this topic. Login to reply