Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase «««34567»»»

Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets Expand / Collapse
Author
Message
Posted Wednesday, November 14, 2012 7:37 AM
Mr or Mrs. 500

Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500

Group: General Forum Members
Last Login: Today @ 12:21 PM
Points: 523, Visits: 2,017
mister.magoo (11/14/2012)
you can use
row_number() over (order by @@spid) 


Yes, that's the one!
Post #1384628
Posted Wednesday, November 14, 2012 2:50 PM


Right there with Babe

Right there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with Babe

Group: General Forum Members
Last Login: Friday, April 18, 2014 4:23 PM
Points: 778, Visits: 1,505
row_number() over (order by @@spid)


I'm not against using obscure hacks, and there's always going to be a debate about the point at which you even classify something as an obscure hack. The discovery of a new, better way of doing things will start life as an obscure hack and eventually find its way into mainstream use. What I am against, is using obscure hacks without explaining what you're doing in your code. At some point Joe Rookie is going to inherit my code and is going to waste an awful lot of time working out why I'd ever order by @@spid if I haven't commented. I don't know where the line should be drawn regarding what to comment and what not to, but personally, I've always appreciated inheriting code where the original author went to a bit of effort.




One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important.
Bertrand Russell
Post #1384883
Posted Wednesday, November 14, 2012 4:51 PM


SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Today @ 4:50 PM
Points: 1,663, Visits: 5,230
GPO (11/14/2012)
row_number() over (order by @@spid)


I'm not against using obscure hacks, and there's always going to be a debate about the point at which you even classify something as an obscure hack. The discovery of a new, better way of doing things will start life as an obscure hack and eventually find its way into mainstream use. What I am against, is using obscure hacks without explaining what you're doing in your code. At some point Joe Rookie is going to inherit my code and is going to waste an awful lot of time working out why I'd ever order by @@spid if I haven't commented. I don't know where the line should be drawn regarding what to comment and what not to, but personally, I've always appreciated inheriting code where the original author went to a bit of effort.


Each to their own, but to my mind, I see no difference between ordering by a static variable like @@SPID and a NULL set...both are hacks according to Books Online, which only allows for columns made available by the FROM clause...

<ORDER BY Clause>
Specifies the order to apply the ranking window function. For more information, see ORDER BY Clause (Transact-SQL).

Important:
When used in the context of a ranking window function, <ORDER BY Clause> can only refer to columns made available by the FROM clause. An integer cannot be specified to represent the position of the name or alias of a column in the select list.


MM


  • MMGrid Addin
  • MMNose Addin


  • Forum Etiquette: How to post Reporting Services problems
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • How to Post Performance Problems - by Gail Shaw

  • Post #1384928
    Posted Wednesday, November 14, 2012 5:10 PM


    Right there with Babe

    Right there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with Babe

    Group: General Forum Members
    Last Login: Friday, April 18, 2014 4:23 PM
    Points: 778, Visits: 1,505
    @mister.magoo
    both are hacks


    Yes I totally agree. I was just using the @@spid as an example. My point (that if you're using obscure hacks you should explain what you're doing in your code) applies equally to using NULL in that context.




    One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important.
    Bertrand Russell
    Post #1384934
    Posted Thursday, November 15, 2012 1:59 PM


    SSCoach

    SSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoach

    Group: General Forum Members
    Last Login: Yesterday @ 10:04 AM
    Points: 15,442, Visits: 9,590
    Since I was out a few days, I missed this article till today.

    Interesting methodology.

    Testing it on my workstation, I got 45 seconds for a million-row hierarchy, 30 nodes deep.

    On 10k rows, 21 deep, took 593 milliseconds.

    Methods I posted in my article in August took 3 milliseconds at 10k rows, 29 seconds at 1M rows, same data. I'm using HierarchyIDs the same way you use your SortBy column. When I modified to Varbinary(max) instead of HiearchyID, it stays under 10 milliseconds for 10k rows, but goes up to 49 seconds for 1M rows, using those methods.

    In the Varbinary(max) version, 70-75% of the run-time was generating the breadcrumb path (SortBy), and 25-30% in generating the Nested Sets from it.

    Per my tests, HierarchyID will only easily go to a depth of a little more than 400 levels, so my methods won't work precisely on really deep hierarchies. Anything less than 400 levels, I'd still stick with HierarchyID, and seriously think about upgrading to a supported version of SQL Server if you can't use them. Not always possible, but it is worth it if you can.

    Didn't get a chance to try yours out with a hierarchy table with n-top levels, instead of just 1, as per the comments in your code. Wanted to make sure I was using the same testing methodology, so didn't play with your test-hierarchy-generator, in order to try these kinds of permutations.

    Of course, I'm not entirely sure why you'd use both breadcrumbs and nested sets. If you're already breadcrumbing, you can query that directly and avoid nested sets completely. Same performance as nested sets, in most cases, unless your path is really, really long (since it's a binary sort, until it goes over 8000 datalength; indexes do binary sorts very, very well).


    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
    Post #1385356
    Posted Thursday, November 15, 2012 4:22 PM
    Forum Newbie

    Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

    Group: General Forum Members
    Last Login: Monday, February 25, 2013 3:33 PM
    Points: 2, Visits: 21
    Excellent article; I have used all three types of tree representations before. In one project, we used a materialized path, and I kind of wish I'd thought to use varbinary instead of a hex string. However, I just tested using an index on a materialized path as a hex string using 100000 rows of sample data generated with BuildLargeEmployeeTable, and the results were perhaps not so surprising.

    SELECT
    ISNULL(EmployeeID, 0) AS EmployeeID,
    ISNULL(ManagerID, 0) AS ManagerID,
    ISNULL(CONVERT(varchar(8000), sortpath, 2), '') as SortPath
    INTO Hierarchy2
    FROM Hierarchy

    ALTER TABLE Hierarchy2 ADD CONSTRAINT PK_Hierarchy2 PRIMARY KEY CLUSTERED (EmployeeID)

    CREATE INDEX SortPath ON Hierarchy2 (SortPath)

    SQL Server will use a string index on LIKE operations anchored at the begining, so this:

    SELECT * FROM Hierarchy2 WHERE SortPath LIKE '0000000100000002000000130000001F0000003F00000058000000650000009600000935%'

    ends up being more than three times more efficient than this:

    SELECT * FROM Hierarchy2 WHERE SUBSTRING(SortPath, 65, 8) = '00000935'

    Unfortunately, the index only supports the first 900 bytes of the SortPath. Also, since the Nested Sets provide the same benefits (fast ancestor / descendant access and sorting), and since the update to calculate them using the SortPath is going to touch every row anyway, going with the approach you've got looks best.

    I guess since the rebuild is so fast, you also don't need to consider using double-precision floating-point numbers as the bowers (which makes updating the bowers in place a bit more manageable, since you can keep subdividing the ranges up to a certain point).

    Keep up the great work!
    Post #1385396
    Posted Thursday, November 15, 2012 11:18 PM


    SSC-Dedicated

    SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

    Group: General Forum Members
    Last Login: Today @ 5:22 PM
    Points: 36,016, Visits: 30,308
    GSquared (11/15/2012)
    Since I was out a few days, I missed this article till today.

    Interesting methodology.

    Testing it on my workstation, I got 45 seconds for a million-row hierarchy, 30 nodes deep.

    On 10k rows, 21 deep, took 593 milliseconds.

    Methods I posted in my article in August took 3 milliseconds at 10k rows, 29 seconds at 1M rows, same data. I'm using HierarchyIDs the same way you use your SortBy column. When I modified to Varbinary(max) instead of HiearchyID, it stays under 10 milliseconds for 10k rows, but goes up to 49 seconds for 1M rows, using those methods.

    In the Varbinary(max) version, 70-75% of the run-time was generating the breadcrumb path (SortBy), and 25-30% in generating the Nested Sets from it.

    Per my tests, HierarchyID will only easily go to a depth of a little more than 400 levels, so my methods won't work precisely on really deep hierarchies. Anything less than 400 levels, I'd still stick with HierarchyID, and seriously think about upgrading to a supported version of SQL Server if you can't use them. Not always possible, but it is worth it if you can.

    Didn't get a chance to try yours out with a hierarchy table with n-top levels, instead of just 1, as per the comments in your code. Wanted to make sure I was using the same testing methodology, so didn't play with your test-hierarchy-generator, in order to try these kinds of permutations.

    Of course, I'm not entirely sure why you'd use both breadcrumbs and nested sets. If you're already breadcrumbing, you can query that directly and avoid nested sets completely. Same performance as nested sets, in most cases, unless your path is really, really long (since it's a binary sort, until it goes over 8000 datalength; indexes do binary sorts very, very well).


    Thanks for the feedback, Gus. The bread crumbing was just to make the calculation of the Nested Sets easy and could actually be removed from the final table. I left it there for two reasons... 1) as a training artifact and 2) beccause just like you say, there's a use in those breadcrumbs that will be demmonstrated in part 2.

    So far as upgrading to 2008 or better to make use of the HierarchyID datatype goes, some people just don't have that option.


    --Jeff Moden
    "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

    "Change is inevitable. Change for the better is not." -- 04 August 2013
    (play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

    Helpful Links:
    How to post code problems
    How to post performance problems
    Post #1385479
    Posted Thursday, November 15, 2012 11:22 PM


    SSC-Dedicated

    SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

    Group: General Forum Members
    Last Login: Today @ 5:22 PM
    Points: 36,016, Visits: 30,308
    Thomas S. Trias (11/15/2012)
    Excellent article; I have used all three types of tree representations before. In one project, we used a materialized path, and I kind of wish I'd thought to use varbinary instead of a hex string. However, I just tested using an index on a materialized path as a hex string using 100000 rows of sample data generated with BuildLargeEmployeeTable, and the results were perhaps not so surprising.

    SELECT
    ISNULL(EmployeeID, 0) AS EmployeeID,
    ISNULL(ManagerID, 0) AS ManagerID,
    ISNULL(CONVERT(varchar(8000), sortpath, 2), '') as SortPath
    INTO Hierarchy2
    FROM Hierarchy

    ALTER TABLE Hierarchy2 ADD CONSTRAINT PK_Hierarchy2 PRIMARY KEY CLUSTERED (EmployeeID)

    CREATE INDEX SortPath ON Hierarchy2 (SortPath)

    SQL Server will use a string index on LIKE operations anchored at the begining, so this:

    SELECT * FROM Hierarchy2 WHERE SortPath LIKE '0000000100000002000000130000001F0000003F00000058000000650000009600000935%'

    ends up being more than three times more efficient than this:

    SELECT * FROM Hierarchy2 WHERE SUBSTRING(SortPath, 65, 8) = '00000935'

    Unfortunately, the index only supports the first 900 bytes of the SortPath. Also, since the Nested Sets provide the same benefits (fast ancestor / descendant access and sorting), and since the update to calculate them using the SortPath is going to touch every row anyway, going with the approach you've got looks best.

    I guess since the rebuild is so fast, you also don't need to consider using double-precision floating-point numbers as the bowers (which makes updating the bowers in place a bit more manageable, since you can keep subdividing the ranges up to a certain point).

    Keep up the great work!


    Thanks for the feedback, Thomas.

    On the index thing, it turns out that adding even the final indexes before the data is complete will actually slow the code down. As you point out, since we're updating the whole table with each pass, indexes serve no real benefit to performance until all is said and done.


    --Jeff Moden
    "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

    "Change is inevitable. Change for the better is not." -- 04 August 2013
    (play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

    Helpful Links:
    How to post code problems
    How to post performance problems
    Post #1385480
    Posted Saturday, November 17, 2012 7:23 PM


    SSCommitted

    SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

    Group: General Forum Members
    Last Login: Tuesday, January 15, 2013 11:11 AM
    Points: 1,945, Visits: 2,782
    To wit, even you write non-portable code. For example, the very first piece of code you offer in your fine book titled "Trees and Hierarchies in SQL for Smarties" (1st edition) has the following code on page 6.


    That code is Pascal and I think it still ports. I dropped in it the second edition in favor of SQL/PSM code.


    Books in Celko Series for Morgan-Kaufmann Publishing
    Analytics and OLAP in SQL
    Data and Databases: Concepts in Practice
    Data, Measurements and Standards in SQL
    SQL for Smarties
    SQL Programming Style
    SQL Puzzles and Answers
    Thinking in Sets
    Trees and Hierarchies in SQL
    Post #1386028
    Posted Thursday, November 22, 2012 12:26 PM
    Valued Member

    Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

    Group: General Forum Members
    Last Login: Wednesday, April 02, 2014 1:24 PM
    Points: 62, Visits: 723
    Theo Ekelmans (11/13/2012)
    LOL, gimme thor class hammer!!!

    The project i'm working on is a roof of concept SNMP MIB database, that is so fast it can do (near)realtime SNMP data packet to SQL data conversion

    Just in case your new to the extensible SNMP MIB tree:

    If E-mail addresses were OIDs... user@ordina.org would have been something like: user@ordina.enterprises.private.internet.dod.org.iso
    or in non-translated notation: user@99999.1.4.1.6.3.1

    except that we write the top-most part at the left: 1.3.6.1.4.1.99999.117.115.101.114

    An OID is just a unique key (within one managed device) for one piece of information Ensures vendors don't have conflicting OIDs

    OID corresponds to a label .1.3.6.1.2.1.1.5 => sysName (in windows this is the servername)

    The complete (translated) path: .iso.org.dod.internet.mgmt.mib-2.system.sysName

    The info needed to get from one to an other is "mib (database) files", that are supplied with each new device / software version.

    The provided MIB files all have a part of the ginormous jigsaw MIB tree, and that needs to be searched between 500 to 20.000 times a minute.

    Fun project indeed :)


    I tried this once, so I feel your pain.

    The best I got was to store the OIDs flat in an indexed VARCHAR(1000) column and my metadata sparsely around that. Given that OIDs are essentially a materialized-path hierarchy, anyway, I was able to do parent>child navigation by `WHERE oid LIKE @parent_oid + '%'` and child>parent by `WHERE oid = LEFT(@child_oid, LEN(@child_oid) - CHARINDEX('.', REVERSE(@child_oid)))` (plus/minus a 1, here or there!).

    Maybe you could get some benefit from the binary approaches Jeff is using instead of the string-based approach I once took. I can't remember if there is an upper limit to the size of each 'part' of an OID but, if I were to guess, it would be a 0-64K 'word', which would convert to SMALLINT and on to BINARY(2) well enough (if it is actually a 0-4G dword, you would need to convert to INT and on to BINARY(4)).

    As part of your MIB-loading routine, split the OIDs by '.' [Jeff's earlier article on string-splitting (and the great discussion around it) at http://www.sqlservercentral.com/articles/Tally+Table/72993/ might be of help here], then convert each word to SMALLINT, then BINARY(2) and store the materialized path as concatenations of BINARY(2) into a indexed VARBINARY(2000).

    With a materialized path of fixed-size binary elements instead of variable-length string elements, you could get 'level' metadata from `DATALENGTH(oid) / 2`, child>parent from `SUBSTRING(@oid, 1, DATALENGTH(@oid) - 2)` and parent>child from `WHERE SUBSTRING(oid, 1, DATALENGTH(@parent_oid)) = @parent_oid`.

    If binary sorts are that good (and people here give me every reason to think that they are), you may be performant enough at this point.
    Post #1387970
    « Prev Topic | Next Topic »

    Add to briefcase «««34567»»»

    Permissions Expand / Collapse