Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets

  • Jeff Moden (11/14/2012)I believe the question was actually about how to use the table from the article... not the HierarchyID.

    Oops, you're right

  • SQLRNNR (11/13/2012)


    Nice job Jeff.

    Thanks, Jason. I appreciate it. It looks so simple now but this one took a long time to research and tweak.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Smashing article Jeff. Binary sorts rule! I have been using these with recursive CTEs for a few years now and they rock!!!

    James
    MCM [@TheSQLPimp]

  • Michael Meierruth (11/14/2012)


    In fact, in place of '(select 1)' you can use anything that's not a constant, e.g. @@servername, getdate(), etc. (but for some strange reason it doesn't like @@datefirst)

    Yes, my favourite has always been @@spid because it's quick to type.

    In fact a table containing a single int column which is also an identity column cannot be filled using 'insert into'.

    Unless you "set identity_insert mytable on" of course.

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • Michael Meierruth (11/14/2012)

    In fact a table containing a single int column which is also an identity column cannot be filled using 'insert into'.

    You can use INSERT INTO thetable DEFAULT VALUES;

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • James A Skipwith (11/14/2012)


    Smashing article Jeff. Binary sorts rule! I have been using these with recursive CTEs for a few years now and they rock!!!

    Thanks for the great feedback, James!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • MM and Mark,

    This is what I was looking for.

    select top 1000 identity(int,1,1) as N

    into dbo.HTally

    from master.sys.all_columns ac1

    cross join master.sys.all_columns ac2

    But you would then have to do an update to set N=(N-1)*4+1.

    So row_number() on (order by (select @@spid)) is fine with me.

  • Michael Meierruth (11/14/2012)


    MM and Mark,

    This is what I was looking for.

    select top 1000 identity(int,1,1) as N

    into dbo.HTally

    from master.sys.all_columns ac1

    cross join master.sys.all_columns ac2

    But you would then have to do an update to set N=(N-1)*4+1.

    So row_number() on (order by (select @@spid)) is fine with me.

    you can use

    row_number() over (order by @@spid)

    you don't need

    row_number() over (order by (select @@spid))

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • mister.magoo (11/14/2012)

    you can use

    row_number() over (order by @@spid)

    Yes, that's the one!

  • row_number() over (order by @@spid)

    I'm not against using obscure hacks, and there's always going to be a debate about the point at which you even classify something as an obscure hack. The discovery of a new, better way of doing things will start life as an obscure hack and eventually find its way into mainstream use. What I am against, is using obscure hacks without explaining what you're doing in your code. At some point Joe Rookie is going to inherit my code and is going to waste an awful lot of time working out why I'd ever order by @@spid if I haven't commented. I don't know where the line should be drawn regarding what to comment and what not to, but personally, I've always appreciated inheriting code where the original author went to a bit of effort.

    ...One of the symptoms of an approaching nervous breakdown is the belief that ones work is terribly important.... Bertrand Russell

  • GPO (11/14/2012)


    row_number() over (order by @@spid)

    I'm not against using obscure hacks, and there's always going to be a debate about the point at which you even classify something as an obscure hack. The discovery of a new, better way of doing things will start life as an obscure hack and eventually find its way into mainstream use. What I am against, is using obscure hacks without explaining what you're doing in your code. At some point Joe Rookie is going to inherit my code and is going to waste an awful lot of time working out why I'd ever order by @@spid if I haven't commented. I don't know where the line should be drawn regarding what to comment and what not to, but personally, I've always appreciated inheriting code where the original author went to a bit of effort.

    Each to their own, but to my mind, I see no difference between ordering by a static variable like @@SPID and a NULL set...both are hacks according to Books Online, which only allows for columns made available by the FROM clause...

    <ORDER BY Clause>

    Specifies the order to apply the ranking window function. For more information, see ORDER BY Clause (Transact-SQL).

    Important:

    When used in the context of a ranking window function, <ORDER BY Clause> can only refer to columns made available by the FROM clause. An integer cannot be specified to represent the position of the name or alias of a column in the select list.

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • @mister.magoo

    both are hacks

    Yes I totally agree. I was just using the @@spid as an example. My point (that if you're using obscure hacks you should explain what you're doing in your code) applies equally to using NULL in that context.

    ...One of the symptoms of an approaching nervous breakdown is the belief that ones work is terribly important.... Bertrand Russell

  • Since I was out a few days, I missed this article till today.

    Interesting methodology.

    Testing it on my workstation, I got 45 seconds for a million-row hierarchy, 30 nodes deep.

    On 10k rows, 21 deep, took 593 milliseconds.

    Methods I posted in my article in August took 3 milliseconds at 10k rows, 29 seconds at 1M rows, same data. I'm using HierarchyIDs the same way you use your SortBy column. When I modified to Varbinary(max) instead of HiearchyID, it stays under 10 milliseconds for 10k rows, but goes up to 49 seconds for 1M rows, using those methods.

    In the Varbinary(max) version, 70-75% of the run-time was generating the breadcrumb path (SortBy), and 25-30% in generating the Nested Sets from it.

    Per my tests, HierarchyID will only easily go to a depth of a little more than 400 levels, so my methods won't work precisely on really deep hierarchies. Anything less than 400 levels, I'd still stick with HierarchyID, and seriously think about upgrading to a supported version of SQL Server if you can't use them. Not always possible, but it is worth it if you can.

    Didn't get a chance to try yours out with a hierarchy table with n-top levels, instead of just 1, as per the comments in your code. Wanted to make sure I was using the same testing methodology, so didn't play with your test-hierarchy-generator, in order to try these kinds of permutations.

    Of course, I'm not entirely sure why you'd use both breadcrumbs and nested sets. If you're already breadcrumbing, you can query that directly and avoid nested sets completely. Same performance as nested sets, in most cases, unless your path is really, really long (since it's a binary sort, until it goes over 8000 datalength; indexes do binary sorts very, very well).

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Excellent article; I have used all three types of tree representations before. In one project, we used a materialized path, and I kind of wish I'd thought to use varbinary instead of a hex string. However, I just tested using an index on a materialized path as a hex string using 100000 rows of sample data generated with BuildLargeEmployeeTable, and the results were perhaps not so surprising.

    SELECT

    ISNULL(EmployeeID, 0) AS EmployeeID,

    ISNULL(ManagerID, 0) AS ManagerID,

    ISNULL(CONVERT(varchar(8000), sortpath, 2), '') as SortPath

    INTO Hierarchy2

    FROM Hierarchy

    ALTER TABLE Hierarchy2 ADD CONSTRAINT PK_Hierarchy2 PRIMARY KEY CLUSTERED (EmployeeID)

    CREATE INDEX SortPath ON Hierarchy2 (SortPath)

    SQL Server will use a string index on LIKE operations anchored at the begining, so this:

    SELECT * FROM Hierarchy2 WHERE SortPath LIKE '0000000100000002000000130000001F0000003F00000058000000650000009600000935%'

    ends up being more than three times more efficient than this:

    SELECT * FROM Hierarchy2 WHERE SUBSTRING(SortPath, 65, 8) = '00000935'

    Unfortunately, the index only supports the first 900 bytes of the SortPath. Also, since the Nested Sets provide the same benefits (fast ancestor / descendant access and sorting), and since the update to calculate them using the SortPath is going to touch every row anyway, going with the approach you've got looks best. 🙂

    I guess since the rebuild is so fast, you also don't need to consider using double-precision floating-point numbers as the bowers (which makes updating the bowers in place a bit more manageable, since you can keep subdividing the ranges up to a certain point).

    Keep up the great work!

  • GSquared (11/15/2012)


    Since I was out a few days, I missed this article till today.

    Interesting methodology.

    Testing it on my workstation, I got 45 seconds for a million-row hierarchy, 30 nodes deep.

    On 10k rows, 21 deep, took 593 milliseconds.

    Methods I posted in my article in August took 3 milliseconds at 10k rows, 29 seconds at 1M rows, same data. I'm using HierarchyIDs the same way you use your SortBy column. When I modified to Varbinary(max) instead of HiearchyID, it stays under 10 milliseconds for 10k rows, but goes up to 49 seconds for 1M rows, using those methods.

    In the Varbinary(max) version, 70-75% of the run-time was generating the breadcrumb path (SortBy), and 25-30% in generating the Nested Sets from it.

    Per my tests, HierarchyID will only easily go to a depth of a little more than 400 levels, so my methods won't work precisely on really deep hierarchies. Anything less than 400 levels, I'd still stick with HierarchyID, and seriously think about upgrading to a supported version of SQL Server if you can't use them. Not always possible, but it is worth it if you can.

    Didn't get a chance to try yours out with a hierarchy table with n-top levels, instead of just 1, as per the comments in your code. Wanted to make sure I was using the same testing methodology, so didn't play with your test-hierarchy-generator, in order to try these kinds of permutations.

    Of course, I'm not entirely sure why you'd use both breadcrumbs and nested sets. If you're already breadcrumbing, you can query that directly and avoid nested sets completely. Same performance as nested sets, in most cases, unless your path is really, really long (since it's a binary sort, until it goes over 8000 datalength; indexes do binary sorts very, very well).

    Thanks for the feedback, Gus. The bread crumbing was just to make the calculation of the Nested Sets easy and could actually be removed from the final table. I left it there for two reasons... 1) as a training artifact and 2) beccause just like you say, there's a use in those breadcrumbs that will be demmonstrated in part 2.

    So far as upgrading to 2008 or better to make use of the HierarchyID datatype goes, some people just don't have that option.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Viewing 15 posts - 31 through 45 (of 99 total)

    You must be logged in to reply to this topic. Login to reply