Finding Hierarchical Groups From The Same Table

  • SCENARIO: I have a table which will potentially have millions of rows. At the moment it has 624560 rows. Table format is:

    CREATE TABLE NRGroups (ID_ int primary key, PID int, Grp int)

    I am trying to process the rows in form of groups, e.g.

    ID_ | PID | Grp

     1     -     0

     4     9     0

     3     6     0

     7     5     0

     9     7     0

     6     -     0

     5     1     0

     2     -     0

    1, 5, 7, 9, 4 belong to one group, 6, 3 belong to next group, 2 is the only member in its group.

    i.e.

    ID_ | PID | Grp

     1     -     1

     4     9     1

     3     6     2

     7     5     1

     9     7     1

     6     -     2

     5     1     1

     2     -     3

    ps: Not every row has a parent/child

    I have written SQL Server code to organise the groups, with one nested loop inside - the nested loop finds the group and the outer loop gets the next record to find the group for.

    PROBELM: When I limit the loop to process 10000s rows it sooner or later processes them but when I get it to process the lot i.e. 624560 rows it seems to take over an hour to put them into groups!! Is there anyway to speed this process??

    CODE:

    DECLARE     @iReturnCode       int,

                @iNextRowId        int,

                @iCurrentRowId     int,

                @iLoopControl      int,

                @count             int,

                @grp_num           int,

                @current           int,

                @lvl               tinyint

    SELECT @iNextRowId = MIN(ID_) FROM NRGroups

    SELECT @iLoopControl = 1

    SELECT @grp_num = 0

    WHILE @iLoopControl = 1 -- OUTER WHILE

       BEGIN

         ------------------FIND all group and UPDATE Grp field of NRGroups table----------------------------

            DECLARE @stack TABLE (item INT, lvl TINYINT)

            INSERT INTO @stack VALUES (@iNextRowId, 1)

            SELECT @lvl = 1

            SELECT @grp_num = @grp_num + 1

            WHILE @lvl > 0 -- INNER WHILE

               BEGIN

                  IF EXISTS (SELECT * FROM @stack WHERE lvl = @lvl)

                     BEGIN

                        SELECT @current = item FROM @stack WHERE lvl = @lvl

          UPDATE NRGroups set Grp = @grp_num where id_ = @current              

                        COMMIT

                        DELETE FROM @stack WHERE lvl = @lvl AND item = @current

         

                        INSERT @stack SELECT ID_, @lvl + 1 FROM NRGroups WHERE PID = @current

         

                        IF @@ROWCOUNT > 0

                           SELECT @lvl = @lvl + 1

                     END

                  ELSE

                     SELECT @lvl = @lvl - 1

            END -- INNER WHILE

         --------------------end of FIND-------------------------------------------------------------------

         -- Reset looping variable           

         SELECT @iCurrentRowId = @iNextRowId

         SELECT @iNextRowId = NULL           

         -- get the next Row_Id

         SELECT @iNextRowId = MIN(ID_) FROM NRGroups WHERE ID_ > @iCurrentRowId and Grp = 0

         -- did we get a valid next row id?

         IF ISNULL(@iNextRowId, 0) = 0

           BREAK

        

         --if @iNextRowId > 1000 break -- DEBUG CODE

    END-- OUTER WHILE

    COMMIT

  • Now you know why I cringe everytime one of my developers wants to do this! If you put a limit on how many levels deep you will support then you can do this in a set based way... IE: Right now you go to 4 levels deep in your data so you should be able to do a set based self join 4 levels deep to get the lowest levels group. Then do it for 3 levels and 2 levels. Over all it will be much faster than the loop you have above. Only problem is now you are limited to only 4 levels. Another option might be to create a user defined function that calls itself to get the group. Again I think you will find that the performance will be hurt.




    Gary Johnson
    Microsoft Natural Language Group
    DBA, Sr. DB Engineer

    This posting is provided "AS IS" with no warranties, and confers no rights. The opinions expressed in this post are my own and may not reflect that of my employer.

  • Here is an example of what I mean.

    SET NOCOUNT ON

    IF EXISTS(SELECT * FROM sysobjects WHERE id = object_id('NRGroups'))

        DROP TABLE NRGroups

    CREATE TABLE NRGroups

        (ID_ int primary key, PID int, Grp int)

    INSERT INTO NRGroups(ID_,PID,Grp)

    VALUES( 1,NULL,0)

    INSERT INTO NRGroups(ID_,PID,Grp)

    VALUES( 4,9,0)

    INSERT INTO NRGroups(ID_,PID,Grp)

    VALUES( 3,6,0)

    INSERT INTO NRGroups(ID_,PID,Grp)

    VALUES( 7,5,0)

    INSERT INTO NRGroups(ID_,PID,Grp)

    VALUES( 9,7,0)

    INSERT INTO NRGroups(ID_,PID,Grp)

    VALUES( 6,NULL,0)

    INSERT INTO NRGroups(ID_,PID,Grp)

    VALUES( 5,1,0)

    INSERT INTO NRGroups(ID_,PID,Grp)

    VALUES( 2,NULL,0)

    -- Create a temp table to get GroupIDs

    IF EXISTS(SELECT * FROM tempdb..sysobjects WHERE id = object_id('tempdb..#Grps'))

        DROP TABLE #Grps

    CREATE TABLE #Grps

        (

        ID_    int

        ,GRPID  int identity(1,1)

        )

    INSERT INTO #Grps (ID_)

    SELECT ID_

    FROM NRGroups NRG

    WHERE NRG.PID IS NULL

    -- Update Grp for Top Levels to the grpid from #Grps.

    UPDATE NRG

    SET Grp = g.GRPID

    FROM NRGroups NRG

        JOIN #Grps g ON NRG.ID_ = g.ID_

    WHERE NRG.PID IS NULL

           

    -- SELECT * FROM NRGroups

    -- Update Grp for 5 levels deep.

    UPDATE N5

    SET Grp = NRG.Grp

    --SELECT *

    FROM NRGroups NRG

        JOIN NRGroups N2 ON NRG.ID_ = N2.PID AND NRG.PID IS NULL

            JOIN NRGroups N3 ON N2.ID_ = N3.PID

                JOIN NRGroups N4 ON N3.ID_ = N4.PID

                JOIN NRGroups N5 ON N4.ID_ = N5.PID

    WHERE N5.Grp = 0

               

    -- SELECT * FROM NRGroups 

    -- Update Grp for 4 levels deep.

    UPDATE N4

    SET Grp = NRG.Grp

    --SELECT *

    FROM NRGroups NRG

        JOIN NRGroups N2 ON NRG.ID_ = N2.PID AND NRG.PID IS NULL

            JOIN NRGroups N3 ON N2.ID_ = N3.PID

                JOIN NRGroups N4 ON N3.ID_ = N4.PID

    WHERE N4.Grp = 0

    -- SELECT * FROM NRGroups 

    -- Update Grp for 3 levels deep.

    UPDATE N3

    SET Grp = NRG.Grp

    FROM NRGroups NRG

        JOIN NRGroups N2 ON NRG.ID_ = N2.PID AND NRG.PID IS NULL

            JOIN NRGroups N3 ON N2.ID_ = N3.PID

    WHERE N3.Grp = 0

    -- SELECT * FROM NRGroups           

    -- Update Grp for 2 levels deep.

    UPDATE N2

    SET Grp = NRG.Grp

    FROM NRGroups NRG

        JOIN NRGroups N2 ON NRG.ID_ = N2.PID AND NRG.PID IS NULL

    WHERE N2.Grp = 0

           

    SELECT * FROM NRGroups

         




    Gary Johnson
    Microsoft Natural Language Group
    DBA, Sr. DB Engineer

    This posting is provided "AS IS" with no warranties, and confers no rights. The opinions expressed in this post are my own and may not reflect that of my employer.

  • While Gary was posting....

    -- Example (No "level" limit) possibility

    -- May want to modify for performance & other reasons, but I hope the "jist" gets across

    -- Build test data

    If object_ID('NRGroupsTest') is not NULL Drop Table NRGroupsTest

    CREATE TABLE NRGroupsTest (ID_ int primary key, PID int, Grp int)

    insert Into NRGroupsTest ( ID_, PID, Grp)

        Select  1, NULL, 0  UNION ALL

        Select  4, 9, 0  UNION ALL

        Select  3, 6, 0  UNION ALL

        Select  7, 5, 0  UNION ALL

        Select  9, 7, 0  UNION ALL

        Select  6, NULL, 0  UNION ALL

        Select  5, 1, 0  UNION ALL

        Select  2, NULL, 0

    -- Clear #Temp Table

    If object_ID('Tempdb..#Temp') is not NULL Drop Table #Temp

    -- looping for each level

    Declare @DynSQL varchar(8000), @LVL Int, @Cnt Int

    Select @Cnt = 1

    While @Cnt > 0 Begin

        if @LVL Is Null Begin -- 1st time, root grp

            -- Make #Temp Table with "root" level records

            Select NRG.ID_ as ID, @LVL as LVL, identity(int, 1,1) Grp

            Into #Temp

            From NRGroupsTest NRG

            Where PID is NULL

        End Else Begin

        -- For Each Level add IDs to #Temp table

        Set Identity_Insert #Temp On -- so that data can be put into Grp Col

        Insert Into #Temp (ID, LVL, Grp)

            Select NRG.ID_, @LVL, T.Grp

            From NRGroupsTest NRG

            Join #Temp T

                On T.ID = NRG.PID and isNull(LVL, 0) = @LVL -1

        End

        -- If above did any work, @Cnt will be > 0, and loop continues

        Select @LVL = IsNull(@LVL, 0) + 1, @Cnt = IsNull(@LVL * 0 + @@rowcount, 1)

        Set Identity_Insert #Temp Off

    End

    -- for show purpose only

    select * from #Temp order by grp, ID

    /*

    -- Then

    -- Maybe

    Create Clustered Index #Temp on #Temp (ID, Grg)

    -- Update NRGroupsTest

    UPDATE NGR Set NGR.Grp = T.Grp

        From NRGroupsTest NGR

        join #Temp T On NGR.ID_ = T.ID

    */

    -- Cleanup

    If object_ID('Tempdb..#Temp') is not NULL Drop Table #Temp



    Once you understand the BITs, all the pieces come together

  • Very nice Thomas! I still think I would limit the depth level. Leaving it open will ultimately leave you in a bit of a performance and maintenance bind.




    Gary Johnson
    Microsoft Natural Language Group
    DBA, Sr. DB Engineer

    This posting is provided "AS IS" with no warranties, and confers no rights. The opinions expressed in this post are my own and may not reflect that of my employer.

  • Agree Gary, it can be simply limited with SELECT ... WHERE LVL < n in the ELSE portion.

    I was also thinking, maybe not use #Temp for all recs, but just to establish the "root" level GRP recs, Update base with those values, then UPDATE ... "Self Join" in the ELSE portion. It may run alot faster since the indecies on the base table can used / configured.

     



    Once you understand the BITs, all the pieces come together

  • Thankyou Gary and Thomas - I really appreciate you guys for taking the time and effort for assisting.

    I haven't tested Gary's solution but I have tested Thomas' piece of  code on 100000s rows of data and it is mega fast - that's without applying the changes (which Thomas has mentioned in his last message).

    Cheers

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply