Introducing the Set-based Loop

  • Comments posted to this topic are about the item Introducing the Set-based Loop

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Hi Luis,

    Interesting technique, thanks for sharing.

    I must admit I'm surprised the CTE timings were so much higher.

    I'm trying to understand the rules that exist in the sample data using the test script provided.

    As I understand it, the Groups table contains all clients, their current group ID and their associated current membership ID for that group, yes?

    You say that a client can belong to multiple groups, but does that mean a client can belong to multiple groups at the same time, or only at different times (the latter represented by multiple chained GroupsHistory records)?

    Each client will have at least one entry in GroupsHistory, and if they have moved groups the PreviousGroupId and PreviousMemberId fields will be filled in. This forms a self join to the client's previous history record. When we locate the oldest history record (i.e. the one whose previous group/member id is null) we want to take that oldest group/member id pair and record it as the "original" on the Groups table. Is that right?
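
    In other words, I'm reading the "oldest" records as simply the rows with no previous link, something like this (just to check I've understood the schema):

    --My reading of the "oldest" history records: the rows with no previous link at all
    SELECT h.MemberId, h.GroupId, h.StartDate
    FROM dbo.GroupsHistory h
    WHERE h.PreviousGroupId IS NULL
      AND h.PreviousMemberId IS NULL;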

    If I am understanding correctly I would have expected the following assumptions to hold true:

    SELECT COUNT(DISTINCT [ClientId]), COUNT(*)
    FROM [dbo].[Groups];
    --Expected result: both values match
    --Actual result: both values match

    SELECT COUNT(DISTINCT g.[ClientId]), COUNT(*)
    FROM dbo.Groups g
    JOIN dbo.GroupsHistory h ON g.GroupId = h.GroupId AND g.MemberId = h.MemberId;
    --Expected result: both values match
    --Actual result: both values do not match. Indicates many-to-many.

    SELECT MemberId, GroupId, COUNT(ClientId)
    FROM [dbo].[Groups]
    GROUP BY MemberId, GroupId
    HAVING COUNT(*) > 1;
    --Expected result: no results
    --Actual result: 42K results. Also indicates many-to-many.

    How can the original member/group ids be deterministically identified given the apparent many-to-many relationships in the data? Am I missing a rule or something?

    (The answer may well be that I'm missing something due to my current state of illness / lack of sleep! 😉 )

    Thanks,

    Sam

  • Hi Sam,

    Thank you for reading the article and commenting. You're absolutely right: there's an error in the sample data generation. There shouldn't be a many-to-many relationship in this example. Ironically, a very similar problem was present in the real data, but I thought it wasn't worth including in the article.

    Here's the corrected sample data generator, with a change in the loop that inserts levels 2 to 6.

    IF OBJECT_ID('dbo.GroupsHistory') IS NOT NULL --Prevent errors on reruns
        DROP TABLE dbo.GroupsHistory;

    CREATE TABLE dbo.GroupsHistory( --Create the table with relevant columns
        MemberId         int  NOT NULL,
        GroupId          int  NOT NULL,
        PreviousMemberId int  NULL,
        PreviousGroupId  int  NULL,
        StartDate        date NOT NULL,
        Level            int  NOT NULL --This column wasn't available on the original problem. It's here to help with test data generation.
    );

    IF OBJECT_ID('dbo.Groups') IS NOT NULL --Prevent errors on reruns
        DROP TABLE dbo.Groups;

    CREATE TABLE dbo.Groups( --Create the table with relevant columns
        ClientId         int IDENTITY NOT NULL,
        MemberId         int NOT NULL,
        GroupId          int NOT NULL,
        OriginalGroupId  int NULL,
        OriginalMemberId int NULL
    );

    --Insert first level of test data
    WITH E(n) AS(
        SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n)
    ),
    TestData AS(
        SELECT ABS(CHECKSUM(NEWID())) % 100000 AS CurrentGroupId,
               DATEADD(dd, ABS(CHECKSUM(NEWID())) % 365, '20100101') AS StartDate
        FROM E a, E b, E c, E d, E f, E g
    )
    INSERT INTO GroupsHistory(MemberId, GroupId, StartDate, Level)
    SELECT ROW_NUMBER() OVER(PARTITION BY CurrentGroupId ORDER BY (SELECT NULL)),
           CurrentGroupId,
           StartDate,
           1
    FROM TestData;

    --Generate the next levels for test data
    DECLARE @Level int = 1;

    WHILE @Level < 6
    BEGIN
        WITH SubTestData AS(
            SELECT TOP 50 PERCENT
                   (ABS(CHECKSUM(NEWID())) % 100000) + (@Level * 100000) AS GroupId,
                   MemberId AS PreviousMemberId,
                   GroupId AS PreviousGroupId,
                   DATEADD(dd, ABS(CHECKSUM(NEWID())) % 365, StartDate) AS StartDate,
                   @Level + 1 AS Level
            FROM dbo.GroupsHistory
            WHERE Level = @Level
            ORDER BY NEWID()
        )
        INSERT INTO GroupsHistory(MemberId, GroupId, PreviousMemberId, PreviousGroupId, StartDate, Level)
        SELECT TOP 50 PERCENT
               ROW_NUMBER() OVER(PARTITION BY GroupId ORDER BY (SELECT NULL)),
               GroupId,
               PreviousMemberId,
               PreviousGroupId,
               StartDate,
               Level
        FROM SubTestData;

        SET @Level += 1;
    END;

    INSERT INTO dbo.Groups(GroupId, MemberId)
    SELECT GroupId, MemberId
    FROM dbo.GroupsHistory g
    WHERE NOT EXISTS(
              SELECT 1
              FROM dbo.GroupsHistory h
              WHERE g.GroupId = h.PreviousGroupId
                AND g.MemberId = h.PreviousMemberId);
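
    With this version of the generator, your duplicate check should come back clean. A quick way to verify after regenerating the data (just reusing your third query):

    --Sanity check: this should return no rows with the corrected generator
    SELECT MemberId, GroupId, COUNT(*) AS DuplicateRows
    FROM dbo.Groups
    GROUP BY MemberId, GroupId
    HAVING COUNT(*) > 1;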

    This change improves the performance of the rCTE, making it faster than the RBAR approach when indexes are used. However, I'm curious why the performance of the RBAR solution was negatively affected.

    These timings were taken on a different computer and shouldn't be compared directly to the ones posted in the article.

    Iteration With_Index RBAR_Seconds rCTE_Seconds While_Seconds
    --------- ---------- ------------ ------------ -------------
    1         0          NULL         86           11
    2         0          NULL         75           10
    3         0          NULL         77           11
    4         0          NULL         76           11
    5         0          NULL         74           12
    6         0          NULL         77           11
    7         0          NULL         80           11
    8         0          NULL         73           12
    9         0          NULL         79           11
    10        0          NULL         80           12
    NULL      0          NULL         77           11
    1         1          71           40           11
    2         1          71           42           9
    3         1          76           42           10
    4         1          70           42           10
    5         1          69           44           9
    6         1          74           41           10
    7         1          69           43           10
    8         1          70           43           12
    9         1          74           43           10
    10        1          75           42           10
    NULL      1          71           42           10

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Terrific article, and thanks for it! It nicely shows how plan optimisation influences run-times and efficiency.

  • GoofyGuy (7/27/2015)


    Terrific article, and thanks for it! It nicely shows how plan optimisation influences run-times and efficiency.

    Thank you for the feedback. 🙂

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • One niggle. The looping-set update is not technically RBAR since it is performing groups of set-based operations. Maybe it should be called SBAS. 😀

    I have used this method many a time to perform various tasks - and it is very useful in those cases. Even the use of a cursor is not always EVIL. It's a matter of having tested and documented it, as you instructed.

    Nice job!

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • SQLRNNR (7/27/2015)


    One niggle. The looping-set update is not technically RBAR since it is performing groups of set-based operations.

    That was exactly the point of this article, to differentiate RBAR from set-based loops.

    SQLRNNR (7/27/2015)


    Nice job!

    Thank you.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (7/27/2015)


    SQLRNNR (7/27/2015)


    One niggle. The looping-set update is not technically RBAR since it is performing groups of set-based operations.

    That was exactly the point of this article, to differentiate RBAR from set-based loops.

    Somehow I thought you were referring to the last three solutions as RBAR as well. Maybe it was the wording. Anywho...

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Good work indeed Luis, nicely done!

    😎

  • I also use this technique quite a bit. Besides being faster than the other methods I think it's easier to understand.

  • Hi Luis,

    It was an interesting article, highlighting the perceptions that have built up around RBAR and set-based data traversal in SQL. The main thing that stood out to me is the indexing applied to the table, which makes the traversal quicker because it can get the necessary information from the nonclustered indexes. So, ideally speaking, would any large, well-indexed data set support quick data access even when RBAR is implemented?
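
    For example (this is just my own guess at the kind of index I mean, not necessarily the one used in the article), something along these lines:

    --A hypothetical covering index to support the chain lookups without touching the base table
    CREATE NONCLUSTERED INDEX IX_GroupsHistory_Chain
        ON dbo.GroupsHistory (GroupId, MemberId)
        INCLUDE (PreviousGroupId, PreviousMemberId);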

    Correct me if I am wrong.

    Thanks

    With warm Regards

    Aswin Sankar M

  • Luis Cazares (7/27/2015)


    Hi Sam,

    Thank you for reading the article and commenting. You're absolutely right: there's an error in the sample data generation. There shouldn't be a many-to-many relationship in this example. Ironically, a very similar problem was present in the real data, but I thought it wasn't worth including in the article.

    ...

    This change improves the performance of the rCTE, making it faster than the RBAR approach when indexes are used. However, I'm curious why the performance of the RBAR solution was negatively affected.

    These timings were taken on a different computer and shouldn't be compared directly to the ones posted in the article.

    Hi Luis,

    Thanks for the quick update to the test data generator!

    I couldn't say why the RBAR solution was negatively affected (though I wonder if some index fragmentation might have crept in), but I have made an attempt at improving on the CTE approach.

    I've run out of spare time to be more thorough, but the best I could come up with (using a CTE to do the recursion) takes 18 seconds to execute.

    (This is on a modest VM I have handy, which achieved a similar 17-second result to your own tests for the SBAS approach.)

    But I did resort to some trickery:

    I split the process into two statements within a transaction: the rCTE for the recursion, followed by a simple update for the remainder.

    I also used a different pair of indexes and a join hint to bring the execution time down by a couple more seconds.

    (Note that I managed ~30 second execution using the rCTE for the full data set)

    I note that the test data set this time around resulted in 75% of clients having no history, so this approach is going to slow down (relative to other approaches) depending on the skew of the percentage of clients with history. This is because the simple update is faster than the recursive step.
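
    (A query along these lines shows that split, counting clients whose current membership row has no previous link:)

    --Rough skew check: how many clients' current membership rows have no previous link
    SELECT COUNT(*) AS TotalClients,
           SUM(CASE WHEN h.PreviousMemberId IS NULL THEN 1 ELSE 0 END) AS ClientsWithNoHistory
    FROM dbo.Groups g
    JOIN dbo.GroupsHistory h
      ON h.GroupId = g.GroupId
     AND h.MemberId = g.MemberId;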

    I do agree that your SBAS technique is going to be the most efficient/appropriate for the requirements at hand because it avoids some otherwise unavoidable costs, I just wanted to give the CTE a fighting chance.

    Here's my code:

    CREATE UNIQUE CLUSTERED INDEX UCI_GroupsHistory ON dbo.GroupsHistory(MemberId ASC, GroupId ASC);
    CREATE UNIQUE CLUSTERED INDEX CI_Groups ON dbo.Groups(MemberId ASC, GroupId ASC);

    DECLARE
        @SecondsrCTE int,
        @Time datetime;

    --Set the timer
    SET @Time = GETDATE();

    BEGIN TRAN;

    WITH rCTE_TD AS (
        SELECT
            GH_c.MemberId AS JoinMemberId,
            GH_c.GroupId AS JoinGroupId,
            GH_c.MemberId AS LevelMemberId,
            GH_c.GroupId AS LevelGroupId,
            GH_c.PreviousMemberId,
            GH_c.PreviousGroupId
        FROM dbo.GroupsHistory GH_c
        WHERE EXISTS (
                  SELECT 1
                  FROM dbo.Groups G
                  WHERE G.MemberId = GH_c.MemberId
                    AND G.GroupId = GH_c.GroupId
              )
          AND GH_c.PreviousMemberId IS NOT NULL

        UNION ALL

        SELECT
            R.JoinMemberId,
            R.JoinGroupId,
            GH_p.MemberId,
            GH_p.GroupId,
            GH_p.PreviousMemberId,
            GH_p.PreviousGroupId
        FROM rCTE_TD R
        INNER JOIN dbo.GroupsHistory GH_p
            ON R.PreviousMemberId = GH_p.MemberId
           AND R.PreviousGroupId = GH_p.GroupId
    )
    UPDATE G
    SET G.OriginalGroupId = rCTE_TD.LevelGroupId,
        G.OriginalMemberId = rCTE_TD.LevelMemberId
    FROM Groups G
    INNER HASH JOIN rCTE_TD
        ON rCTE_TD.JoinMemberId = G.MemberId
       AND rCTE_TD.JoinGroupId = G.GroupId
    WHERE rCTE_TD.PreviousMemberId IS NULL; --Assume these are the originals

    UPDATE G
    SET G.OriginalGroupId = GH.GroupId,
        G.OriginalMemberId = GH.MemberId
    FROM dbo.GroupsHistory GH
    INNER JOIN dbo.Groups G
        ON G.MemberId = GH.MemberId
       AND G.GroupId = GH.GroupId
       AND GH.PreviousMemberId IS NULL;

    COMMIT TRAN;

    --Get the time in seconds
    SET @SecondsrCTE = DATEDIFF(ss, @Time, GETDATE());

    SELECT @SecondsrCTE;

    If I get some more time I'll come back and post comparison results using the test harness.

    Thanks,

    Sam

  • Luis,

    Why doesn't the GroupsHistory table have a ClientId column?

    Why this weird hierarchy, which really isn't one? It's more like a series.

    Thanks for your great article. 🙂

    I had a problem with three WHILE loops and was searching for a better solution.

    Instead of using “The problematic solution”, I now use your “Set-based Loop”. Thanks for bringing this idea back to mind. 🙂

    But I have some trouble with your article.

    First point (this relates to the comment from sam.dahl):

    Code:

    select COUNT([Level]), MemberId, GroupId
    from GroupsHistory
    group by MemberId, GroupId
    order by 1 desc

    I got some combinations of MemberId and GroupId where the count was greater than 1. I selected one of them to understand your code.

    select * from GroupsHistory where GroupId=248498 and memberid=2

    MemberId GroupId PreviousMemberId PreviousGroupId StartDate  Level
    2        248498  15               134193          2012-05-04 3
    2        248498  1                135215          2010-12-25 3
    2        248498  2                181500          2012-01-13 3
    2        248498  5                138108          2011-08-21 3
    2        248498  8                136476          2011-09-20 3
    2        248498  6                162972          2011-08-11 3
    2        248498  3                182860          2011-11-26 3
    2        248498  11               172996          2011-12-25 3

    select * from GroupsHistory where PreviousGroupId=248498 and PreviousMemberId=2

    MemberId GroupId PreviousMemberId PreviousGroupId StartDate  Level
    6        324965  2                248498          2012-03-27 4
    4        324781  2                248498          2011-12-29 4
    10       309789  2                248498          2011-12-16 4
    3        314445  2                248498          2012-11-08 4
    8        323875  2                248498          2012-08-07 4
    9        308775  2                248498          2012-07-03 4

    If I have a client with MemberId = 6 and GroupId = 324965, the previous level is MemberId = 2 and GroupId = 248498.

    But which of these rows gives the previous-previous MemberId or previous-previous GroupId in “The problematic solution” or the “Set-based Loop”?

    Second point: instead of using the table “Groups” in the “Set-based Loop”, I created a table Groups2. Both solutions should give the same result set. But when I try this

    select * from groups2 g2 inner join groups g1 on g1.ClientId = g2.ClientId where g1.OriginalMemberId <> g2.OriginalMemberId

    I got 116,000 records, so there are some problems in the data. I can point out two problems (sbl = Set-based Loop; ps = problematic solution):

    1. Result set for client 9015 (the Groups2/sbl columns first, then the Groups/ps columns):

       g2: ClientId 9015, MemberId 5, GroupId 215869, OriginalGroupId 56615, OriginalMemberId 5
       g1: ClientId 9015, MemberId 5, GroupId 215869, OriginalGroupId 129218, OriginalMemberId 12

    select * from GroupsHistory where GroupId = 215869 and memberid = 5  -- both solutions, level 3
    select * from GroupsHistory where GroupId = 132273 and memberid = 5  -- level 2; sbl
    select * from GroupsHistory where GroupId = 129218 and memberid = 12 -- level 2; ps finished here
    select * from GroupsHistory where GroupId = 56615 and memberid = 5   -- level 1; sbl finished here
    select * from GroupsHistory where GroupId = 98116 and memberid = 12  -- level 1; ps never reaches this (why?)

    The two solutions use different branches of the history table. The problematic solution only goes one step down.

    2. Result set for client 530537 (the Groups2/sbl columns first, then the Groups/ps columns):

       g2: ClientId 530537, MemberId 8, GroupId 310674, OriginalGroupId 47018, OriginalMemberId 5
       g1: ClientId 530537, MemberId 8, GroupId 310674, OriginalGroupId 249062, OriginalMemberId 1

    select * from GroupsHistory where GroupId = 310674 and memberid = 8 -- level 4
    select * from GroupsHistory where GroupId = 249062 and memberid = 1 -- level 3; ps finished here
    select * from GroupsHistory where GroupId = 148545 and memberid = 5 -- two rows at level 2
    select * from GroupsHistory where GroupId = 47018 and memberid = 5  -- level 1; sbl finished here
    select * from GroupsHistory where GroupId = 75661 and memberid = 5  -- level 1

    Again the problematic solution only goes one step down, while the set-based loop finishes at level 1 in both examples.

    In the problematic solution, the code that sets the variables to NULL is in the wrong position, so the code leaves the WHILE loop too early.

    Your Code:

    WHILE EXISTS(
        SELECT h.GroupId, h.MemberId
        FROM dbo.GroupsHistory h
        WHERE h.GroupId = @GroupId
          AND h.MemberId = @MemberId)
    BEGIN
        SELECT @GroupId = PreviousGroupId,
               @MemberId = PreviousMemberId
        FROM GroupsHistory h
        WHERE h.GroupId = @GroupId
          AND h.MemberId = @MemberId;

        IF @GroupId IS NULL
            BREAK;

        UPDATE g SET
            OriginalGroupId = @GroupId,
            OriginalMemberId = @MemberId
        FROM Groups g
        WHERE ClientId = @ClientId;

        SET @GroupId = NULL;  --wrong position
        SET @MemberId = NULL; --wrong position
    END
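
    If I understand the intent correctly, the fix is simply to drop those two SET statements (the IF ... BREAK already ends the walk when the chain runs out), or to move the reset into the outer per-client loop. A sketch of what I mean:

    --Suggested correction (a sketch; the surrounding per-client loop is unchanged)
    WHILE EXISTS(
        SELECT h.GroupId, h.MemberId
        FROM dbo.GroupsHistory h
        WHERE h.GroupId = @GroupId
          AND h.MemberId = @MemberId)
    BEGIN
        SELECT @GroupId = PreviousGroupId,
               @MemberId = PreviousMemberId
        FROM dbo.GroupsHistory h
        WHERE h.GroupId = @GroupId
          AND h.MemberId = @MemberId;

        IF @GroupId IS NULL
            BREAK; --reached the oldest record; the previous UPDATE already stored it

        UPDATE g SET
            OriginalGroupId = @GroupId,
            OriginalMemberId = @MemberId
        FROM dbo.Groups g
        WHERE ClientId = @ClientId;
    END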

    Also, the new test data gives 60,000 records for:


    select * from groups2 g2 inner join groups g1 on g1.ClientId = g2.ClientId where g1.OriginalMemberId <> g2.OriginalMemberId
