Introducing the Set-based Loop

  • Comments posted to this topic are about the item Introducing the Set-based Loop

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Hi Luis,

    Interesting technique, thanks for sharing.

    I must admit I'm surprised the CTE timings were so much higher.

    I'm trying to understand the rules that exist in the sample data using the test script provided.

    As I understand it, the Groups table contains all clients, their current group ID and their associated current membership ID for that group, yes?

    You say that a client can belong to multiple groups, but does that mean a client can belong to multiple groups at the same time, or only at different times (the latter represented by multiple chained GroupsHistory records)?

    Each client will have at least one entry in GroupsHistory, and if they have moved groups the PreviousGroupId and PreviousMemberId fields will be filled in. This forms a self join to the client's previous history record. When we locate the oldest history record (i.e. the one whose previous group/member id is null) we want to take that oldest group/member id pair and record it as the "original" on the Groups table. Is that right?
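
    In other words, I'm reading the "oldest" records as simply the rows with no previous link, something like this (just to check I've understood the schema):

    --My reading of the "oldest" history records: the rows with no previous link at all
    SELECT h.MemberId, h.GroupId, h.StartDate
    FROM dbo.GroupsHistory h
    WHERE h.PreviousGroupId IS NULL
      AND h.PreviousMemberId IS NULL;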

    If I am understanding correctly I would have expected the following assumptions to hold true:

    SELECT COUNT(DISTINCT [ClientId]), COUNT(*)
    FROM [dbo].[Groups];
    --Expected result: both values match
    --Actual result: both values match

    SELECT COUNT(DISTINCT g.[ClientId]), COUNT(*)
    FROM dbo.Groups g
    JOIN dbo.GroupsHistory h ON g.GroupId = h.GroupId AND g.MemberId = h.MemberId;
    --Expected result: both values match
    --Actual result: both values do not match. Indicates many-to-many.

    SELECT MemberId, GroupId, COUNT(ClientId)
    FROM [dbo].[Groups]
    GROUP BY MemberId, GroupId
    HAVING COUNT(*) > 1;
    --Expected result: no results
    --Actual result: 42K results. Also indicates many-to-many.

    How can the original member/group ids be deterministically identified given the apparent many-to-many relationships in the data? Am I missing a rule or something?

    (The answer may well be that I'm missing something due to my current state of illness / lack of sleep! 😉 )

    Thanks,

    Sam

  • Hi Sam,

    Thank you for reading the article and commenting. You're absolutely right: there's an error in the sample data generation. There shouldn't be a many-to-many relationship in this example. Ironically, a very similar problem was present in the real data, but I thought it wasn't worth including in the article.

    Here's the corrected sample data generator, with a change in the loop that inserts levels 2 to 6.

    IF OBJECT_ID('dbo.GroupsHistory') IS NOT NULL --Prevent errors on reruns
        DROP TABLE dbo.GroupsHistory;

    CREATE TABLE dbo.GroupsHistory( --Create the table with relevant columns
        MemberId         int  NOT NULL,
        GroupId          int  NOT NULL,
        PreviousMemberId int  NULL,
        PreviousGroupId  int  NULL,
        StartDate        date NOT NULL,
        Level            int  NOT NULL --This column wasn't available on the original problem. It's here to help with test data generation.
    );

    IF OBJECT_ID('dbo.Groups') IS NOT NULL --Prevent errors on reruns
        DROP TABLE dbo.Groups;

    CREATE TABLE dbo.Groups( --Create the table with relevant columns
        ClientId         int IDENTITY NOT NULL,
        MemberId         int NOT NULL,
        GroupId          int NOT NULL,
        OriginalGroupId  int NULL,
        OriginalMemberId int NULL
    );

    --Insert first level of test data
    WITH E(n) AS(
        SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0))E(n)
    ),
    TestData AS(
        SELECT ABS(CHECKSUM(NEWID())) % 100000 AS CurrentGroupId,
               DATEADD(dd, ABS(CHECKSUM(NEWID())) % 365, '20100101') AS StartDate
        FROM E a, E b, E c, E d, E f, E g
    )
    INSERT INTO GroupsHistory(MemberId, GroupId, StartDate, Level)
    SELECT ROW_NUMBER() OVER(PARTITION BY CurrentGroupId ORDER BY (SELECT NULL)),
           CurrentGroupId,
           StartDate,
           1
    FROM TestData;

    --Generate the next levels for test data
    DECLARE @Level int = 1;

    WHILE @Level < 6
    BEGIN
        WITH SubTestData AS(
            SELECT TOP 50 PERCENT
                   (ABS(CHECKSUM(NEWID())) % 100000) + (@Level * 100000) AS GroupId,
                   MemberId AS PreviousMemberId,
                   GroupId AS PreviousGroupId,
                   DATEADD(dd, ABS(CHECKSUM(NEWID())) % 365, StartDate) AS StartDate,
                   @Level + 1 AS Level
            FROM dbo.GroupsHistory
            WHERE Level = @Level
            ORDER BY NEWID()
        )
        INSERT INTO GroupsHistory(MemberId, GroupId, PreviousMemberId, PreviousGroupId, StartDate, Level)
        SELECT TOP 50 PERCENT
               ROW_NUMBER() OVER(PARTITION BY GroupId ORDER BY (SELECT NULL)),
               GroupId,
               PreviousMemberId,
               PreviousGroupId,
               StartDate,
               Level
        FROM SubTestData;

        SET @Level += 1;
    END;

    INSERT INTO dbo.Groups(GroupId, MemberId)
    SELECT GroupId, MemberId
    FROM dbo.GroupsHistory g
    WHERE NOT EXISTS(
              SELECT 1
              FROM dbo.GroupsHistory h
              WHERE g.GroupId = h.PreviousGroupId
                AND g.MemberId = h.PreviousMemberId);
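
    With this version of the generator, your duplicate check should come back clean. A quick way to verify after regenerating the data (just reusing your third query):

    --Sanity check: this should return no rows with the corrected generator
    SELECT MemberId, GroupId, COUNT(*) AS DuplicateRows
    FROM dbo.Groups
    GROUP BY MemberId, GroupId
    HAVING COUNT(*) > 1;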

    This change improves the performance of the rCTE, making it faster than the RBAR approach when indexes are used. However, I'm curious why the performance of the RBAR solution was negatively affected.

    These timings were taken on a different computer and shouldn't be compared directly to the ones posted in the article.

    Iteration With_Index RBAR_Seconds rCTE_Seconds While_Seconds
    --------- ---------- ------------ ------------ -------------
    1         0          NULL         86           11
    2         0          NULL         75           10
    3         0          NULL         77           11
    4         0          NULL         76           11
    5         0          NULL         74           12
    6         0          NULL         77           11
    7         0          NULL         80           11
    8         0          NULL         73           12
    9         0          NULL         79           11
    10        0          NULL         80           12
    NULL      0          NULL         77           11
    1         1          71           40           11
    2         1          71           42           9
    3         1          76           42           10
    4         1          70           42           10
    5         1          69           44           9
    6         1          74           41           10
    7         1          69           43           10
    8         1          70           43           12
    9         1          74           43           10
    10        1          75           42           10
    NULL      1          71           42           10

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Terrific article, and thanks for it! It nicely shows how plan optimisation influences run-times and efficiency.

  • GoofyGuy (7/27/2015)


    Terrific article, and thanks for it! It nicely shows how plan optimisation influences run-times and efficiency.

    Thank you for the feedback. 🙂

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • One niggle. The looping-set update is not technically RBAR since it is performing groups of set-based operations. Maybe it should be called SBAS. 😀

    I have used this method many a time to perform various tasks - and it is very useful in those cases. Even the use of a cursor is not always EVIL. It's a matter of having tested and documented it, as you instructed.

    Nice job!

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • SQLRNNR (7/27/2015)


    One niggle. The looping-set update is not technically RBAR since it is performing groups of set-based operations.

    That was exactly the point of this article, to differentiate RBAR from set-based loops.

    SQLRNNR (7/27/2015)


    Nice job!

    Thank you.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (7/27/2015)


    SQLRNNR (7/27/2015)


    One niggle. The looping-set update is not technically RBAR since it is performing groups of set-based operations.

    That was exactly the point of this article, to differentiate RBAR from set-based loops.

    Somehow I thought you were referring to the last three solutions as RBAR as well. Maybe it was the wording. Anywho...

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Good work indeed Luis, nicely done!

    😎

  • I also use this technique quite a bit. Besides being faster than the other methods I think it's easier to understand.

  • Hi Luis,

    It was an interesting article, highlighting the perceptions that have built up around RBAR and set-based data traversal in SQL. The main thing that stood out to me is the indexing applied to the table, which makes the traversal quicker because it can get the necessary information from the nonclustered indexes. So, ideally speaking, would any large, well-indexed data set support quick data access even when RBAR is implemented?
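
    For example (this is just my own guess at the kind of index I mean, not necessarily the one used in the article), something along these lines:

    --A hypothetical covering index to support the chain lookups without touching the base table
    CREATE NONCLUSTERED INDEX IX_GroupsHistory_Chain
        ON dbo.GroupsHistory (GroupId, MemberId)
        INCLUDE (PreviousGroupId, PreviousMemberId);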

    Correct me if I am wrong.

    Thanks

    With warm Regards

    Aswin Sankar M

  • Luis Cazares (7/27/2015)


    Hi Sam,

    Thank you for reading the article and commenting. You're absolutely right: there's an error in the sample data generation. There shouldn't be a many-to-many relationship in this example. Ironically, a very similar problem was present in the real data, but I thought it wasn't worth including in the article.

    ...

    This change improves the performance of the rCTE, making it faster than the RBAR approach when indexes are used. However, I'm curious why the performance of the RBAR solution was negatively affected.

    These timings were taken on a different computer and shouldn't be compared directly to the ones posted in the article.

    Hi Luis,

    Thanks for the quick update to the test data generator!

    I couldn't say why the RBAR solution was negatively affected (though I wonder if some index fragmentation might have crept in), but I have made an attempt at improving on the CTE approach.

    I've run out of spare time to be more thorough, but the best I could come up with (using a CTE to do the recursion) takes 18 seconds to execute.

    (This is on a modest VM I have handy, which achieved a similar 17-second result to your own tests for the SBAS approach.)

    But I did resort to some trickery:

    I split the process into two statements within a transaction: the rCTE for the recursion, followed by a simple update for the remainder.

    I also used a different pair of indexes and a join hint to bring the execution time down by a couple more seconds.

    (Note that I managed ~30 second execution using the rCTE for the full data set)

    I note that the test data set this time around resulted in 75% of clients having no history, so this approach is going to slow down (relative to other approaches) depending on the skew of the percentage of clients with history. This is because the simple update is faster than the recursive step.
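
    (A query along these lines shows that split, counting clients whose current membership row has no previous link:)

    --Rough skew check: how many clients' current membership rows have no previous link
    SELECT COUNT(*) AS TotalClients,
           SUM(CASE WHEN h.PreviousMemberId IS NULL THEN 1 ELSE 0 END) AS ClientsWithNoHistory
    FROM dbo.Groups g
    JOIN dbo.GroupsHistory h
      ON h.GroupId = g.GroupId
     AND h.MemberId = g.MemberId;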

    I do agree that your SBAS technique is going to be the most efficient/appropriate for the requirements at hand because it avoids some otherwise unavoidable costs, I just wanted to give the CTE a fighting chance.

    Here's my code:

    CREATE UNIQUE CLUSTERED INDEX UCI_GroupsHistory ON dbo.GroupsHistory(MemberId ASC, GroupId ASC);
    CREATE UNIQUE CLUSTERED INDEX CI_Groups ON dbo.Groups(MemberId ASC, GroupId ASC);

    DECLARE
        @SecondsrCTE int,
        @Time datetime;

    --Set the timer
    SET @Time = GETDATE();

    BEGIN TRAN;

    WITH rCTE_TD AS (
        SELECT
            GH_c.MemberId AS JoinMemberId,
            GH_c.GroupId AS JoinGroupId,
            GH_c.MemberId AS LevelMemberId,
            GH_c.GroupId AS LevelGroupId,
            GH_c.PreviousMemberId,
            GH_c.PreviousGroupId
        FROM dbo.GroupsHistory GH_c
        WHERE EXISTS (
                  SELECT 1
                  FROM dbo.Groups G
                  WHERE G.MemberId = GH_c.MemberId
                    AND G.GroupId = GH_c.GroupId
              )
          AND GH_c.PreviousMemberId IS NOT NULL

        UNION ALL

        SELECT
            R.JoinMemberId,
            R.JoinGroupId,
            GH_p.MemberId,
            GH_p.GroupId,
            GH_p.PreviousMemberId,
            GH_p.PreviousGroupId
        FROM rCTE_TD R
        INNER JOIN dbo.GroupsHistory GH_p
            ON R.PreviousMemberId = GH_p.MemberId
           AND R.PreviousGroupId = GH_p.GroupId
    )
    UPDATE G
    SET G.OriginalGroupId = rCTE_TD.LevelGroupId,
        G.OriginalMemberId = rCTE_TD.LevelMemberId
    FROM Groups G
    INNER HASH JOIN rCTE_TD
        ON rCTE_TD.JoinMemberId = G.MemberId
       AND rCTE_TD.JoinGroupId = G.GroupId
    WHERE rCTE_TD.PreviousMemberId IS NULL; --Assume these are the originals

    UPDATE G
    SET G.OriginalGroupId = GH.GroupId,
        G.OriginalMemberId = GH.MemberId
    FROM dbo.GroupsHistory GH
    INNER JOIN dbo.Groups G
        ON G.MemberId = GH.MemberId
       AND G.GroupId = GH.GroupId
       AND GH.PreviousMemberId IS NULL;

    COMMIT TRAN;

    --Get the time in seconds
    SET @SecondsrCTE = DATEDIFF(ss, @Time, GETDATE());

    SELECT @SecondsrCTE;

    If I get some more time I'll come back and post comparison results using the test harness.

    Thanks,

    Sam

  • Luis,

    Why doesn't the GroupsHistory table have a ClientId column?

    Why this weird hierarchy, which really isn't one? It's more like a series.

    Thanks for your great article. 🙂

    I had a problem with three WHILE loops and was searching for a better solution.

    Instead of using “The problematic solution”, I now use your “Set-based Loop”. Thanks for bringing this idea back to mind. 🙂

    But I have some trouble with your article.

    First point (this relates to the comment from sam.dahl):

    Code:

    select COUNT([Level]), MemberId, GroupId
    from GroupsHistory
    group by MemberId, GroupId
    order by 1 desc

    I got some combinations of MemberId and GroupId where the count was greater than 1. I selected one of them to understand your code.

    select * from GroupsHistory where GroupId=248498 and memberid=2

    MemberId GroupId PreviousMemberId PreviousGroupId StartDate  Level
    2        248498  15               134193          2012-05-04 3
    2        248498  1                135215          2010-12-25 3
    2        248498  2                181500          2012-01-13 3
    2        248498  5                138108          2011-08-21 3
    2        248498  8                136476          2011-09-20 3
    2        248498  6                162972          2011-08-11 3
    2        248498  3                182860          2011-11-26 3
    2        248498  11               172996          2011-12-25 3

    select * from GroupsHistory where PreviousGroupId=248498 and PreviousMemberId=2

    MemberId GroupId PreviousMemberId PreviousGroupId StartDate  Level
    6        324965  2                248498          2012-03-27 4
    4        324781  2                248498          2011-12-29 4
    10       309789  2                248498          2011-12-16 4
    3        314445  2                248498          2012-11-08 4
    8        323875  2                248498          2012-08-07 4
    9        308775  2                248498          2012-07-03 4

    If I have a client with MemberId = 6 and GroupId = 324965, the previous level is MemberId = 2 and GroupId = 248498.

    But which of these rows gives the previous-previous MemberId or previous-previous GroupId in “The problematic solution” or the “Set-based Loop”?

    Second point: instead of using the table “Groups” in the “Set-based Loop”, I created a table Groups2. Both solutions should give the same result set. But when I try this

    select * from groups2 g2 inner join groups g1 on g1.ClientId = g2.ClientId where g1.OriginalMemberId <> g2.OriginalMemberId

    I got 116,000 records, so there are some problems in the data. I can point out two problems (sbl = Set-based Loop; ps = problematic solution):

    1. Result set for client 9015 (the Groups2/sbl columns first, then the Groups/ps columns):

       g2: ClientId 9015, MemberId 5, GroupId 215869, OriginalGroupId 56615, OriginalMemberId 5
       g1: ClientId 9015, MemberId 5, GroupId 215869, OriginalGroupId 129218, OriginalMemberId 12

    select * from GroupsHistory where GroupId = 215869 and memberid = 5  -- both solutions, level 3
    select * from GroupsHistory where GroupId = 132273 and memberid = 5  -- level 2; sbl
    select * from GroupsHistory where GroupId = 129218 and memberid = 12 -- level 2; ps finished here
    select * from GroupsHistory where GroupId = 56615 and memberid = 5   -- level 1; sbl finished here
    select * from GroupsHistory where GroupId = 98116 and memberid = 12  -- level 1; ps never reaches this (why?)

    The two solutions use different branches of the history table. The problematic solution only goes one step down.

    2. Result set for client 530537 (the Groups2/sbl columns first, then the Groups/ps columns):

       g2: ClientId 530537, MemberId 8, GroupId 310674, OriginalGroupId 47018, OriginalMemberId 5
       g1: ClientId 530537, MemberId 8, GroupId 310674, OriginalGroupId 249062, OriginalMemberId 1

    select * from GroupsHistory where GroupId = 310674 and memberid = 8 -- level 4
    select * from GroupsHistory where GroupId = 249062 and memberid = 1 -- level 3; ps finished here
    select * from GroupsHistory where GroupId = 148545 and memberid = 5 -- two rows at level 2
    select * from GroupsHistory where GroupId = 47018 and memberid = 5  -- level 1; sbl finished here
    select * from GroupsHistory where GroupId = 75661 and memberid = 5  -- level 1

    Again the problematic solution only goes one step down, while the set-based loop finishes at level 1 in both examples.

    In the problematic solution, the code that sets the variables to NULL is in the wrong position, so the code leaves the WHILE loop too early.

    Your Code:

    WHILE EXISTS(
        SELECT h.GroupId, h.MemberId
        FROM dbo.GroupsHistory h
        WHERE h.GroupId = @GroupId
          AND h.MemberId = @MemberId)
    BEGIN
        SELECT @GroupId = PreviousGroupId,
               @MemberId = PreviousMemberId
        FROM GroupsHistory h
        WHERE h.GroupId = @GroupId
          AND h.MemberId = @MemberId;

        IF @GroupId IS NULL
            BREAK;

        UPDATE g SET
            OriginalGroupId = @GroupId,
            OriginalMemberId = @MemberId
        FROM Groups g
        WHERE ClientId = @ClientId;

        SET @GroupId = NULL;  --wrong position
        SET @MemberId = NULL; --wrong position
    END
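
    If I understand the intent correctly, the fix is simply to drop those two SET statements (the IF ... BREAK already ends the walk when the chain runs out), or to move the reset into the outer per-client loop. A sketch of what I mean:

    --Suggested correction (a sketch; the surrounding per-client loop is unchanged)
    WHILE EXISTS(
        SELECT h.GroupId, h.MemberId
        FROM dbo.GroupsHistory h
        WHERE h.GroupId = @GroupId
          AND h.MemberId = @MemberId)
    BEGIN
        SELECT @GroupId = PreviousGroupId,
               @MemberId = PreviousMemberId
        FROM dbo.GroupsHistory h
        WHERE h.GroupId = @GroupId
          AND h.MemberId = @MemberId;

        IF @GroupId IS NULL
            BREAK; --reached the oldest record; the previous UPDATE already stored it

        UPDATE g SET
            OriginalGroupId = @GroupId,
            OriginalMemberId = @MemberId
        FROM dbo.Groups g
        WHERE ClientId = @ClientId;
    END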

    Also, the new test data gives 60,000 records for:


    select * from groups2 g2 inner join groups g1 on g1.ClientId = g2.ClientId where g1.OriginalMemberId <> g2.OriginalMemberId
