Random Assignment of Groups for Simulation

  • I am working on a simulation which takes 120-300 records, randomly assigns them to 10 arbitrary groups, and then performs a statistical analysis. The statistical analysis is not problematic, but the random group assignment requires some additional work. Note: the groups must be evenly distributed, i.e., each group should have the same number of records in it. For example, if there were 100 records and ten groups, each group would have ten records.  I would also like to use the best random number generator possible. Finally, this portion of the simulation exercise will be repeated for 50K to 1M iterations (TBD). As such, having the most performant solution would be best.

    My current solution is to assign a random number to each row, sort the recordset on the random number, and assign the groups based on rank. I can only imagine that there is a better solution.
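    In rough form, that looks something like the following (a simplified sketch; #Work is just a scratch table and master.dbo.sysobjects stands in for the real 120-300 records):

    -- Tag each row with a random sort key, rank on it, and carve the
    -- ranks into equal-sized groups.
    IF OBJECT_ID('TempDB..#Work') IS NOT NULL DROP TABLE #Work

    SELECT TOP 150 IDENTITY(INT,1,1) AS RecNo
      INTO #Work
      FROM master.dbo.sysobjects

    DECLARE @Groups INT, @GroupSize INT
    SET @Groups = 10
    SELECT @GroupSize = COUNT(*) / @Groups FROM #Work

    -- Ranks 1..@GroupSize go to group 1, the next @GroupSize to group 2,
    -- and so on.  When the count is not evenly divisible, the leftover
    -- rows spill into an extra bucket and still need to be dealt with,
    -- which is part of the "additional work" mentioned above.
    SELECT RecNo,
           (ROW_NUMBER() OVER (ORDER BY NEWID()) - 1) / @GroupSize + 1 AS GroupID
      FROM #Work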

    I have found this article insightful:

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsqlpro04/html/sp04c1.asp

    The above article also highlights a key problem with the Rand() function in that Rand() is only evaluated once per query AND that Rand() cannot be used in a UDF. Even in SQL Server 2005, use of Rand() in a UDF returns the following: "Invalid use of side-effecting or time-dependent operator in 'rand' within a function." I have not experimented with a .Net assembly yet.
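    A quick way to see the single-evaluation behavior (the first query below returns the identical value on every row, the second does not):

    -- RAND() is evaluated once per statement, so every row gets the
    -- same "random" value...
    SELECT TOP 5 name, RAND() AS SameEveryRow
      FROM master.dbo.sysobjects

    -- ...whereas NEWID() produces a different value on every row.
    SELECT TOP 5 name, NEWID() AS DifferentEveryRow
      FROM master.dbo.sysobjects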

    I also did not know if it would be worthwhile to use the SSIS "Row Sampling Transformation" in an iterative fashion to perform the sampling. The only catch is that all records have to be assigned to a group. Therefore, only picking one sample does not help.

    Any ideas?

    Note: I am using SQL Server 2005 and can also make use of .Net CLR.

  • Here's a self-standing sample that you may run with impunity...

    This turned out to be a bit long-winded because an ORDER BY in a sub-query of an UPDATE just doesn't control which rows get picked when you filter on IS NULL.  So, I had to resort to a temporary table to hold the randomized sample...

    In order to make the "perfect" distribution you wanted, given that the number of records may not be evenly divisible by the number of groups, I had to make two distributions... one for the "perfect" distribution, and one for the stragglers.

    Sorry it's not 100% loopless, although each loop (1 for each group) is set based.  It's still pretty fast.  I haven't played with some of the new ranking features of 2005 (there's an untested NTILE() sketch after the code below), but this may give you an idea...

    I sure do hope this is what you meant... lemme know, eh?

    --========================================================
    --      Setup for the experiment
    --========================================================
    --===== Suppress the auto-display of rowcounts for appearance
        SET NOCOUNT ON

    --===== Make sure the temp table doesn't exist so this can
         -- be run more than once
         IF OBJECT_ID('TempDB..#Records') IS NOT NULL
            DROP TABLE #Records

    --===== This simulates some group of records from 120-300.
         -- I use temp tables for this type of experiment.
         -- You would use permanent tables.
     SELECT TOP 123  --<< Change this number to simulate diff group sizes
            IDENTITY(INT,1,1) AS RecNo,
            CAST(NULL AS INT) AS GroupID
       INTO #Records
       FROM Master.dbo.SysComments sc1,
            Master.dbo.SysComments sc2

    --========================================================
    --      All set... demo the solution
    --      This pass makes the perfect distribution.  If the
    --      number of records is not an even multiple of the
    --      number of groups, the remaining records are NOT
    --      assigned until the second pass below...
    --========================================================
    --===== Create a "sample" table to hold random RecNo's
         -- that will be updated with the same GroupID.
         -- This is necessary because including an ORDER BY
         -- in a sub-query of an update is basically useless
         -- when you filter on IS NULL... it always returns
         -- things in order by RecNo for some reason.
         IF OBJECT_ID('TempDB..#Sample') IS NOT NULL
            DROP TABLE #Sample

     CREATE TABLE #Sample (RecNo INT)

    --===== Create a variable to hold the number of groups
    DECLARE @Groups INT --<< parameterize this if making a proc
        SET @Groups = 10

    --===== Determine the "perfect" number of records for each group
    DECLARE @GroupSize INT
     SELECT @GroupSize = COUNT(*)/@Groups
       FROM #Records

    --===== This is our "distribution/group" counter
    DECLARE @CurrentGroup INT

    --===== This forces each update to only touch a certain
         -- number of records based on the group size
        SET ROWCOUNT @GroupSize

    --===== Perform the "perfect" distribution randomly
        SET @CurrentGroup = 1
      WHILE @CurrentGroup <= @Groups
      BEGIN
            --===== Create a random sample of unassigned records
           TRUNCATE TABLE #Sample
             INSERT INTO #Sample (RecNo)
             SELECT RecNo
               FROM #Records
              WHERE GroupID IS NULL
              ORDER BY NEWID()

            --===== Give the sampled records the current GroupID
             UPDATE r
                SET GroupID = @CurrentGroup
               FROM #Records r
               JOIN #Sample s
                 ON r.RecNo = s.RecNo

        SET @CurrentGroup = @CurrentGroup + 1
      END

    --========================================================
    --      This picks up the rest of the records... there
    --      will always be fewer than @Groups of them, so
    --      each record must be given a different GroupID.
    --      The lower-numbered groups are favored here.
    --========================================================
    --===== This forces the update to only affect 1 row per
         -- loop
        SET ROWCOUNT 1

    --===== Perform the remaining distribution, 1 rec per group,
         -- randomly by RecNo, favoring the lower-numbered groups.
        SET @CurrentGroup = 1
      WHILE @CurrentGroup <= @Groups
      BEGIN
            --===== Create the random sample
           TRUNCATE TABLE #Sample
             INSERT INTO #Sample (RecNo)
             SELECT RecNo
               FROM #Records
              WHERE GroupID IS NULL
              ORDER BY NEWID()

            --===== Distribute the random sample
             UPDATE r
                SET GroupID = @CurrentGroup
               FROM #Records r
               JOIN #Sample s
                 ON r.RecNo = s.RecNo

        SET @CurrentGroup = @CurrentGroup + 1
      END

    --===== All done... put things back to normal
        SET ROWCOUNT 0

    --===== Display the membership count by GroupID
     SELECT 'This is the distribution of GroupID''s.'

     SELECT GroupID, COUNT(*) AS MembershipCount
       FROM #Records
      GROUP BY GroupID
      ORDER BY GroupID

    --===== Display the records and the assigned groups
     SELECT 'Here are the records and the randomized distributed groups.'

     SELECT * FROM #Records
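    Just as an untested thought, one of those 2005 ranking features, NTILE(), might do the whole distribution against the same #Records table in a single statement... leftover rows land in the lower-numbered groups, same as the loops above:

    --===== Untested sketch... NTILE() deals the randomized order into
         -- 10 buckets and puts any extra rows in the lower groups.
     UPDATE r
        SET GroupID = t.GroupID
       FROM #Records r
       JOIN (SELECT RecNo,
                    NTILE(10) OVER (ORDER BY NEWID()) AS GroupID
               FROM #Records) t
         ON r.RecNo = t.RecNo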

    p.s.  The "Best" random number generator is the NEWID() function.

    --Jeff Moden



  • I do agree with Jeff that for performance's sake you should use NEWID(), but if you need randomness quality, RAND() is the way to go.

    You and that article said:

    "The above article also highlights a key problem with the Rand() function in that Rand() is only evaluated once per query AND that Rand() cannot be used in a UDF"

    That is not entirely accurate.

    You can use this to trick SQL into evaluating RAND() the way you want:

    1. Set up a view like:

    create view vwRand as
    select rand() r

    2. Create a UDF from that view:

    create function dbo.myrand()
    returns float
    as
    begin
        return(select r from vwRand)
    end

    3. Just try it:

    select name, dbo.myrand() Rnd
    from master.dbo.sysobjects
    order by name

    Mind you, this is not very fast. If you're using SQL 2005, retrieving the random number from a CLR UDF will improve performance by orders of magnitude.

    Good Luck,

    * Noel

  • That's cool...

    You can also use NEWID() as the seed for Rand...  in the algo below, "RANGE" is how many numbers you want in the range of random numbers and "OFFSET" is what the starting number should be...

    SELECT RAND(CAST(NEWID() AS VARBINARY))*range+offset
      FROM sometable

    For random numbers from 200 up to (but not including) 300, for example, you would use...

    SELECT RAND(CAST(NEWID() AS VARBINARY))*100+200
      FROM sometable
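    If you need whole numbers rather than fractional values, wrap it in FLOOR... for integers from 200 to 299, something like:

    SELECT FLOOR(RAND(CAST(NEWID() AS VARBINARY))*100)+200
      FROM sometable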

    --Jeff Moden



  • Is this sufficient?

    DECLARE @NumberOfGroups INT
    SET @NumberOfGroups = 10   -- split from the DECLARE so it also runs on 2005

    SELECT a.[name],
           a.Rn % @NumberOfGroups + 1 AS Grp
      FROM (SELECT [name],
                   ROW_NUMBER() OVER (ORDER BY NEWID()) AS Rn
              FROM sys.objects) AS a
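    And applied to the #Records table from the earlier post (assuming that layout, with a RecNo column and a NULL GroupID column), the same idea as an UPDATE might look something like this... group sizes end up within one row of each other:

    DECLARE @NumberOfGroups INT
    SET @NumberOfGroups = 10

    -- Rank the records in random order, then deal them out round-robin.
    UPDATE r
       SET GroupID = a.Rn % @NumberOfGroups + 1
      FROM #Records r
      JOIN (SELECT RecNo,
                   ROW_NUMBER() OVER (ORDER BY NEWID()) AS Rn
              FROM #Records) AS a
        ON r.RecNo = a.RecNo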