Random Assignment of Groups for Simulation

  • I am working on a simulation which takes 120-300 records, randomly assigns them to 10 arbitrary groups, and then performs a statistical analysis. The statistical analysis is not problematic, but the random group assignment requires some additional work. Note: the groups must be evenly distributed, i.e., each group should have the same number of records in it. For example, if there were 100 records and ten groups, each group would have ten records.  I would also like to use the best random number generator possible. Finally, this portion of the simulation exercise will be repeated for 50K to 1M iterations (TBD). As such, having the most performant solution would be best.

    My current solution is to assign a random number to each row, sort the recordset on the random number, and assign the groups based on rank. I can only imagine that there is a better solution.
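    In rough form, that looks something like the following (a simplified sketch; #Work is just a scratch table and master.dbo.sysobjects stands in for the real 120-300 records):

    -- Tag each row with a random sort key, rank on it, and carve the
    -- ranks into equal-sized groups.
    IF OBJECT_ID('TempDB..#Work') IS NOT NULL DROP TABLE #Work

    SELECT TOP 150 IDENTITY(INT,1,1) AS RecNo
      INTO #Work
      FROM master.dbo.sysobjects

    DECLARE @Groups INT, @GroupSize INT
    SET @Groups = 10
    SELECT @GroupSize = COUNT(*) / @Groups FROM #Work

    -- Ranks 1..@GroupSize go to group 1, the next @GroupSize to group 2,
    -- and so on.  When the count is not evenly divisible, the leftover
    -- rows spill into an extra bucket and still need to be dealt with,
    -- which is part of the "additional work" mentioned above.
    SELECT RecNo,
           (ROW_NUMBER() OVER (ORDER BY NEWID()) - 1) / @GroupSize + 1 AS GroupID
      FROM #Work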

    I have found this article insightful:

    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnsqlpro04/html/sp04c1.asp

    The above article also highlights a key problem with the Rand() function in that Rand() is only evaluated once per query AND that Rand() cannot be used in a UDF. Even in SQL Server 2005, use of Rand() in a UDF returns the following: "Invalid use of side-effecting or time-dependent operator in 'rand' within a function." I have not experimented with a .Net assembly yet.
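    A quick way to see the single-evaluation behavior (the first query below returns the identical value on every row, the second does not):

    -- RAND() is evaluated once per statement, so every row gets the
    -- same "random" value...
    SELECT TOP 5 name, RAND() AS SameEveryRow
      FROM master.dbo.sysobjects

    -- ...whereas NEWID() produces a different value on every row.
    SELECT TOP 5 name, NEWID() AS DifferentEveryRow
      FROM master.dbo.sysobjects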

    I also did not know if it would be worthwhile to use the SSIS "Row Sampling Transformation" in an iterative fashion to perform the sampling. The only catch is that all records have to be assigned to a group. Therefore, only picking one sample does not help.

    Any ideas?

    Note: I am using SQL Server 2005 and can also make use of .Net CLR.

  • Here's a self-standing sample that you may run with impunity...

    This turned out to be a bit long-winded because an ORDER BY in a sub-query of an UPDATE just doesn't control which rows get picked when you filter on IS NULL.  So, I had to resort to a temporary table to hold the randomized sample...

    In order to make the "perfect" distribution you wanted, given that the number of records may not be evenly divisible by the number of groups, I had to make two distributions... one for the "perfect" distribution, and one for the stragglers.

    Sorry it's not 100% loopless, although each loop (1 for each group) is set based.  It's still pretty fast.  I haven't played with some of the new ranking features of 2005 (there's an untested NTILE() sketch after the code below), but this may give you an idea...

    I sure do hope this is what you meant... lemme know, eh?

    --========================================================
    --      Setup for the experiment
    --========================================================
    --===== Suppress the auto-display of rowcounts for appearance
        SET NOCOUNT ON

    --===== Make sure the temp table doesn't exist so this can
         -- be run more than once
         IF OBJECT_ID('TempDB..#Records') IS NOT NULL
            DROP TABLE #Records

    --===== This simulates some group of records from 120-300.
         -- I use temp tables for this type of experiment.
         -- You would use permanent tables.
     SELECT TOP 123  --<< Change this number to simulate diff group sizes
            IDENTITY(INT,1,1) AS RecNo,
            CAST(NULL AS INT) AS GroupID
       INTO #Records
       FROM Master.dbo.SysComments sc1,
            Master.dbo.SysComments sc2

    --========================================================
    --      All set... demo the solution
    --      This pass makes the perfect distribution.  If the
    --      number of records is not an even multiple of the
    --      number of groups, the remaining records are NOT
    --      assigned until the second pass below...
    --========================================================
    --===== Create a "sample" table to hold random RecNo's
         -- that will be updated with the same GroupID.
         -- This is necessary because including an ORDER BY
         -- in a sub-query of an update is basically useless
         -- when you filter on IS NULL... it always returns
         -- things in order by RecNo for some reason.
         IF OBJECT_ID('TempDB..#Sample') IS NOT NULL
            DROP TABLE #Sample

     CREATE TABLE #Sample (RecNo INT)

    --===== Create a variable to hold the number of groups
    DECLARE @Groups INT --<< parameterize this if making a proc
        SET @Groups = 10

    --===== Determine the "perfect" number of records for each group
    DECLARE @GroupSize INT
     SELECT @GroupSize = COUNT(*)/@Groups
       FROM #Records

    --===== This is our "distribution/group" counter
    DECLARE @CurrentGroup INT

    --===== This forces each update to only touch a certain
         -- number of records based on the group size
        SET ROWCOUNT @GroupSize

    --===== Perform the "perfect" distribution randomly
        SET @CurrentGroup = 1
      WHILE @CurrentGroup <= @Groups
      BEGIN
            --===== Create a random sample of unassigned records
           TRUNCATE TABLE #Sample
             INSERT INTO #Sample (RecNo)
             SELECT RecNo
               FROM #Records
              WHERE GroupID IS NULL
              ORDER BY NEWID()

            --===== Give the sampled records the current GroupID
             UPDATE r
                SET GroupID = @CurrentGroup
               FROM #Records r
               JOIN #Sample s
                 ON r.RecNo = s.RecNo

        SET @CurrentGroup = @CurrentGroup + 1
      END

    --========================================================
    --      This picks up the rest of the records... there
    --      will always be fewer than @Groups of them, so
    --      each record must be given a different GroupID.
    --      The lower-numbered groups are favored here.
    --========================================================
    --===== This forces the update to only affect 1 row per
         -- loop
        SET ROWCOUNT 1

    --===== Perform the remaining distribution, 1 rec per group,
         -- randomly by RecNo, favoring the lower-numbered groups.
        SET @CurrentGroup = 1
      WHILE @CurrentGroup <= @Groups
      BEGIN
            --===== Create the random sample
           TRUNCATE TABLE #Sample
             INSERT INTO #Sample (RecNo)
             SELECT RecNo
               FROM #Records
              WHERE GroupID IS NULL
              ORDER BY NEWID()

            --===== Distribute the random sample
             UPDATE r
                SET GroupID = @CurrentGroup
               FROM #Records r
               JOIN #Sample s
                 ON r.RecNo = s.RecNo

        SET @CurrentGroup = @CurrentGroup + 1
      END

    --===== All done... put things back to normal
        SET ROWCOUNT 0

    --===== Display the membership count by GroupID
     SELECT 'This is the distribution of GroupID''s.'

     SELECT GroupID, COUNT(*) AS MembershipCount
       FROM #Records
      GROUP BY GroupID
      ORDER BY GroupID

    --===== Display the records and the assigned groups
     SELECT 'Here are the records and the randomized distributed groups.'

     SELECT * FROM #Records
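    Just as an untested thought, one of those 2005 ranking features, NTILE(), might do the whole distribution against the same #Records table in a single statement... leftover rows land in the lower-numbered groups, same as the loops above:

    --===== Untested sketch... NTILE() deals the randomized order into
         -- 10 buckets and puts any extra rows in the lower groups.
     UPDATE r
        SET GroupID = t.GroupID
       FROM #Records r
       JOIN (SELECT RecNo,
                    NTILE(10) OVER (ORDER BY NEWID()) AS GroupID
               FROM #Records) t
         ON r.RecNo = t.RecNo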

    p.s.  The "Best" random number generator is the NEWID() function.

    --Jeff Moden



  • I do agree with Jeff that for performance's sake you should use NEWID(), but if you need randomness quality, RAND() is the way to go.

    You and that article said:

    "The above article also highlights a key problem with the Rand() function in that Rand() is only evaluated once per query AND that Rand() cannot be used in a UDF"

    That is not entirely accurate.

    You can use this to trick SQL into evaluating RAND() the way you want:

    1. Set up a view like:

    create view vwRand as
    select rand() r

    2. Create a UDF from that view:

    create function dbo.myrand()
    returns float
    as
    begin
        return(select r from vwRand)
    end

    3. Just try it:

    select name, dbo.myrand() Rnd
    from master.dbo.sysobjects
    order by name

    Mind you, this is not very fast. If you're using SQL 2005, retrieving the random number from a CLR UDF will improve performance by orders of magnitude.

    Good Luck,

    * Noel

  • That's cool...

    You can also use NEWID() as the seed for Rand...  in the algo below, "RANGE" is how many numbers you want in the range of random numbers and "OFFSET" is what the starting number should be...

    SELECT RAND(CAST(NEWID() AS VARBINARY))*range+offset
      FROM sometable

    For random numbers from 200 up to (but not including) 300, for example, you would use...

    SELECT RAND(CAST(NEWID() AS VARBINARY))*100+200
      FROM sometable
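    If you need whole numbers rather than fractional values, wrap it in FLOOR... for integers from 200 to 299, something like:

    SELECT FLOOR(RAND(CAST(NEWID() AS VARBINARY))*100)+200
      FROM sometable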

    --Jeff Moden



  • Is this sufficient?

    DECLARE @NumberOfGroups INT
    SET @NumberOfGroups = 10   -- split from the DECLARE so it also runs on 2005

    SELECT a.[name],
           a.Rn % @NumberOfGroups + 1 AS Grp
      FROM (SELECT [name],
                   ROW_NUMBER() OVER (ORDER BY NEWID()) AS Rn
              FROM sys.objects) AS a
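    And applied to the #Records table from the earlier post (assuming that layout, with a RecNo column and a NULL GroupID column), the same idea as an UPDATE might look something like this... group sizes end up within one row of each other:

    DECLARE @NumberOfGroups INT
    SET @NumberOfGroups = 10

    -- Rank the records in random order, then deal them out round-robin.
    UPDATE r
       SET GroupID = a.Rn % @NumberOfGroups + 1
      FROM #Records r
      JOIN (SELECT RecNo,
                   ROW_NUMBER() OVER (ORDER BY NEWID()) AS Rn
              FROM #Records) AS a
        ON r.RecNo = a.RecNo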