select random records but for each category

  • I have a request to pull out random records from one table but atleast 3 records from each category without exceeding 25 records for all.

    I know how to pull out the random records, but how can I pull out 3 random for each category?

    example:

    table1 has 100 records

    3 categories, (car, truck, RV)

    I need to pull out car records, truck records, and RV records, all random and no more then 25 in total and no less then 3 from each category,

    is something like this possible with using 1 table like I have? I'm currently pulling a random record set from this table, but now I have to break it down by categories

    this is the query i'm using to pull out the top 10 percent, no more then 25 and I'm updated an isSelected Field so its not picked again. How can I take this one step further and bring back at least 3 for each category?

    SET ROWCOUNT 25

    INSERT INTO @SelectedId

    SELECT TOP 10 percent cId

    FROM CarInformation

    where salesman= @SalesPerson and isSelected is Null and DateClosed is NOT NULL and [quarter] = @Qtr and [Year] = @Year

    ORDER BY NEWID()

    UPDATE CarInformation

    SET isSelected = 1

    FROM CarInformation AS F

    INNER JOIN @SelectedId AS S

    ON F.cId = S.cId

    SELECT * FROM CarInformation AS F

    INNER JOIN @SelectedId AS S ON F.clmId = S.cId

  • SELECT <column list> FROM (

    SELECT <column list>, ROW_NUMBER() OVER (Partition By Category Order By (SELECT NEWID())) AS RowNo

    FROM Table1) sub

    WHERE rowNo <=3

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hmmmm.... this problem is coming up more and more... must be a new homework problem by a given teacher in some school or maybe a favorite "take home" interview question by a certain DBA.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • nope no homework, no interview questions, its an actualy business requirement for this application that was once an access app and everything was done manually, now i'm automating it and I need to duplicate the selecting of random records, where before a user went in and select 5 from one, 5 from another and so on for a total of 25, now I need to automate that portion of the new tool.

  • As far as I can see, the previously posted code doesn't meet the requirement to select a minimum of three rows, a maximum of 25, and 10% of the group size otherwise. It is also rather inefficient for larger group sizes, since NEWID() is evaluated for every row, and the entire set must be sorted on that value.

    Here is my attempt, which selects the required rows from 100,000 random records (split into 3 groups) in 93ms on my old laptop.

    -- Drop our test table if it exists

    IF OBJECT_ID(N'tempdb..#Source', N'U')

    IS NOT NULL

    DROP TABLE #Source;

    GO

    -- Test table

    CREATE TABLE #Source

    (

    row_id INTEGER IDENTITY PRIMARY KEY,

    vehicle_type CHAR(1) NOT NULL,

    vehicle_id INTEGER NOT NULL

    );

    GO

    -- Add 100,000 random rows, roughly evenly distributed

    -- across vehicle types

    INSERT #Source

    (

    vehicle_type,

    vehicle_id

    )

    SELECT TOP (100000)

    CASE

    -- Chance of being a car

    WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'C'

    -- Chance of being a truck

    WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'T'

    -- Chance of being an RV

    ELSE 'R'

    END,

    -- Random vehicle id

    ABS(CHECKSUM(NEWID()))

    FROM master.sys.columns C1,

    master.sys.columns C2,

    master.sys.columns C3;

    GO

    -- A helpful index

    CREATE INDEX nc1

    ON #Source

    (vehicle_type ASC, row_id ASC)

    INCLUDE (vehicle_id)

    WITH (FILLFACTOR = 100, MAXDOP = 1);

    GO

    -- Show the number of vehicles of each type

    SELECT vehicle_type,

    row_count = COUNT_BIG(*)

    FROM #Source

    GROUP BY

    vehicle_type

    ORDER BY

    vehicle_type;

    GO

    SET STATISTICS TIME ON;

    -- Select a random sample of rows from each group

    -- Minimum 3 rows, maximum 25, 10% of the group size othewise

    WITH SampleSize

    AS (

    -- Find the total number of vehicles of each type

    -- and the sample size to use for each group

    -- (3 rows, 25 rows, or 10%)

    SELECT vehicle_type,

    group_size = COUNT_BIG(*),

    sample_size =

    CASE

    WHEN 0.1 * COUNT_BIG(*) > 25 THEN 25

    WHEN 0.1 * COUNT_BIG(*) < 03 THEN 03

    ELSE CONVERT(BIGINT, 0.1 * COUNT_BIG(*))

    END

    FROM #Source

    GROUP BY

    vehicle_type

    ),

    RandomRows

    AS (

    -- Generate a random row number for each row required from

    -- each group. Maximum rows returned = 25 * group_count

    --

    -- The TOP (9223372036854775807 (= BIGINT.Max)) is just to

    -- force the query plan to calculate the random row number

    -- before the final join

    SELECT TOP (9223372036854775807)

    SS.vehicle_type,

    Random.rn

    FROM SampleSize SS

    CROSS

    APPLY (

    SELECT TOP (SS.sample_size)

    rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), C1.[object_id])) * group_size + 1)

    FROM master.sys.columns C1,

    master.sys.columns C2

    ) Random

    ORDER BY

    SS.vehicle_type,

    Random.rn

    ),

    NumberedSourceRows

    AS (

    -- Number the source rows, restarting the numbering for each group

    SELECT S.row_id,

    S.vehicle_type,

    S.vehicle_id,

    rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)

    FROM #Source S

    )

    -- Fetch the numbered rows that match the random row number, per group

    SELECT NSR.row_id,

    NSR.vehicle_type,

    NSR.vehicle_id

    FROM RandomRows RR

    JOIN NumberedSourceRows NSR

    ON NSR.vehicle_type = RR.vehicle_type

    AND NSR.rn = RR.rn;

    SET STATISTICS TIME OFF;

    GO

    IF OBJECT_ID(N'tempdb..#Source', N'U')

    IS NOT NULL

    DROP TABLE #Source;

    GO

    Paul

  • Paul White (3/12/2010)


    As far as I can see, the previously posted code doesn't meet the requirement to select a minimum of three rows, a maximum of 25, and 10% of the group size otherwise. It is also rather inefficient for larger group sizes, since NEWID() is evaluated for every row, and the entire set must be sorted on that value.

    Here is my attempt, which selects the required rows from 100,000 random records (split into 3 groups) in 93ms on my old laptop.

    -- Drop our test table if it exists

    IF OBJECT_ID(N'tempdb..#Source', N'U')

    IS NOT NULL

    DROP TABLE #Source;

    GO

    -- Test table

    CREATE TABLE #Source

    (

    row_id INTEGER IDENTITY PRIMARY KEY,

    vehicle_type CHAR(1) NOT NULL,

    vehicle_id INTEGER NOT NULL

    );

    GO

    -- Add 100,000 random rows, roughly evenly distributed

    -- across vehicle types

    INSERT #Source

    (

    vehicle_type,

    vehicle_id

    )

    SELECT TOP (100000)

    CASE

    -- Chance of being a car

    WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'C'

    -- Chance of being a truck

    WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'T'

    -- Chance of being an RV

    ELSE 'R'

    END,

    -- Random vehicle id

    ABS(CHECKSUM(NEWID()))

    FROM master.sys.columns C1,

    master.sys.columns C2,

    master.sys.columns C3;

    GO

    -- A helpful index

    CREATE INDEX nc1

    ON #Source

    (vehicle_type ASC, row_id ASC)

    INCLUDE (vehicle_id)

    WITH (FILLFACTOR = 100, MAXDOP = 1);

    GO

    -- Show the number of vehicles of each type

    SELECT vehicle_type,

    row_count = COUNT_BIG(*)

    FROM #Source

    GROUP BY

    vehicle_type

    ORDER BY

    vehicle_type;

    GO

    SET STATISTICS TIME ON;

    -- Select a random sample of rows from each group

    -- Minimum 3 rows, maximum 25, 10% of the group size othewise

    WITH SampleSize

    AS (

    -- Find the total number of vehicles of each type

    -- and the sample size to use for each group

    -- (3 rows, 25 rows, or 10%)

    SELECT vehicle_type,

    group_size = COUNT_BIG(*),

    sample_size =

    CASE

    WHEN 0.1 * COUNT_BIG(*) > 25 THEN 25

    WHEN 0.1 * COUNT_BIG(*) < 03 THEN 03

    ELSE CONVERT(BIGINT, 0.1 * COUNT_BIG(*))

    END

    FROM #Source

    GROUP BY

    vehicle_type

    ),

    RandomRows

    AS (

    -- Generate a random row number for each row required from

    -- each group. Maximum rows returned = 25 * group_count

    --

    -- The TOP (9223372036854775807 (= BIGINT.Max)) is just to

    -- force the query plan to calculate the random row number

    -- before the final join

    SELECT TOP (9223372036854775807)

    SS.vehicle_type,

    Random.rn

    FROM SampleSize SS

    CROSS

    APPLY (

    SELECT TOP (SS.sample_size)

    rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), C1.[object_id])) * group_size + 1)

    FROM master.sys.columns C1,

    master.sys.columns C2

    ) Random

    ORDER BY

    SS.vehicle_type,

    Random.rn

    ),

    NumberedSourceRows

    AS (

    -- Number the source rows, restarting the numbering for each group

    SELECT S.row_id,

    S.vehicle_type,

    S.vehicle_id,

    rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)

    FROM #Source S

    )

    -- Fetch the numbered rows that match the random row number, per group

    SELECT NSR.row_id,

    NSR.vehicle_type,

    NSR.vehicle_id

    FROM RandomRows RR

    JOIN NumberedSourceRows NSR

    ON NSR.vehicle_type = RR.vehicle_type

    AND NSR.rn = RR.rn;

    SET STATISTICS TIME OFF;

    GO

    IF OBJECT_ID(N'tempdb..#Source', N'U')

    IS NOT NULL

    DROP TABLE #Source;

    GO

    Paul

    Nicely done but...

    without exceeding 25 records for all.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • They key to simplicity (and, therefor, future maintainability) for this problem is that you have to know how many of each vehicle type to return BEFORE you try to select them. Using Paul's test data, the following meets the spirit of the problem definition...

    SET STATISTICS TIME ON

    ;

    --===== Declare some variables to hold the counts for each vehicle type.

    DECLARE @CountC INT,

    @CountR INT,

    @CountT INT

    ;

    --===== Randomly assign the counts to be returned with the constraint that

    -- no count must be less than 3 and the total count must be 25.

    SELECT @CountC = ABS(CHECKSUM(NEWID()))%16+3,

    @CountR = ABS(CHECKSUM(NEWID()))%(25-@CountC-6)+3,

    @CountT = 25-@CountC-@CountR

    ;

    --===== Make the random vehicle selection based on the counts and return everything

    -- as a single randomly ordered result set.

    WITH

    cteC AS (SELECT TOP (@CountC) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'C' ORDER BY NEWID()),

    cteR AS (SELECT TOP (@CountR) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'R' ORDER BY NEWID()),

    cteT AS (SELECT TOP (@CountT) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'T' ORDER BY NEWID()),

    cteALL AS (SELECT Vehicle_Type, Vehicle_ID FROM cteC UNION ALL

    SELECT Vehicle_Type, Vehicle_ID FROM cteR UNION ALL

    SELECT Vehicle_Type, Vehicle_ID FROM cteT)

    SELECT Vehicle_Type, Vehicle_ID

    FROM cteAll

    ORDER BY NEWID()

    ;

    SET STATISTICS TIME OFF

    ;

    In the formulas...

    The number "25" is the maximum number of rows to return for all categories combined.

    The number "16" is the number 25 minus 9 (the minimum number of rows (3) to return for each category times the number of categories (3)).

    The number "3" is the minimum number of rows to return for each category.

    The number "6" is minimum number of rows (3) you have to return for the 2 remaining categores.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/13/2010)


    Nicely done but...

    without exceeding 25 records for all.

    Well that is just embarrassing. :blush:

    Your solution has some nice features, particularly in the calculation of counts.

    It does raise a question, though: in which order should we allocate?

    You have chosen 'cars', then 'trucks', then 'RVs' - but that seems a bit arbitrary.

    My second point concerns performance.

    The code performs three partial scans of the index, and three sorts.

    CPU time on my machine: 188ms - averaged over ten runs.

    Combining the two approaches reduces the average time to 94ms:

    SET STATISTICS TIME ON;

    --===== Declare some variables to hold the counts for each vehicle type.

    DECLARE @CountC INT,

    @CountR INT,

    @CountT INT;

    --===== Randomly assign the counts to be returned with the constraint that

    -- no count must be less than 3 and the total count must be 25.

    SELECT @CountC = ABS(CHECKSUM(NEWID()))%16+3,

    @CountR = ABS(CHECKSUM(NEWID()))%(25-@CountC-6)+3,

    @CountT = 25-@CountC-@CountR;

    WITH SampleSize

    AS (

    --Find the total number of vehicles of each type

    --and the sample size to use for each group

    --(3 rows, 25 rows, or 10%)

    SELECT vehicle_type,

    group_size = COUNT_BIG(*),

    sample_size = @CountC

    FROM #Source

    WHERE vehicle_type = 'C'

    GROUP BY

    vehicle_type

    UNION ALL

    SELECT vehicle_type,

    group_size = COUNT_BIG(*),

    sample_size = @CountT

    FROM #Source

    WHERE vehicle_type = 'T'

    GROUP BY

    vehicle_type

    UNION ALL

    SELECT vehicle_type,

    group_size = COUNT_BIG(*),

    sample_size = @CountR

    FROM #Source

    WHERE vehicle_type = 'R'

    GROUP BY

    vehicle_type

    ),

    RandomRows

    AS (

    -- Generate a random row number for each row required from

    -- each group. Maximum rows returned = 25 * group_count

    --

    -- The TOP (9223372036854775807 (= BIGINT.Max)) is just to

    -- force the query plan to calculate the random row number

    -- before the final join

    SELECT TOP (9223372036854775807)

    SS.vehicle_type,

    Random.rn

    FROM SampleSize SS

    CROSS

    APPLY (

    SELECT TOP (SS.sample_size)

    rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), N.n)) * SS.group_size + 1)

    FROM dbo.Number N

    ORDER BY

    N.n ASC

    ) Random

    ORDER BY

    SS.vehicle_type,

    Random.rn

    ),

    NumberedSourceRows

    AS (

    -- Number the source rows, restarting the numbering for each group

    SELECT S.row_id,

    S.vehicle_type,

    S.vehicle_id,

    rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)

    FROM #Source S

    )

    -- Fetch the numbered rows that match the random row number, per group

    SELECT NSR.row_id,

    NSR.vehicle_type,

    NSR.vehicle_id

    FROM RandomRows RR

    JOIN NumberedSourceRows NSR

    ON NSR.vehicle_type = RR.vehicle_type

    AND NSR.rn = RR.rn;

    SET STATISTICS TIME OFF;

    For what it's worth, the minimum-3 maximum-25 from each group problem was much more fun!

    Paul

  • Paul White (3/13/2010)


    It does raise a question, though: in which order should we allocate?

    You have chosen 'cars', then 'trucks', then 'RVs' - but that seems a bit arbitrary.

    I initially thought of that, too. But I believe that there will be a nice balance to the 3 insofar as which will have the largest numbers... the first random selection has just as much a chance of being very low as it does very high. Same goes for the others. During some very minor testing, for example, it did produce 3,3,16 a couple of times. I've not done an actual "distribution" test yet to see if it's more random than not, but mathematically, it would seem so. Guess I'll have to do a test because, heh... one test is worth a thousand "expert opinions" (especially mine :-P).

    My second point concerns performance.

    The code performs three partial scans of the index, and three sorts.

    CPU time on my machine: 188ms - averaged over ten runs.

    Combining the two approaches reduces the average time to 94ms:

    I was thinking the same thing afterwards. Considering that I just did a review on your Cross Apply article, the notion struck me that your method and mine could be combined for the best of both worlds, but you beat me to it. You're much better at the use of Cross Apply and it seems to just come natural while I still have to think about it. :blush: It'll become more natural for me as I use it more, but you're still light years ahead of me on its usage.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Holy Moly! The distribution of "R" and "T" suck with my method! I don't quite understand it because the distribution of "C" is nearly flat as you would expect with random numbers. Here's the test code if you'd like to play...

    WITH

    cteC AS

    ( --=== Determine "C"

    SELECT CountC = ABS(CHECKSUM(NEWID()))%16+3

    FROM dbo.Tally t

    WHERE t.N <= 10000

    ),

    cteR AS

    ( --=== Determine "R" constrained by "C"

    SELECT CountC,

    CountR = ABS(CHECKSUM(NEWID()))%(25-CountC-6)+3

    FROM cteC

    ),

    cteT AS

    ( --=== Determine "T" constrained by "C" and "R"

    SELECT CountC,

    CountR,

    CountT = 25-CountC-CountR

    FROM cteR

    ) --=== Unpivot the 3 columns of data and do counts of occurances for each Count"x" value.

    SELECT unpvt.CountName, unpvt.CountX, COUNT(*) AS Occurances

    FROM (

    SELECT CountC,CountR,CountT

    FROM cteT

    ) pvt

    UNPIVOT (

    CountX FOR CountName IN (CountC,CountR,CountT)

    )AS unpvt

    GROUP BY unpvt.CountName, unpvt.CountX

    ORDER BY unpvt.CountName, unpvt.CountX

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ok.... so far, I'm at a total loss as to why the formulas I wrote don't have a more even distibution. Sooooo.... I'll use the power of the computer to overcome my apparent lack of mathematical prowess...

    If we precalculate all possible combinations of numbers that have a value of at least "3" and save only the ones that add up to precisely "25" in a table, then a random selection on that table will give a very nice, very even distibution of the 153 different combinations that add up to "25". It's no different than having any other "helper" table (like a Tally or Calendar table) where you don't have to keep recalcalculating the same thing over and over...

    Here's the code for the "helper" table as above...

    --===== Create a new permanent table with precalculated

    -- vehicle counts with distributions that always

    -- have at least a count of 3 per vehicle type and

    -- have a sum across the vehicle types in each row

    -- that add up to exactly 25.

    SELECT IDENTITY(INT,0,1) AS VehicleTypeCountID,

    tc.N AS CarCount,

    tr.N AS RVCount,

    tt.N AS TruckCount

    INTO dbo.VehicleTypeCount

    FROM dbo.Tally tc

    CROSS JOIN dbo.Tally tr

    CROSS JOIN dbo.Tally tt

    WHERE tc.N BETWEEN 3 AND 19

    AND tr.N BETWEEN 3 AND 19

    AND tt.N BETWEEN 3 AND 19

    AND tc.N + tr.N + tt.N = 25

    ORDER BY CarCount, RVCount, TruckCount --Just for "sanity" checks

    --===== Add the very necessary clustered index (PK in this case)

    -- which will make finding a random selection very fast.

    ALTER TABLE dbo.VehicleTypeCount

    ADD CONSTRAINT PK_VehicleTypeCount

    PRIMARY KEY CLUSTERED (VehicleTypeCountID

    ... and here's the code to make the assignments to the 3 variables we've been using...

    --===== Demonstrate the random selection of distributed counts

    DECLARE @CountC INT,

    @CountR INT,

    @CountT INT

    SELECT @CountC = CarCount,

    @CountR = RVCount,

    @CountT = TruckCount

    FROM dbo.VehicleTypeCount

    WHERE VehicleTypeCountID = ABS(CHECKSUM(NEWID())) % 153 --Number of rows in table

    --===== Display the content of the variables for the "warm fuzzies"

    SELECT @CountC, @CountR, @CountT

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/14/2010)


    Holy Moly! The distribution of "R" and "T" suck with my method! I don't quite understand it because the distribution of "C" is nearly flat as you would expect with random numbers.

    Nice UNPIVOT there Jeff 🙂

    About the distribution:

    C takes a random integer value between 3 and 19. The average value of C is S(3...19) / 16 = 11

    R can now be any random integer value between 3 and (19-C).

    Using the average value of 11 for C, the range of R is (3...11) on average. S(3...11) / 9 = 7

    T ranges from 3 to (25 - R - C) => (25 - 7 - 11) => 7. The range of T is therefore (3...7) on average, and it has an average value of 5.

    Running your test rig over the first million integers (1...1000000), the averages, as reported by the AVG aggregate are:

    C = 10

    R = 6

    T = 7

    Paul

  • Yeah, but the MODE is flat for C... average shouldn't even come into play on this.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/14/2010)


    If we precalculate all possible combinations of numbers that have a value of at least "3" and save only the ones that add up to precisely "25" in a table, then a random selection on that table will give a very nice, very even distibution of the 153 different combinations that add up to "25". It's no different than having any other "helper" table (like a Tally or Calendar table) where you don't have to keep recalcalculating the same thing over and over...

    I like this idea a lot. Keeping it all in one query (though it is a shame that NEWID is illegal in a function):

    SET STATISTICS TIME ON;

    WITH SelectionCounts

    AS (

    -- Number of rows to select from each group

    -- Minimum 3, total for all groups = 25

    -- Choose one of the possible combinations at random

    SELECT TOP (1)

    nc = C.n,

    nt = T.n,

    nr = R.n

    FROM dbo.Number C,

    dbo.Number T,

    dbo.Number R

    WHERE C.n BETWEEN 3 AND 19

    AND T.n BETWEEN 3 AND 19

    AND R.n BETWEEN 3 AND 19

    AND C.n + T.n + R.n = 25

    ORDER BY

    NEWID()

    ),

    GroupCounts

    AS (

    --Find the total number of vehicles of each type

    SELECT vehicle_type,

    group_size = COUNT_BIG(*)

    FROM dbo.Source

    WHERE vehicle_type = 'C'

    GROUP BY

    vehicle_type

    UNION ALL

    SELECT vehicle_type,

    group_size = COUNT_BIG(*)

    FROM dbo.Source

    WHERE vehicle_type = 'T'

    GROUP BY

    vehicle_type

    UNION ALL

    SELECT vehicle_type,

    group_size = COUNT_BIG(*)

    FROM dbo.Source

    WHERE vehicle_type = 'R'

    GROUP BY

    vehicle_type

    ),

    Parameters

    AS (

    SELECT GC.group_size,

    GC.vehicle_type,

    sample_size =

    CASE GC.vehicle_type

    WHEN 'C' THEN SC.nc

    WHEN 'T' THEN SC.nt

    WHEN 'R' THEN SC.nr

    ELSE 0

    END

    FROM GroupCounts GC,

    SelectionCounts SC

    ),

    RandomRows

    AS (

    -- Generate a random row number for each row required from

    -- each group. Maximum rows returned = 25 * group_count

    --

    -- The TOP (9223372036854775807 (= BIGINT.Max)) is just to

    -- force the query plan to calculate the random row number

    -- before the final join

    SELECT TOP (9223372036854775807)

    P.vehicle_type,

    Random.rn

    FROM Parameters P

    CROSS

    APPLY (

    SELECT TOP (P.sample_size)

    rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), N.n)) * P.group_size + 1)

    FROM dbo.Number N

    ORDER BY

    N.n ASC

    ) Random

    ORDER BY

    P.vehicle_type,

    Random.rn

    ),

    NumberedSourceRows

    AS (

    -- Number the source rows, restarting the numbering for each group

    SELECT S.row_id,

    S.vehicle_type,

    S.vehicle_id,

    rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)

    FROM dbo.Source S

    )

    -- Fetch the numbered rows that match the random row number, per group

    SELECT NSR.row_id,

    NSR.vehicle_type,

    NSR.vehicle_id

    FROM RandomRows RR

    JOIN NumberedSourceRows NSR

    ON NSR.vehicle_type = RR.vehicle_type

    AND NSR.rn = RR.rn;

    SET STATISTICS TIME OFF;

    The above code returns the expected even distribution, and runs in an average of 94ms - exactly the same as the previous method.

    Using a persisted table for one of the 153 possible combinations results in a slightly cleaner plan, but the CPU time remains at 94ms.

    Paul

  • Jeff Moden (3/14/2010)


    Yeah, but the MODE is flat for C... average shouldn't even come into play on this.

    I do not count math among my stronger skills 🙂

    The idea was that the average for C determines the average range for R...and so on. It might be complete nonsense...:unsure:

Viewing 15 posts - 1 through 15 (of 33 total)

You must be logged in to reply to this topic. Login to reply