Jeff Moden (3/14/2010)
If we precalculate all possible combinations of numbers that have a value of at least "3" and save only the ones that add up to precisely "25" in a table, then a random selection on that table will give a very nice, very even distibution of the 153 different combinations that add up to "25". It's no different than having any other "helper" table (like a Tally or Calendar table) where you don't have to keep recalcalculating the same thing over and over...
I like this idea a lot. Keeping it all in one query (though it is a shame that NEWID is illegal in a function):
SET STATISTICS TIME ON;
WITH SelectionCounts
AS (
-- Number of rows to select from each group
-- Minimum 3, total for all groups = 25
-- Choose one of the possible combinations at random
SELECT TOP (1)
nc = C.n,
nt = T.n,
nr = R.n
FROM dbo.Number C,
dbo.Number T,
dbo.Number R
WHERE C.n BETWEEN 3 AND 19
AND T.n BETWEEN 3 AND 19
AND R.n BETWEEN 3 AND 19
AND C.n + T.n + R.n = 25
ORDER BY
NEWID()
),
GroupCounts
AS (
--Find the total number of vehicles of each type
SELECT vehicle_type,
group_size = COUNT_BIG(*)
FROM dbo.Source
WHERE vehicle_type = 'C'
GROUP BY
vehicle_type
UNION ALL
SELECT vehicle_type,
group_size = COUNT_BIG(*)
FROM dbo.Source
WHERE vehicle_type = 'T'
GROUP BY
vehicle_type
UNION ALL
SELECT vehicle_type,
group_size = COUNT_BIG(*)
FROM dbo.Source
WHERE vehicle_type = 'R'
GROUP BY
vehicle_type
),
Parameters
AS (
SELECT GC.group_size,
GC.vehicle_type,
sample_size =
CASE GC.vehicle_type
WHEN 'C' THEN SC.nc
WHEN 'T' THEN SC.nt
WHEN 'R' THEN SC.nr
ELSE 0
END
FROM GroupCounts GC,
SelectionCounts SC
),
RandomRows
AS (
-- Generate a random row number for each row required from
-- each group. Maximum rows returned = 25 * group_count
--
-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to
-- force the query plan to calculate the random row number
-- before the final join
SELECT TOP (9223372036854775807)
P.vehicle_type,
Random.rn
FROM Parameters P
CROSS
APPLY (
SELECT TOP (P.sample_size)
rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), N.n)) * P.group_size + 1)
FROM dbo.Number N
ORDER BY
N.n ASC
) Random
ORDER BY
P.vehicle_type,
Random.rn
),
NumberedSourceRows
AS (
-- Number the source rows, restarting the numbering for each group
SELECT S.row_id,
S.vehicle_type,
S.vehicle_id,
rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)
FROM dbo.Source S
)
-- Fetch the numbered rows that match the random row number, per group
SELECT NSR.row_id,
NSR.vehicle_type,
NSR.vehicle_id
FROM RandomRows RR
JOIN NumberedSourceRows NSR
ON NSR.vehicle_type = RR.vehicle_type
AND NSR.rn = RR.rn;
SET STATISTICS TIME OFF;
The above code returns the expected even distribution, and runs in an average of 94ms - exactly the same as the previous method.
Using a persisted table for one of the 153 possible combinations results in a slightly cleaner plan, but the CPU time remains at 94ms.
Paul
Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi