RE: select random records but for each category

Paul White

SSC Guru

Points: 150467

March 13, 2010 at 8:33 pm

#1133294

Jeff Moden (3/13/2010)
Nicely done but...
without exceeding 25 records for all.

Well that is just embarrassing. :blush:

Your solution has some nice features, particularly in the calculation of counts.

It does raise a question, though: in which order should we allocate?

You have chosen 'cars', then 'trucks', then 'RVs' - but that seems a bit arbitrary.

My second point concerns performance.

The code performs three partial scans of the index, and three sorts.

CPU time on my machine: 188ms - averaged over ten runs.

Combining the two approaches reduces the average time to 94ms:

SET STATISTICS TIME ON;

--===== Declare some variables to hold the counts for each vehicle type.

DECLARE @CountC INT,

@CountR INT,

@CountT INT;

--===== Randomly assign the counts to be returned with the constraint that

-- no count must be less than 3 and the total count must be 25.

SELECT @CountC = ABS(CHECKSUM(NEWID()))%16+3,

@CountR = ABS(CHECKSUM(NEWID()))%(25-@CountC-6)+3,

@CountT = 25-@CountC-@CountR;

WITH SampleSize

AS (

--Find the total number of vehicles of each type

--and the sample size to use for each group

--(3 rows, 25 rows, or 10%)

SELECT vehicle_type,

group_size = COUNT_BIG(*),

sample_size = @CountC

FROM #Source

WHERE vehicle_type = 'C'

GROUP BY

vehicle_type

UNION ALL

SELECT vehicle_type,

group_size = COUNT_BIG(*),

sample_size = @CountT

FROM #Source

WHERE vehicle_type = 'T'

GROUP BY

vehicle_type

UNION ALL

SELECT vehicle_type,

group_size = COUNT_BIG(*),

sample_size = @CountR

FROM #Source

WHERE vehicle_type = 'R'

GROUP BY

vehicle_type

),

RandomRows

AS (

-- Generate a random row number for each row required from

-- each group. Maximum rows returned = 25 * group_count

--

-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to

-- force the query plan to calculate the random row number

-- before the final join

SELECT TOP (9223372036854775807)

SS.vehicle_type,

Random.rn

FROM SampleSize SS

CROSS

APPLY (

SELECT TOP (SS.sample_size)

rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), N.n)) * SS.group_size + 1)

FROM dbo.Number N

ORDER BY

N.n ASC

) Random

ORDER BY

SS.vehicle_type,

Random.rn

),

NumberedSourceRows

AS (

-- Number the source rows, restarting the numbering for each group

SELECT S.row_id,

S.vehicle_type,

S.vehicle_id,

rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)

FROM #Source S

)

-- Fetch the numbered rows that match the random row number, per group

SELECT NSR.row_id,

NSR.vehicle_type,

NSR.vehicle_id

FROM RandomRows RR

JOIN NumberedSourceRows NSR

ON NSR.vehicle_type = RR.vehicle_type

AND NSR.rn = RR.rn;

SET STATISTICS TIME OFF;

For what it's worth, the minimum-3 maximum-25 from each group problem was much more fun!

Paul

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi