select random records but for each category

Question

Post reply

select random records but for each category

SQL_NuB

Hall of Fame

Points: 3666
More actions
March 11, 2010 at 1:19 pm

#216484

I have a request to pull out random records from one table but atleast 3 records from each category without exceeding 25 records for all.
I know how to pull out the random records, but how can I pull out 3 random for each category?
example:
table1 has 100 records
3 categories, (car, truck, RV)
I need to pull out car records, truck records, and RV records, all random and no more then 25 in total and no less then 3 from each category,
is something like this possible with using 1 table like I have? I'm currently pulling a random record set from this table, but now I have to break it down by categories
this is the query i'm using to pull out the top 10 percent, no more then 25 and I'm updated an isSelected Field so its not picked again. How can I take this one step further and bring back at least 3 for each category?
SET ROWCOUNT 25
INSERT INTO @SelectedId
SELECT TOP 10 percent cId
FROM CarInformation
where salesman= @SalesPerson and isSelected is Null and DateClosed is NOT NULL and [quarter] = @Qtr and [Year] = @Year
ORDER BY NEWID()
UPDATE CarInformation
SET isSelected = 1
FROM CarInformation AS F
INNER JOIN @SelectedId AS S
ON F.cId = S.cId
SELECT * FROM CarInformation AS F
INNER JOIN @SelectedId AS S ON F.clmId = S.cId

Viewing 15 posts - 1 through 15 (of 33 total)

You must be logged in to reply to this topic. Login to reply

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 1

SELECT <column list> FROM (

SELECT <column list>, ROW_NUMBER() OVER (Partition By Category Order By (SELECT NEWID())) AS RowNo

FROM Table1) sub

WHERE rowNo <=3

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 2

Hmmmm.... this problem is coming up more and more... must be a new homework problem by a given teacher in some school or maybe a favorite "take home" interview question by a certain DBA.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

SQL_NuB Hall of Fame Points: 3666 More actions · Answer 3

nope no homework, no interview questions, its an actualy business requirement for this application that was once an access app and everything was done manually, now i'm automating it and I need to duplicate the selecting of random records, where before a user went in and select 5 from one, 5 from another and so on for a total of 25, now I need to automate that portion of the new tool.

Paul White SSC Guru Points: 150467 More actions · Answer 4

As far as I can see, the previously posted code doesn't meet the requirement to select a minimum of three rows, a maximum of 25, and 10% of the group size otherwise. It is also rather inefficient for larger group sizes, since NEWID() is evaluated for every row, and the entire set must be sorted on that value.

Here is my attempt, which selects the required rows from 100,000 random records (split into 3 groups) in 93ms on my old laptop.

-- Drop our test table if it exists

IF OBJECT_ID(N'tempdb..#Source', N'U')

IS NOT NULL

DROP TABLE #Source;

GO

-- Test table

CREATE TABLE #Source

(

row_id INTEGER IDENTITY PRIMARY KEY,

vehicle_type CHAR(1) NOT NULL,

vehicle_id INTEGER NOT NULL

);

GO

-- Add 100,000 random rows, roughly evenly distributed

-- across vehicle types

INSERT #Source

(

vehicle_type,

vehicle_id

)

SELECT TOP (100000)

CASE

-- Chance of being a car

WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'C'

-- Chance of being a truck

WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'T'

-- Chance of being an RV

ELSE 'R'

END,

-- Random vehicle id

ABS(CHECKSUM(NEWID()))

FROM master.sys.columns C1,

master.sys.columns C2,

master.sys.columns C3;

GO

-- A helpful index

CREATE INDEX nc1

ON #Source

(vehicle_type ASC, row_id ASC)

INCLUDE (vehicle_id)

WITH (FILLFACTOR = 100, MAXDOP = 1);

GO

-- Show the number of vehicles of each type

SELECT vehicle_type,

row_count = COUNT_BIG(*)

FROM #Source

GROUP BY

vehicle_type

ORDER BY

vehicle_type;

GO

SET STATISTICS TIME ON;

-- Select a random sample of rows from each group

-- Minimum 3 rows, maximum 25, 10% of the group size othewise

WITH SampleSize

AS (

-- Find the total number of vehicles of each type

-- and the sample size to use for each group

-- (3 rows, 25 rows, or 10%)

SELECT vehicle_type,

group_size = COUNT_BIG(*),

sample_size =

CASE

WHEN 0.1 * COUNT_BIG(*) > 25 THEN 25

WHEN 0.1 * COUNT_BIG(*) < 03 THEN 03

ELSE CONVERT(BIGINT, 0.1 * COUNT_BIG(*))

END

FROM #Source

GROUP BY

vehicle_type

),

RandomRows

AS (

-- Generate a random row number for each row required from

-- each group. Maximum rows returned = 25 * group_count

--

-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to

-- force the query plan to calculate the random row number

-- before the final join

SELECT TOP (9223372036854775807)

SS.vehicle_type,

Random.rn

FROM SampleSize SS

CROSS

APPLY (

SELECT TOP (SS.sample_size)

rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), C1.[object_id])) * group_size + 1)

FROM master.sys.columns C1,

master.sys.columns C2

) Random

ORDER BY

SS.vehicle_type,

Random.rn

),

NumberedSourceRows

AS (

-- Number the source rows, restarting the numbering for each group

SELECT S.row_id,

S.vehicle_type,

S.vehicle_id,

rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)

FROM #Source S

)

-- Fetch the numbered rows that match the random row number, per group

SELECT NSR.row_id,

NSR.vehicle_type,

NSR.vehicle_id

FROM RandomRows RR

JOIN NumberedSourceRows NSR

ON NSR.vehicle_type = RR.vehicle_type

AND NSR.rn = RR.rn;

SET STATISTICS TIME OFF;

GO

IF OBJECT_ID(N'tempdb..#Source', N'U')

IS NOT NULL

DROP TABLE #Source;

GO

Paul

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 5

Paul White (3/12/2010)
As far as I can see, the previously posted code doesn't meet the requirement to select a minimum of three rows, a maximum of 25, and 10% of the group size otherwise. It is also rather inefficient for larger group sizes, since NEWID() is evaluated for every row, and the entire set must be sorted on that value.
Here is my attempt, which selects the required rows from 100,000 random records (split into 3 groups) in 93ms on my old laptop.
-- Drop our test table if it exists
IF OBJECT_ID(N'tempdb..#Source', N'U')
IS NOT NULL
DROP TABLE #Source;
GO
-- Test table
CREATE TABLE #Source
(
row_id INTEGER IDENTITY PRIMARY KEY,
vehicle_type CHAR(1) NOT NULL,
vehicle_id INTEGER NOT NULL
);
GO
-- Add 100,000 random rows, roughly evenly distributed
-- across vehicle types
INSERT #Source
(
vehicle_type,
vehicle_id
)
SELECT TOP (100000)
CASE
-- Chance of being a car
WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'C'
-- Chance of being a truck
WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'T'
-- Chance of being an RV
ELSE 'R'
END,
-- Random vehicle id
ABS(CHECKSUM(NEWID()))
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3;
GO
-- A helpful index
CREATE INDEX nc1
ON #Source
(vehicle_type ASC, row_id ASC)
INCLUDE (vehicle_id)
WITH (FILLFACTOR = 100, MAXDOP = 1);
GO
-- Show the number of vehicles of each type
SELECT vehicle_type,
row_count = COUNT_BIG(*)
FROM #Source
GROUP BY
vehicle_type
ORDER BY
vehicle_type;
GO
SET STATISTICS TIME ON;
-- Select a random sample of rows from each group
-- Minimum 3 rows, maximum 25, 10% of the group size othewise
WITH SampleSize
AS (
-- Find the total number of vehicles of each type
-- and the sample size to use for each group
-- (3 rows, 25 rows, or 10%)
SELECT vehicle_type,
group_size = COUNT_BIG(*),
sample_size =
CASE
WHEN 0.1 * COUNT_BIG(*) > 25 THEN 25
WHEN 0.1 * COUNT_BIG(*) < 03 THEN 03
ELSE CONVERT(BIGINT, 0.1 * COUNT_BIG(*))
END
FROM #Source
GROUP BY
vehicle_type
),
RandomRows
AS (
-- Generate a random row number for each row required from
-- each group. Maximum rows returned = 25 * group_count
--
-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to
-- force the query plan to calculate the random row number
-- before the final join
SELECT TOP (9223372036854775807)
SS.vehicle_type,
Random.rn
FROM SampleSize SS
CROSS
APPLY (
SELECT TOP (SS.sample_size)
rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), C1.[object_id])) * group_size + 1)
FROM master.sys.columns C1,
master.sys.columns C2
) Random
ORDER BY
SS.vehicle_type,
Random.rn
),
NumberedSourceRows
AS (
-- Number the source rows, restarting the numbering for each group
SELECT S.row_id,
S.vehicle_type,
S.vehicle_id,
rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)
FROM #Source S
)
-- Fetch the numbered rows that match the random row number, per group
SELECT NSR.row_id,
NSR.vehicle_type,
NSR.vehicle_id
FROM RandomRows RR
JOIN NumberedSourceRows NSR
ON NSR.vehicle_type = RR.vehicle_type
AND NSR.rn = RR.rn;
SET STATISTICS TIME OFF;
GO
IF OBJECT_ID(N'tempdb..#Source', N'U')
IS NOT NULL
DROP TABLE #Source;
GO
Paul

Nicely done but...

without exceeding 25 records for all.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 6

They key to simplicity (and, therefor, future maintainability) for this problem is that you have to know how many of each vehicle type to return BEFORE you try to select them. Using Paul's test data, the following meets the spirit of the problem definition...

SET STATISTICS TIME ON

;

--===== Declare some variables to hold the counts for each vehicle type.

DECLARE @CountC INT,

@CountR INT,

@CountT INT

;

--===== Randomly assign the counts to be returned with the constraint that

-- no count must be less than 3 and the total count must be 25.

SELECT @CountC = ABS(CHECKSUM(NEWID()))%16+3,

@CountR = ABS(CHECKSUM(NEWID()))%(25-@CountC-6)+3,

@CountT = 25-@CountC-@CountR

;

--===== Make the random vehicle selection based on the counts and return everything

-- as a single randomly ordered result set.

WITH

cteC AS (SELECT TOP (@CountC) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'C' ORDER BY NEWID()),

cteR AS (SELECT TOP (@CountR) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'R' ORDER BY NEWID()),

cteT AS (SELECT TOP (@CountT) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'T' ORDER BY NEWID()),

cteALL AS (SELECT Vehicle_Type, Vehicle_ID FROM cteC UNION ALL

SELECT Vehicle_Type, Vehicle_ID FROM cteR UNION ALL

SELECT Vehicle_Type, Vehicle_ID FROM cteT)

SELECT Vehicle_Type, Vehicle_ID

FROM cteAll

ORDER BY NEWID()

;

SET STATISTICS TIME OFF

;

In the formulas...

The number "25" is the maximum number of rows to return for all categories combined.

The number "16" is the number 25 minus 9 (the minimum number of rows (3) to return for each category times the number of categories (3)).

The number "3" is the minimum number of rows to return for each category.

The number "6" is minimum number of rows (3) you have to return for the 2 remaining categores.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150467 More actions · Answer 7

Jeff Moden (3/13/2010)
Nicely done but...
without exceeding 25 records for all.

Well that is just embarrassing. :blush:

Your solution has some nice features, particularly in the calculation of counts.

It does raise a question, though: in which order should we allocate?

You have chosen 'cars', then 'trucks', then 'RVs' - but that seems a bit arbitrary.

My second point concerns performance.

The code performs three partial scans of the index, and three sorts.

CPU time on my machine: 188ms - averaged over ten runs.

Combining the two approaches reduces the average time to 94ms:

SET STATISTICS TIME ON;

--===== Declare some variables to hold the counts for each vehicle type.

DECLARE @CountC INT,

@CountR INT,

@CountT INT;

--===== Randomly assign the counts to be returned with the constraint that

-- no count must be less than 3 and the total count must be 25.

SELECT @CountC = ABS(CHECKSUM(NEWID()))%16+3,

@CountR = ABS(CHECKSUM(NEWID()))%(25-@CountC-6)+3,

@CountT = 25-@CountC-@CountR;

WITH SampleSize

AS (

--Find the total number of vehicles of each type

--and the sample size to use for each group

--(3 rows, 25 rows, or 10%)

SELECT vehicle_type,

group_size = COUNT_BIG(*),

sample_size = @CountC

FROM #Source

WHERE vehicle_type = 'C'

GROUP BY

vehicle_type

UNION ALL

SELECT vehicle_type,

group_size = COUNT_BIG(*),

sample_size = @CountT

FROM #Source

WHERE vehicle_type = 'T'

GROUP BY

vehicle_type

UNION ALL

SELECT vehicle_type,

group_size = COUNT_BIG(*),

sample_size = @CountR

FROM #Source

WHERE vehicle_type = 'R'

GROUP BY

vehicle_type

),

RandomRows

AS (

-- Generate a random row number for each row required from

-- each group. Maximum rows returned = 25 * group_count

--

-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to

-- force the query plan to calculate the random row number

-- before the final join

SELECT TOP (9223372036854775807)

SS.vehicle_type,

Random.rn

FROM SampleSize SS

CROSS

APPLY (

SELECT TOP (SS.sample_size)

rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), N.n)) * SS.group_size + 1)

FROM dbo.Number N

ORDER BY

N.n ASC

) Random

ORDER BY

SS.vehicle_type,

Random.rn

),

NumberedSourceRows

AS (

-- Number the source rows, restarting the numbering for each group

SELECT S.row_id,

S.vehicle_type,

S.vehicle_id,

rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)

FROM #Source S

)

-- Fetch the numbered rows that match the random row number, per group

SELECT NSR.row_id,

NSR.vehicle_type,

NSR.vehicle_id

FROM RandomRows RR

JOIN NumberedSourceRows NSR

ON NSR.vehicle_type = RR.vehicle_type

AND NSR.rn = RR.rn;

SET STATISTICS TIME OFF;

For what it's worth, the minimum-3 maximum-25 from each group problem was much more fun!

Paul

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 8

Paul White (3/13/2010)
It does raise a question, though: in which order should we allocate?
You have chosen 'cars', then 'trucks', then 'RVs' - but that seems a bit arbitrary.

I initially thought of that, too. But I believe that there will be a nice balance to the 3 insofar as which will have the largest numbers... the first random selection has just as much a chance of being very low as it does very high. Same goes for the others. During some very minor testing, for example, it did produce 3,3,16 a couple of times. I've not done an actual "distribution" test yet to see if it's more random than not, but mathematically, it would seem so. Guess I'll have to do a test because, heh... one test is worth a thousand "expert opinions" (especially mine :-P).

My second point concerns performance.
The code performs three partial scans of the index, and three sorts.
CPU time on my machine: 188ms - averaged over ten runs.
Combining the two approaches reduces the average time to 94ms:

I was thinking the same thing afterwards. Considering that I just did a review on your Cross Apply article, the notion struck me that your method and mine could be combined for the best of both worlds, but you beat me to it. You're much better at the use of Cross Apply and it seems to just come natural while I still have to think about it. :blush: It'll become more natural for me as I use it more, but you're still light years ahead of me on its usage.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 9

Holy Moly! The distribution of "R" and "T" suck with my method! I don't quite understand it because the distribution of "C" is nearly flat as you would expect with random numbers. Here's the test code if you'd like to play...

WITH

cteC AS

( --=== Determine "C"

SELECT CountC = ABS(CHECKSUM(NEWID()))%16+3

FROM dbo.Tally t

WHERE t.N <= 10000

),

cteR AS

( --=== Determine "R" constrained by "C"

SELECT CountC,

CountR = ABS(CHECKSUM(NEWID()))%(25-CountC-6)+3

FROM cteC

),

cteT AS

( --=== Determine "T" constrained by "C" and "R"

SELECT CountC,

CountR,

CountT = 25-CountC-CountR

FROM cteR

) --=== Unpivot the 3 columns of data and do counts of occurances for each Count"x" value.

SELECT unpvt.CountName, unpvt.CountX, COUNT(*) AS Occurances

FROM (

SELECT CountC,CountR,CountT

FROM cteT

) pvt

UNPIVOT (

CountX FOR CountName IN (CountC,CountR,CountT)

)AS unpvt

GROUP BY unpvt.CountName, unpvt.CountX

ORDER BY unpvt.CountName, unpvt.CountX

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 10

Ok.... so far, I'm at a total loss as to why the formulas I wrote don't have a more even distibution. Sooooo.... I'll use the power of the computer to overcome my apparent lack of mathematical prowess...

If we precalculate all possible combinations of numbers that have a value of at least "3" and save only the ones that add up to precisely "25" in a table, then a random selection on that table will give a very nice, very even distibution of the 153 different combinations that add up to "25". It's no different than having any other "helper" table (like a Tally or Calendar table) where you don't have to keep recalcalculating the same thing over and over...

Here's the code for the "helper" table as above...

--===== Create a new permanent table with precalculated

-- vehicle counts with distributions that always

-- have at least a count of 3 per vehicle type and

-- have a sum across the vehicle types in each row

-- that add up to exactly 25.

SELECT IDENTITY(INT,0,1) AS VehicleTypeCountID,

tc.N AS CarCount,

tr.N AS RVCount,

tt.N AS TruckCount

INTO dbo.VehicleTypeCount

FROM dbo.Tally tc

CROSS JOIN dbo.Tally tr

CROSS JOIN dbo.Tally tt

WHERE tc.N BETWEEN 3 AND 19

AND tr.N BETWEEN 3 AND 19

AND tt.N BETWEEN 3 AND 19

AND tc.N + tr.N + tt.N = 25

ORDER BY CarCount, RVCount, TruckCount --Just for "sanity" checks

--===== Add the very necessary clustered index (PK in this case)

-- which will make finding a random selection very fast.

ALTER TABLE dbo.VehicleTypeCount

ADD CONSTRAINT PK_VehicleTypeCount

PRIMARY KEY CLUSTERED (VehicleTypeCountID

... and here's the code to make the assignments to the 3 variables we've been using...

--===== Demonstrate the random selection of distributed counts

DECLARE @CountC INT,

@CountR INT,

@CountT INT

SELECT @CountC = CarCount,

@CountR = RVCount,

@CountT = TruckCount

FROM dbo.VehicleTypeCount

WHERE VehicleTypeCountID = ABS(CHECKSUM(NEWID())) % 153 --Number of rows in table

--===== Display the content of the variables for the "warm fuzzies"

SELECT @CountC, @CountR, @CountT

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150467 More actions · Answer 11

Jeff Moden (3/14/2010)
Holy Moly! The distribution of "R" and "T" suck with my method! I don't quite understand it because the distribution of "C" is nearly flat as you would expect with random numbers.

Nice UNPIVOT there Jeff 🙂

About the distribution:

C takes a random integer value between 3 and 19. The average value of C is S(3...19) / 16 = 11

R can now be any random integer value between 3 and (19-C).

Using the average value of 11 for C, the range of R is (3...11) on average. S(3...11) / 9 = 7

T ranges from 3 to (25 - R - C) => (25 - 7 - 11) => 7. The range of T is therefore (3...7) on average, and it has an average value of 5.

Running your test rig over the first million integers (1...1000000), the averages, as reported by the AVG aggregate are:

C = 10

R = 6

T = 7

Paul

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 12

Yeah, but the MODE is flat for C... average shouldn't even come into play on this.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Paul White SSC Guru Points: 150467 More actions · Answer 13

Jeff Moden (3/14/2010)
If we precalculate all possible combinations of numbers that have a value of at least "3" and save only the ones that add up to precisely "25" in a table, then a random selection on that table will give a very nice, very even distibution of the 153 different combinations that add up to "25". It's no different than having any other "helper" table (like a Tally or Calendar table) where you don't have to keep recalcalculating the same thing over and over...

I like this idea a lot. Keeping it all in one query (though it is a shame that NEWID is illegal in a function):

SET STATISTICS TIME ON;

WITH SelectionCounts

AS (

-- Number of rows to select from each group

-- Minimum 3, total for all groups = 25

-- Choose one of the possible combinations at random

SELECT TOP (1)

nc = C.n,

nt = T.n,

nr = R.n

FROM dbo.Number C,

dbo.Number T,

dbo.Number R

WHERE C.n BETWEEN 3 AND 19

AND T.n BETWEEN 3 AND 19

AND R.n BETWEEN 3 AND 19

AND C.n + T.n + R.n = 25

ORDER BY

NEWID()

),

GroupCounts

AS (

--Find the total number of vehicles of each type

SELECT vehicle_type,

group_size = COUNT_BIG(*)

FROM dbo.Source

WHERE vehicle_type = 'C'

GROUP BY

vehicle_type

UNION ALL

SELECT vehicle_type,

group_size = COUNT_BIG(*)

FROM dbo.Source

WHERE vehicle_type = 'T'

GROUP BY

vehicle_type

UNION ALL

SELECT vehicle_type,

group_size = COUNT_BIG(*)

FROM dbo.Source

WHERE vehicle_type = 'R'

GROUP BY

vehicle_type

),

Parameters

AS (

SELECT GC.group_size,

GC.vehicle_type,

sample_size =

CASE GC.vehicle_type

WHEN 'C' THEN SC.nc

WHEN 'T' THEN SC.nt

WHEN 'R' THEN SC.nr

ELSE 0

END

FROM GroupCounts GC,

SelectionCounts SC

),

RandomRows

AS (

-- Generate a random row number for each row required from

-- each group. Maximum rows returned = 25 * group_count

--

-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to

-- force the query plan to calculate the random row number

-- before the final join

SELECT TOP (9223372036854775807)

P.vehicle_type,

Random.rn

FROM Parameters P

CROSS

APPLY (

SELECT TOP (P.sample_size)

rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), N.n)) * P.group_size + 1)

FROM dbo.Number N

ORDER BY

N.n ASC

) Random

ORDER BY

P.vehicle_type,

Random.rn

),

NumberedSourceRows

AS (

-- Number the source rows, restarting the numbering for each group

SELECT S.row_id,

S.vehicle_type,

S.vehicle_id,

rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)

FROM dbo.Source S

)

-- Fetch the numbered rows that match the random row number, per group

SELECT NSR.row_id,

NSR.vehicle_type,

NSR.vehicle_id

FROM RandomRows RR

JOIN NumberedSourceRows NSR

ON NSR.vehicle_type = RR.vehicle_type

AND NSR.rn = RR.rn;

SET STATISTICS TIME OFF;

The above code returns the expected even distribution, and runs in an average of 94ms - exactly the same as the previous method.

Using a persisted table for one of the 153 possible combinations results in a slightly cleaner plan, but the CPU time remains at 94ms.

Paul

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Paul White SSC Guru Points: 150467 More actions · Answer 14

Jeff Moden (3/14/2010)
Yeah, but the MODE is flat for C... average shouldn't even come into play on this.

I do not count math among my stronger skills 🙂

The idea was that the average for C determines the average range for R...and so on. It might be complete nonsense...:unsure:

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi