SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


select random records but for each category


select random records but for each category

Author
Message
SQL_NuB
SQL_NuB
Right there with Babe
Right there with Babe (722 reputation)Right there with Babe (722 reputation)Right there with Babe (722 reputation)Right there with Babe (722 reputation)Right there with Babe (722 reputation)Right there with Babe (722 reputation)Right there with Babe (722 reputation)Right there with Babe (722 reputation)

Group: General Forum Members
Points: 722 Visits: 364
I have a request to pull out random records from one table but atleast 3 records from each category without exceeding 25 records for all.

I know how to pull out the random records, but how can I pull out 3 random for each category?

example:
table1 has 100 records
3 categories, (car, truck, RV)
I need to pull out car records, truck records, and RV records, all random and no more then 25 in total and no less then 3 from each category,

is something like this possible with using 1 table like I have? I'm currently pulling a random record set from this table, but now I have to break it down by categories

this is the query i'm using to pull out the top 10 percent, no more then 25 and I'm updated an isSelected Field so its not picked again. How can I take this one step further and bring back at least 3 for each category?


SET ROWCOUNT 25
INSERT INTO @SelectedId
SELECT TOP 10 percent cId
FROM CarInformation
where salesman= @SalesPerson and isSelected is Null and DateClosed is NOT NULL and [quarter] = @Qtr and [Year] = @Year

ORDER BY NEWID()

UPDATE CarInformation
SET isSelected = 1
FROM CarInformation AS F
INNER JOIN @SelectedId AS S
ON F.cId = S.cId
SELECT * FROM CarInformation AS F
INNER JOIN @SelectedId AS S ON F.clmId = S.cId
GilaMonster
GilaMonster
SSC Guru
SSC Guru (212K reputation)SSC Guru (212K reputation)SSC Guru (212K reputation)SSC Guru (212K reputation)SSC Guru (212K reputation)SSC Guru (212K reputation)SSC Guru (212K reputation)SSC Guru (212K reputation)

Group: General Forum Members
Points: 212015 Visits: 46259
SELECT <column list> FROM (
SELECT <column list>, ROW_NUMBER() OVER (Partition By Category Order By (SELECT NEWID())) AS RowNo
FROM Table1) sub
WHERE rowNo <=3



Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass


Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)

Group: General Forum Members
Points: 204090 Visits: 41949
Hmmmm.... this problem is coming up more and more... must be a new homework problem by a given teacher in some school or maybe a favorite "take home" interview question by a certain DBA.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
SQL_NuB
SQL_NuB
Right there with Babe
Right there with Babe (722 reputation)Right there with Babe (722 reputation)Right there with Babe (722 reputation)Right there with Babe (722 reputation)Right there with Babe (722 reputation)Right there with Babe (722 reputation)Right there with Babe (722 reputation)Right there with Babe (722 reputation)

Group: General Forum Members
Points: 722 Visits: 364
nope no homework, no interview questions, its an actualy business requirement for this application that was once an access app and everything was done manually, now i'm automating it and I need to duplicate the selecting of random records, where before a user went in and select 5 from one, 5 from another and so on for a total of 25, now I need to automate that portion of the new tool.
Paul White
Paul White
SSC-Dedicated
SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)

Group: General Forum Members
Points: 33724 Visits: 11359
As far as I can see, the previously posted code doesn't meet the requirement to select a minimum of three rows, a maximum of 25, and 10% of the group size otherwise. It is also rather inefficient for larger group sizes, since NEWID() is evaluated for every row, and the entire set must be sorted on that value.

Here is my attempt, which selects the required rows from 100,000 random records (split into 3 groups) in 93ms on my old laptop.


-- Drop our test table if it exists
IF OBJECT_ID(N'tempdb..#Source', N'U')
IS NOT NULL
DROP TABLE #Source;
GO
-- Test table
CREATE TABLE #Source
(
row_id INTEGER IDENTITY PRIMARY KEY,
vehicle_type CHAR(1) NOT NULL,
vehicle_id INTEGER NOT NULL
);
GO
-- Add 100,000 random rows, roughly evenly distributed
-- across vehicle types
INSERT #Source
(
vehicle_type,
vehicle_id
)
SELECT TOP (100000)
CASE
-- Chance of being a car
WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'C'
-- Chance of being a truck
WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'T'
-- Chance of being an RV
ELSE 'R'
END,
-- Random vehicle id
ABS(CHECKSUM(NEWID()))
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3;
GO
-- A helpful index
CREATE INDEX nc1
ON #Source
(vehicle_type ASC, row_id ASC)
INCLUDE (vehicle_id)
WITH (FILLFACTOR = 100, MAXDOP = 1);
GO
-- Show the number of vehicles of each type
SELECT vehicle_type,
row_count = COUNT_BIG(*)
FROM #Source
GROUP BY
vehicle_type
ORDER BY
vehicle_type;
GO
SET STATISTICS TIME ON;

-- Select a random sample of rows from each group
-- Minimum 3 rows, maximum 25, 10% of the group size othewise
WITH SampleSize
AS (
-- Find the total number of vehicles of each type
-- and the sample size to use for each group
-- (3 rows, 25 rows, or 10%)
SELECT vehicle_type,
group_size = COUNT_BIG(*),
sample_size =
CASE
WHEN 0.1 * COUNT_BIG(*) > 25 THEN 25
WHEN 0.1 * COUNT_BIG(*) < 03 THEN 03
ELSE CONVERT(BIGINT, 0.1 * COUNT_BIG(*))
END
FROM #Source
GROUP BY
vehicle_type
),
RandomRows
AS (
-- Generate a random row number for each row required from
-- each group. Maximum rows returned = 25 * group_count
--
-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to
-- force the query plan to calculate the random row number
-- before the final join
SELECT TOP (9223372036854775807)
SS.vehicle_type,
Random.rn
FROM SampleSize SS
CROSS
APPLY (
SELECT TOP (SS.sample_size)
rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), C1.[object_id])) * group_size + 1)
FROM master.sys.columns C1,
master.sys.columns C2
) Random
ORDER BY
SS.vehicle_type,
Random.rn
),
NumberedSourceRows
AS (
-- Number the source rows, restarting the numbering for each group
SELECT S.row_id,
S.vehicle_type,
S.vehicle_id,
rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)
FROM #Source S
)
-- Fetch the numbered rows that match the random row number, per group
SELECT NSR.row_id,
NSR.vehicle_type,
NSR.vehicle_id
FROM RandomRows RR
JOIN NumberedSourceRows NSR
ON NSR.vehicle_type = RR.vehicle_type
AND NSR.rn = RR.rn;

SET STATISTICS TIME OFF;
GO
IF OBJECT_ID(N'tempdb..#Source', N'U')
IS NOT NULL
DROP TABLE #Source;
GO



Paul



Paul White
SQLPerformance.com
SQLblog.com
@SQL_Kiwi
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)

Group: General Forum Members
Points: 204090 Visits: 41949
Paul White (3/12/2010)
As far as I can see, the previously posted code doesn't meet the requirement to select a minimum of three rows, a maximum of 25, and 10% of the group size otherwise. It is also rather inefficient for larger group sizes, since NEWID() is evaluated for every row, and the entire set must be sorted on that value.

Here is my attempt, which selects the required rows from 100,000 random records (split into 3 groups) in 93ms on my old laptop.


-- Drop our test table if it exists
IF OBJECT_ID(N'tempdb..#Source', N'U')
IS NOT NULL
DROP TABLE #Source;
GO
-- Test table
CREATE TABLE #Source
(
row_id INTEGER IDENTITY PRIMARY KEY,
vehicle_type CHAR(1) NOT NULL,
vehicle_id INTEGER NOT NULL
);
GO
-- Add 100,000 random rows, roughly evenly distributed
-- across vehicle types
INSERT #Source
(
vehicle_type,
vehicle_id
)
SELECT TOP (100000)
CASE
-- Chance of being a car
WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'C'
-- Chance of being a truck
WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'T'
-- Chance of being an RV
ELSE 'R'
END,
-- Random vehicle id
ABS(CHECKSUM(NEWID()))
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3;
GO
-- A helpful index
CREATE INDEX nc1
ON #Source
(vehicle_type ASC, row_id ASC)
INCLUDE (vehicle_id)
WITH (FILLFACTOR = 100, MAXDOP = 1);
GO
-- Show the number of vehicles of each type
SELECT vehicle_type,
row_count = COUNT_BIG(*)
FROM #Source
GROUP BY
vehicle_type
ORDER BY
vehicle_type;
GO
SET STATISTICS TIME ON;

-- Select a random sample of rows from each group
-- Minimum 3 rows, maximum 25, 10% of the group size othewise
WITH SampleSize
AS (
-- Find the total number of vehicles of each type
-- and the sample size to use for each group
-- (3 rows, 25 rows, or 10%)
SELECT vehicle_type,
group_size = COUNT_BIG(*),
sample_size =
CASE
WHEN 0.1 * COUNT_BIG(*) > 25 THEN 25
WHEN 0.1 * COUNT_BIG(*) < 03 THEN 03
ELSE CONVERT(BIGINT, 0.1 * COUNT_BIG(*))
END
FROM #Source
GROUP BY
vehicle_type
),
RandomRows
AS (
-- Generate a random row number for each row required from
-- each group. Maximum rows returned = 25 * group_count
--
-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to
-- force the query plan to calculate the random row number
-- before the final join
SELECT TOP (9223372036854775807)
SS.vehicle_type,
Random.rn
FROM SampleSize SS
CROSS
APPLY (
SELECT TOP (SS.sample_size)
rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), C1.[object_id])) * group_size + 1)
FROM master.sys.columns C1,
master.sys.columns C2
) Random
ORDER BY
SS.vehicle_type,
Random.rn
),
NumberedSourceRows
AS (
-- Number the source rows, restarting the numbering for each group
SELECT S.row_id,
S.vehicle_type,
S.vehicle_id,
rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)
FROM #Source S
)
-- Fetch the numbered rows that match the random row number, per group
SELECT NSR.row_id,
NSR.vehicle_type,
NSR.vehicle_id
FROM RandomRows RR
JOIN NumberedSourceRows NSR
ON NSR.vehicle_type = RR.vehicle_type
AND NSR.rn = RR.rn;

SET STATISTICS TIME OFF;
GO
IF OBJECT_ID(N'tempdb..#Source', N'U')
IS NOT NULL
DROP TABLE #Source;
GO



Paul


Nicely done but...

without exceeding 25 records for all.


--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)

Group: General Forum Members
Points: 204090 Visits: 41949
They key to simplicity (and, therefor, future maintainability) for this problem is that you have to know how many of each vehicle type to return BEFORE you try to select them. Using Paul's test data, the following meets the spirit of the problem definition...

SET STATISTICS TIME ON
;
--===== Declare some variables to hold the counts for each vehicle type.
DECLARE @CountC INT,
@CountR INT,
@CountT INT
;
--===== Randomly assign the counts to be returned with the constraint that
-- no count must be less than 3 and the total count must be 25.
SELECT @CountC = ABS(CHECKSUM(NEWID()))%16+3,
@CountR = ABS(CHECKSUM(NEWID()))%(25-@CountC-6)+3,
@CountT = 25-@CountC-@CountR
;
--===== Make the random vehicle selection based on the counts and return everything
-- as a single randomly ordered result set.
WITH
cteC AS (SELECT TOP (@CountC) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'C' ORDER BY NEWID()),
cteR AS (SELECT TOP (@CountR) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'R' ORDER BY NEWID()),
cteT AS (SELECT TOP (@CountT) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'T' ORDER BY NEWID()),
cteALL AS (SELECT Vehicle_Type, Vehicle_ID FROM cteC UNION ALL
SELECT Vehicle_Type, Vehicle_ID FROM cteR UNION ALL
SELECT Vehicle_Type, Vehicle_ID FROM cteT)
SELECT Vehicle_Type, Vehicle_ID
FROM cteAll
ORDER BY NEWID()
;
SET STATISTICS TIME OFF
;



In the formulas...
The number "25" is the maximum number of rows to return for all categories combined.
The number "16" is the number 25 minus 9 (the minimum number of rows (3) to return for each category times the number of categories (3)).
The number "3" is the minimum number of rows to return for each category.
The number "6" is minimum number of rows (3) you have to return for the 2 remaining categores.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Paul White
Paul White
SSC-Dedicated
SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)

Group: General Forum Members
Points: 33724 Visits: 11359
Jeff Moden (3/13/2010)
Nicely done but...
without exceeding 25 records for all.

Well that is just embarrassing. Blush

Your solution has some nice features, particularly in the calculation of counts.

It does raise a question, though: in which order should we allocate?
You have chosen 'cars', then 'trucks', then 'RVs' - but that seems a bit arbitrary.

My second point concerns performance.
The code performs three partial scans of the index, and three sorts.
CPU time on my machine: 188ms - averaged over ten runs.

Combining the two approaches reduces the average time to 94ms:


SET STATISTICS TIME ON;

--===== Declare some variables to hold the counts for each vehicle type.
DECLARE @CountC INT,
@CountR INT,
@CountT INT;

--===== Randomly assign the counts to be returned with the constraint that
-- no count must be less than 3 and the total count must be 25.
SELECT @CountC = ABS(CHECKSUM(NEWID()))%16+3,
@CountR = ABS(CHECKSUM(NEWID()))%(25-@CountC-6)+3,
@CountT = 25-@CountC-@CountR;

WITH SampleSize
AS (
--Find the total number of vehicles of each type
--and the sample size to use for each group
--(3 rows, 25 rows, or 10%)
SELECT vehicle_type,
group_size = COUNT_BIG(*),
sample_size = @CountC
FROM #Source
WHERE vehicle_type = 'C'
GROUP BY
vehicle_type
UNION ALL
SELECT vehicle_type,
group_size = COUNT_BIG(*),
sample_size = @CountT
FROM #Source
WHERE vehicle_type = 'T'
GROUP BY
vehicle_type
UNION ALL
SELECT vehicle_type,
group_size = COUNT_BIG(*),
sample_size = @CountR
FROM #Source
WHERE vehicle_type = 'R'
GROUP BY
vehicle_type
),
RandomRows
AS (
-- Generate a random row number for each row required from
-- each group. Maximum rows returned = 25 * group_count
--
-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to
-- force the query plan to calculate the random row number
-- before the final join
SELECT TOP (9223372036854775807)
SS.vehicle_type,
Random.rn
FROM SampleSize SS
CROSS
APPLY (
SELECT TOP (SS.sample_size)
rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), N.n)) * SS.group_size + 1)
FROM dbo.Number N
ORDER BY
N.n ASC
) Random
ORDER BY
SS.vehicle_type,
Random.rn
),
NumberedSourceRows
AS (
-- Number the source rows, restarting the numbering for each group
SELECT S.row_id,
S.vehicle_type,
S.vehicle_id,
rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)
FROM #Source S
)
-- Fetch the numbered rows that match the random row number, per group
SELECT NSR.row_id,
NSR.vehicle_type,
NSR.vehicle_id
FROM RandomRows RR
JOIN NumberedSourceRows NSR
ON NSR.vehicle_type = RR.vehicle_type
AND NSR.rn = RR.rn;

SET STATISTICS TIME OFF;



For what it's worth, the minimum-3 maximum-25 from each group problem was much more fun!

Paul



Paul White
SQLPerformance.com
SQLblog.com
@SQL_Kiwi
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)

Group: General Forum Members
Points: 204090 Visits: 41949
Paul White (3/13/2010)
It does raise a question, though: in which order should we allocate?
You have chosen 'cars', then 'trucks', then 'RVs' - but that seems a bit arbitrary.


I initially thought of that, too. But I believe that there will be a nice balance to the 3 insofar as which will have the largest numbers... the first random selection has just as much a chance of being very low as it does very high. Same goes for the others. During some very minor testing, for example, it did produce 3,3,16 a couple of times. I've not done an actual "distribution" test yet to see if it's more random than not, but mathematically, it would seem so. Guess I'll have to do a test because, heh... one test is worth a thousand "expert opinions" (especially mine :-P).



My second point concerns performance.
The code performs three partial scans of the index, and three sorts.
CPU time on my machine: 188ms - averaged over ten runs.

Combining the two approaches reduces the average time to 94ms:


I was thinking the same thing afterwards. Considering that I just did a review on your Cross Apply article, the notion struck me that your method and mine could be combined for the best of both worlds, but you beat me to it. You're much better at the use of Cross Apply and it seems to just come natural while I still have to think about it. Blush It'll become more natural for me as I use it more, but you're still light years ahead of me on its usage.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)SSC Guru (204K reputation)

Group: General Forum Members
Points: 204090 Visits: 41949
Holy Moly! The distribution of "R" and "T" suck with my method! I don't quite understand it because the distribution of "C" is nearly flat as you would expect with random numbers. Here's the test code if you'd like to play...

WITH
cteC AS
( --=== Determine "C"
SELECT CountC = ABS(CHECKSUM(NEWID()))%16+3
FROM dbo.Tally t
WHERE t.N <= 10000
),
cteR AS
( --=== Determine "R" constrained by "C"
SELECT CountC,
CountR = ABS(CHECKSUM(NEWID()))%(25-CountC-6)+3
FROM cteC
),
cteT AS
( --=== Determine "T" constrained by "C" and "R"
SELECT CountC,
CountR,
CountT = 25-CountC-CountR
FROM cteR
) --=== Unpivot the 3 columns of data and do counts of occurances for each Count"x" value.
SELECT unpvt.CountName, unpvt.CountX, COUNT(*) AS Occurances
FROM (
SELECT CountC,CountR,CountT
FROM cteT
) pvt
UNPIVOT (
CountX FOR CountName IN (CountC,CountR,CountT)
)AS unpvt
GROUP BY unpvt.CountName, unpvt.CountX
ORDER BY unpvt.CountName, unpvt.CountX




--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search