Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 1234»»»

select random records but for each category Expand / Collapse
Author
Message
Posted Thursday, March 11, 2010 1:19 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Tuesday, July 23, 2013 11:21 AM
Points: 176, Visits: 346
I have a request to pull out random records from one table but atleast 3 records from each category without exceeding 25 records for all.

I know how to pull out the random records, but how can I pull out 3 random for each category?

example:
table1 has 100 records
3 categories, (car, truck, RV)
I need to pull out car records, truck records, and RV records, all random and no more then 25 in total and no less then 3 from each category,

is something like this possible with using 1 table like I have? I'm currently pulling a random record set from this table, but now I have to break it down by categories

this is the query i'm using to pull out the top 10 percent, no more then 25 and I'm updated an isSelected Field so its not picked again. How can I take this one step further and bring back at least 3 for each category?


SET ROWCOUNT 25
INSERT INTO @SelectedId
SELECT TOP 10 percent cId
FROM CarInformation
where salesman= @SalesPerson and isSelected is Null and DateClosed is NOT NULL and [quarter] = @Qtr and [Year] = @Year

ORDER BY NEWID()

UPDATE CarInformation
SET isSelected = 1
FROM CarInformation AS F
INNER JOIN @SelectedId AS S
ON F.cId = S.cId
SELECT * FROM CarInformation AS F
INNER JOIN @SelectedId AS S ON F.clmId = S.cId
Post #881301
Posted Thursday, March 11, 2010 3:01 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 5:45 AM
Points: 39,981, Visits: 36,349
SELECT <column list> FROM (
SELECT <column list>, ROW_NUMBER() OVER (Partition By Category Order By (SELECT NEWID())) AS RowNo
FROM Table1) sub
WHERE rowNo <=3




Gail Shaw
Microsoft Certified Master: SQL Server 2008, MVP
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Post #881367
Posted Thursday, March 11, 2010 9:44 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 11:18 PM
Points: 35,267, Visits: 31,759
Hmmmm.... this problem is coming up more and more... must be a new homework problem by a given teacher in some school or maybe a favorite "take home" interview question by a certain DBA.

--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #881481
Posted Friday, March 12, 2010 5:04 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Tuesday, July 23, 2013 11:21 AM
Points: 176, Visits: 346
nope no homework, no interview questions, its an actualy business requirement for this application that was once an access app and everything was done manually, now i'm automating it and I need to duplicate the selecting of random records, where before a user went in and select 5 from one, 5 from another and so on for a total of 25, now I need to automate that portion of the new tool.
Post #881653
Posted Friday, March 12, 2010 9:29 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Monday, September 29, 2014 10:07 PM
Points: 9,926, Visits: 11,183
As far as I can see, the previously posted code doesn't meet the requirement to select a minimum of three rows, a maximum of 25, and 10% of the group size otherwise. It is also rather inefficient for larger group sizes, since NEWID() is evaluated for every row, and the entire set must be sorted on that value.

Here is my attempt, which selects the required rows from 100,000 random records (split into 3 groups) in 93ms on my old laptop.

-- Drop our test table if it exists
IF OBJECT_ID(N'tempdb..#Source', N'U')
IS NOT NULL
DROP TABLE #Source;
GO
-- Test table
CREATE TABLE #Source
(
row_id INTEGER IDENTITY PRIMARY KEY,
vehicle_type CHAR(1) NOT NULL,
vehicle_id INTEGER NOT NULL
);
GO
-- Add 100,000 random rows, roughly evenly distributed
-- across vehicle types
INSERT #Source
(
vehicle_type,
vehicle_id
)
SELECT TOP (100000)
CASE
-- Chance of being a car
WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'C'
-- Chance of being a truck
WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'T'
-- Chance of being an RV
ELSE 'R'
END,
-- Random vehicle id
ABS(CHECKSUM(NEWID()))
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3;
GO
-- A helpful index
CREATE INDEX nc1
ON #Source
(vehicle_type ASC, row_id ASC)
INCLUDE (vehicle_id)
WITH (FILLFACTOR = 100, MAXDOP = 1);
GO
-- Show the number of vehicles of each type
SELECT vehicle_type,
row_count = COUNT_BIG(*)
FROM #Source
GROUP BY
vehicle_type
ORDER BY
vehicle_type;
GO
SET STATISTICS TIME ON;

-- Select a random sample of rows from each group
-- Minimum 3 rows, maximum 25, 10% of the group size othewise
WITH SampleSize
AS (
-- Find the total number of vehicles of each type
-- and the sample size to use for each group
-- (3 rows, 25 rows, or 10%)
SELECT vehicle_type,
group_size = COUNT_BIG(*),
sample_size =
CASE
WHEN 0.1 * COUNT_BIG(*) > 25 THEN 25
WHEN 0.1 * COUNT_BIG(*) < 03 THEN 03
ELSE CONVERT(BIGINT, 0.1 * COUNT_BIG(*))
END
FROM #Source
GROUP BY
vehicle_type
),
RandomRows
AS (
-- Generate a random row number for each row required from
-- each group. Maximum rows returned = 25 * group_count
--
-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to
-- force the query plan to calculate the random row number
-- before the final join
SELECT TOP (9223372036854775807)
SS.vehicle_type,
Random.rn
FROM SampleSize SS
CROSS
APPLY (
SELECT TOP (SS.sample_size)
rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), C1.[object_id])) * group_size + 1)
FROM master.sys.columns C1,
master.sys.columns C2
) Random
ORDER BY
SS.vehicle_type,
Random.rn
),
NumberedSourceRows
AS (
-- Number the source rows, restarting the numbering for each group
SELECT S.row_id,
S.vehicle_type,
S.vehicle_id,
rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)
FROM #Source S
)
-- Fetch the numbered rows that match the random row number, per group
SELECT NSR.row_id,
NSR.vehicle_type,
NSR.vehicle_id
FROM RandomRows RR
JOIN NumberedSourceRows NSR
ON NSR.vehicle_type = RR.vehicle_type
AND NSR.rn = RR.rn;

SET STATISTICS TIME OFF;
GO
IF OBJECT_ID(N'tempdb..#Source', N'U')
IS NOT NULL
DROP TABLE #Source;
GO

Paul




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #882261
Posted Saturday, March 13, 2010 8:58 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 11:18 PM
Points: 35,267, Visits: 31,759
Paul White (3/12/2010)
As far as I can see, the previously posted code doesn't meet the requirement to select a minimum of three rows, a maximum of 25, and 10% of the group size otherwise. It is also rather inefficient for larger group sizes, since NEWID() is evaluated for every row, and the entire set must be sorted on that value.

Here is my attempt, which selects the required rows from 100,000 random records (split into 3 groups) in 93ms on my old laptop.

-- Drop our test table if it exists
IF OBJECT_ID(N'tempdb..#Source', N'U')
IS NOT NULL
DROP TABLE #Source;
GO
-- Test table
CREATE TABLE #Source
(
row_id INTEGER IDENTITY PRIMARY KEY,
vehicle_type CHAR(1) NOT NULL,
vehicle_id INTEGER NOT NULL
);
GO
-- Add 100,000 random rows, roughly evenly distributed
-- across vehicle types
INSERT #Source
(
vehicle_type,
vehicle_id
)
SELECT TOP (100000)
CASE
-- Chance of being a car
WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'C'
-- Chance of being a truck
WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'T'
-- Chance of being an RV
ELSE 'R'
END,
-- Random vehicle id
ABS(CHECKSUM(NEWID()))
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3;
GO
-- A helpful index
CREATE INDEX nc1
ON #Source
(vehicle_type ASC, row_id ASC)
INCLUDE (vehicle_id)
WITH (FILLFACTOR = 100, MAXDOP = 1);
GO
-- Show the number of vehicles of each type
SELECT vehicle_type,
row_count = COUNT_BIG(*)
FROM #Source
GROUP BY
vehicle_type
ORDER BY
vehicle_type;
GO
SET STATISTICS TIME ON;

-- Select a random sample of rows from each group
-- Minimum 3 rows, maximum 25, 10% of the group size othewise
WITH SampleSize
AS (
-- Find the total number of vehicles of each type
-- and the sample size to use for each group
-- (3 rows, 25 rows, or 10%)
SELECT vehicle_type,
group_size = COUNT_BIG(*),
sample_size =
CASE
WHEN 0.1 * COUNT_BIG(*) > 25 THEN 25
WHEN 0.1 * COUNT_BIG(*) < 03 THEN 03
ELSE CONVERT(BIGINT, 0.1 * COUNT_BIG(*))
END
FROM #Source
GROUP BY
vehicle_type
),
RandomRows
AS (
-- Generate a random row number for each row required from
-- each group. Maximum rows returned = 25 * group_count
--
-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to
-- force the query plan to calculate the random row number
-- before the final join
SELECT TOP (9223372036854775807)
SS.vehicle_type,
Random.rn
FROM SampleSize SS
CROSS
APPLY (
SELECT TOP (SS.sample_size)
rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), C1.[object_id])) * group_size + 1)
FROM master.sys.columns C1,
master.sys.columns C2
) Random
ORDER BY
SS.vehicle_type,
Random.rn
),
NumberedSourceRows
AS (
-- Number the source rows, restarting the numbering for each group
SELECT S.row_id,
S.vehicle_type,
S.vehicle_id,
rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)
FROM #Source S
)
-- Fetch the numbered rows that match the random row number, per group
SELECT NSR.row_id,
NSR.vehicle_type,
NSR.vehicle_id
FROM RandomRows RR
JOIN NumberedSourceRows NSR
ON NSR.vehicle_type = RR.vehicle_type
AND NSR.rn = RR.rn;

SET STATISTICS TIME OFF;
GO
IF OBJECT_ID(N'tempdb..#Source', N'U')
IS NOT NULL
DROP TABLE #Source;
GO

Paul


Nicely done but...

without exceeding 25 records for all.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #882368
Posted Saturday, March 13, 2010 9:45 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 11:18 PM
Points: 35,267, Visits: 31,759
They key to simplicity (and, therefor, future maintainability) for this problem is that you have to know how many of each vehicle type to return BEFORE you try to select them. Using Paul's test data, the following meets the spirit of the problem definition...

SET STATISTICS TIME ON
;
--===== Declare some variables to hold the counts for each vehicle type.
DECLARE @CountC INT,
@CountR INT,
@CountT INT
;
--===== Randomly assign the counts to be returned with the constraint that
-- no count must be less than 3 and the total count must be 25.
SELECT @CountC = ABS(CHECKSUM(NEWID()))%16+3,
@CountR = ABS(CHECKSUM(NEWID()))%(25-@CountC-6)+3,
@CountT = 25-@CountC-@CountR
;
--===== Make the random vehicle selection based on the counts and return everything
-- as a single randomly ordered result set.
WITH
cteC AS (SELECT TOP (@CountC) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'C' ORDER BY NEWID()),
cteR AS (SELECT TOP (@CountR) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'R' ORDER BY NEWID()),
cteT AS (SELECT TOP (@CountT) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'T' ORDER BY NEWID()),
cteALL AS (SELECT Vehicle_Type, Vehicle_ID FROM cteC UNION ALL
SELECT Vehicle_Type, Vehicle_ID FROM cteR UNION ALL
SELECT Vehicle_Type, Vehicle_ID FROM cteT)
SELECT Vehicle_Type, Vehicle_ID
FROM cteAll
ORDER BY NEWID()
;
SET STATISTICS TIME OFF
;


In the formulas...
The number "25" is the maximum number of rows to return for all categories combined.
The number "16" is the number 25 minus 9 (the minimum number of rows (3) to return for each category times the number of categories (3)).
The number "3" is the minimum number of rows to return for each category.
The number "6" is minimum number of rows (3) you have to return for the 2 remaining categores.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #882378
Posted Saturday, March 13, 2010 8:33 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Monday, September 29, 2014 10:07 PM
Points: 9,926, Visits: 11,183
Jeff Moden (3/13/2010)
Nicely done but...
without exceeding 25 records for all.

Well that is just embarrassing.

Your solution has some nice features, particularly in the calculation of counts.

It does raise a question, though: in which order should we allocate?
You have chosen 'cars', then 'trucks', then 'RVs' - but that seems a bit arbitrary.

My second point concerns performance.
The code performs three partial scans of the index, and three sorts.
CPU time on my machine: 188ms - averaged over ten runs.

Combining the two approaches reduces the average time to 94ms:

SET STATISTICS TIME ON;

--===== Declare some variables to hold the counts for each vehicle type.
DECLARE @CountC INT,
@CountR INT,
@CountT INT;

--===== Randomly assign the counts to be returned with the constraint that
-- no count must be less than 3 and the total count must be 25.
SELECT @CountC = ABS(CHECKSUM(NEWID()))%16+3,
@CountR = ABS(CHECKSUM(NEWID()))%(25-@CountC-6)+3,
@CountT = 25-@CountC-@CountR;

WITH SampleSize
AS (
--Find the total number of vehicles of each type
--and the sample size to use for each group
--(3 rows, 25 rows, or 10%)
SELECT vehicle_type,
group_size = COUNT_BIG(*),
sample_size = @CountC
FROM #Source
WHERE vehicle_type = 'C'
GROUP BY
vehicle_type
UNION ALL
SELECT vehicle_type,
group_size = COUNT_BIG(*),
sample_size = @CountT
FROM #Source
WHERE vehicle_type = 'T'
GROUP BY
vehicle_type
UNION ALL
SELECT vehicle_type,
group_size = COUNT_BIG(*),
sample_size = @CountR
FROM #Source
WHERE vehicle_type = 'R'
GROUP BY
vehicle_type
),
RandomRows
AS (
-- Generate a random row number for each row required from
-- each group. Maximum rows returned = 25 * group_count
--
-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to
-- force the query plan to calculate the random row number
-- before the final join
SELECT TOP (9223372036854775807)
SS.vehicle_type,
Random.rn
FROM SampleSize SS
CROSS
APPLY (
SELECT TOP (SS.sample_size)
rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), N.n)) * SS.group_size + 1)
FROM dbo.Number N
ORDER BY
N.n ASC
) Random
ORDER BY
SS.vehicle_type,
Random.rn
),
NumberedSourceRows
AS (
-- Number the source rows, restarting the numbering for each group
SELECT S.row_id,
S.vehicle_type,
S.vehicle_id,
rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)
FROM #Source S
)
-- Fetch the numbered rows that match the random row number, per group
SELECT NSR.row_id,
NSR.vehicle_type,
NSR.vehicle_id
FROM RandomRows RR
JOIN NumberedSourceRows NSR
ON NSR.vehicle_type = RR.vehicle_type
AND NSR.rn = RR.rn;

SET STATISTICS TIME OFF;

For what it's worth, the minimum-3 maximum-25 from each group problem was much more fun!

Paul




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #882465
Posted Sunday, March 14, 2010 10:34 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 11:18 PM
Points: 35,267, Visits: 31,759
Paul White (3/13/2010)
It does raise a question, though: in which order should we allocate?
You have chosen 'cars', then 'trucks', then 'RVs' - but that seems a bit arbitrary.


I initially thought of that, too. But I believe that there will be a nice balance to the 3 insofar as which will have the largest numbers... the first random selection has just as much a chance of being very low as it does very high. Same goes for the others. During some very minor testing, for example, it did produce 3,3,16 a couple of times. I've not done an actual "distribution" test yet to see if it's more random than not, but mathematically, it would seem so. Guess I'll have to do a test because, heh... one test is worth a thousand "expert opinions" (especially mine ).



My second point concerns performance.
The code performs three partial scans of the index, and three sorts.
CPU time on my machine: 188ms - averaged over ten runs.

Combining the two approaches reduces the average time to 94ms:


I was thinking the same thing afterwards. Considering that I just did a review on your Cross Apply article, the notion struck me that your method and mine could be combined for the best of both worlds, but you beat me to it. You're much better at the use of Cross Apply and it seems to just come natural while I still have to think about it. It'll become more natural for me as I use it more, but you're still light years ahead of me on its usage.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #882587
Posted Sunday, March 14, 2010 11:29 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 11:18 PM
Points: 35,267, Visits: 31,759
Holy Moly! The distribution of "R" and "T" suck with my method! I don't quite understand it because the distribution of "C" is nearly flat as you would expect with random numbers. Here's the test code if you'd like to play...

WITH
cteC AS
( --=== Determine "C"
SELECT CountC = ABS(CHECKSUM(NEWID()))%16+3
FROM dbo.Tally t
WHERE t.N <= 10000
),
cteR AS
( --=== Determine "R" constrained by "C"
SELECT CountC,
CountR = ABS(CHECKSUM(NEWID()))%(25-CountC-6)+3
FROM cteC
),
cteT AS
( --=== Determine "T" constrained by "C" and "R"
SELECT CountC,
CountR,
CountT = 25-CountC-CountR
FROM cteR
) --=== Unpivot the 3 columns of data and do counts of occurances for each Count"x" value.
SELECT unpvt.CountName, unpvt.CountX, COUNT(*) AS Occurances
FROM (
SELECT CountC,CountR,CountT
FROM cteT
) pvt
UNPIVOT (
CountX FOR CountName IN (CountC,CountR,CountT)
)AS unpvt
GROUP BY unpvt.CountName, unpvt.CountX
ORDER BY unpvt.CountName, unpvt.CountX




--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #882599
« Prev Topic | Next Topic »

Add to briefcase 1234»»»

Permissions Expand / Collapse