

SSCEnthusiastic
Group: General Forum Members
Last Login: Monday, June 22, 2015 9:10 AM
Points: 184,
Visits: 364


I have a request to pull out random records from one table but atleast 3 records from each category without exceeding 25 records for all.
I know how to pull out the random records, but how can I pull out 3 random for each category?
example: table1 has 100 records 3 categories, (car, truck, RV) I need to pull out car records, truck records, and RV records, all random and no more then 25 in total and no less then 3 from each category,
is something like this possible with using 1 table like I have? I'm currently pulling a random record set from this table, but now I have to break it down by categories
this is the query i'm using to pull out the top 10 percent, no more then 25 and I'm updated an isSelected Field so its not picked again. How can I take this one step further and bring back at least 3 for each category?
SET ROWCOUNT 25 INSERT INTO @SelectedId SELECT TOP 10 percent cId FROM CarInformation where salesman= @SalesPerson and isSelected is Null and DateClosed is NOT NULL and [quarter] = @Qtr and [Year] = @Year ORDER BY NEWID()
UPDATE CarInformation SET isSelected = 1 FROM CarInformation AS F INNER JOIN @SelectedId AS S ON F.cId = S.cId SELECT * FROM CarInformation AS F INNER JOIN @SelectedId AS S ON F.clmId = S.cId




SSCForever
Group: General Forum Members
Last Login: Saturday, December 3, 2016 5:18 AM
Points: 45,619,
Visits: 44,147


SELECT <column list> FROM ( SELECT <column list>, ROW_NUMBER() OVER (Partition By Category Order By (SELECT NEWID())) AS RowNo FROM Table1) sub WHERE rowNo <=3
Gail Shaw Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci) SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
We walk in the dark places no others will enter We stand on the bridge and no one may pass




SSCForever
Group: General Forum Members
Last Login: Yesterday @ 9:30 PM
Points: 42,060,
Visits: 39,443


Hmmmm.... this problem is coming up more and more... must be a new homework problem by a given teacher in some school or maybe a favorite "take home" interview question by a certain DBA.
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems




SSCEnthusiastic
Group: General Forum Members
Last Login: Monday, June 22, 2015 9:10 AM
Points: 184,
Visits: 364


nope no homework, no interview questions, its an actualy business requirement for this application that was once an access app and everything was done manually, now i'm automating it and I need to duplicate the selecting of random records, where before a user went in and select 5 from one, 5 from another and so on for a total of 25, now I need to automate that portion of the new tool.




SSCrazy Eights
Group: General Forum Members
Last Login: Friday, November 11, 2016 6:39 AM
Points: 9,932,
Visits: 11,346


As far as I can see, the previously posted code doesn't meet the requirement to select a minimum of three rows, a maximum of 25, and 10% of the group size otherwise. It is also rather inefficient for larger group sizes, since NEWID() is evaluated for every row, and the entire set must be sorted on that value.
Here is my attempt, which selects the required rows from 100,000 random records (split into 3 groups) in 93ms on my old laptop.
 Drop our test table if it exists IF OBJECT_ID(N'tempdb..#Source', N'U') IS NOT NULL DROP TABLE #Source; GO  Test table CREATE TABLE #Source ( row_id INTEGER IDENTITY PRIMARY KEY, vehicle_type CHAR(1) NOT NULL, vehicle_id INTEGER NOT NULL ); GO  Add 100,000 random rows, roughly evenly distributed  across vehicle types INSERT #Source ( vehicle_type, vehicle_id ) SELECT TOP (100000) CASE  Chance of being a car WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'C'  Chance of being a truck WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'T'  Chance of being an RV ELSE 'R' END,  Random vehicle id ABS(CHECKSUM(NEWID())) FROM master.sys.columns C1, master.sys.columns C2, master.sys.columns C3; GO  A helpful index CREATE INDEX nc1 ON #Source (vehicle_type ASC, row_id ASC) INCLUDE (vehicle_id) WITH (FILLFACTOR = 100, MAXDOP = 1); GO  Show the number of vehicles of each type SELECT vehicle_type, row_count = COUNT_BIG(*) FROM #Source GROUP BY vehicle_type ORDER BY vehicle_type; GO SET STATISTICS TIME ON;
 Select a random sample of rows from each group  Minimum 3 rows, maximum 25, 10% of the group size othewise WITH SampleSize AS (  Find the total number of vehicles of each type  and the sample size to use for each group  (3 rows, 25 rows, or 10%) SELECT vehicle_type, group_size = COUNT_BIG(*), sample_size = CASE WHEN 0.1 * COUNT_BIG(*) > 25 THEN 25 WHEN 0.1 * COUNT_BIG(*) < 03 THEN 03 ELSE CONVERT(BIGINT, 0.1 * COUNT_BIG(*)) END FROM #Source GROUP BY vehicle_type ), RandomRows AS (  Generate a random row number for each row required from  each group. Maximum rows returned = 25 * group_count   The TOP (9223372036854775807 (= BIGINT.Max)) is just to  force the query plan to calculate the random row number  before the final join SELECT TOP (9223372036854775807) SS.vehicle_type, Random.rn FROM SampleSize SS CROSS APPLY ( SELECT TOP (SS.sample_size) rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), C1.[object_id])) * group_size + 1) FROM master.sys.columns C1, master.sys.columns C2 ) Random ORDER BY SS.vehicle_type, Random.rn ), NumberedSourceRows AS (  Number the source rows, restarting the numbering for each group SELECT S.row_id, S.vehicle_type, S.vehicle_id, rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC) FROM #Source S )  Fetch the numbered rows that match the random row number, per group SELECT NSR.row_id, NSR.vehicle_type, NSR.vehicle_id FROM RandomRows RR JOIN NumberedSourceRows NSR ON NSR.vehicle_type = RR.vehicle_type AND NSR.rn = RR.rn;
SET STATISTICS TIME OFF; GO IF OBJECT_ID(N'tempdb..#Source', N'U') IS NOT NULL DROP TABLE #Source; GO
Paul
Paul White SQLPerformance.com SQLblog.com @SQL_Kiwi




SSCForever
Group: General Forum Members
Last Login: Yesterday @ 9:30 PM
Points: 42,060,
Visits: 39,443


Paul White (3/12/2010)
As far as I can see, the previously posted code doesn't meet the requirement to select a minimum of three rows, a maximum of 25, and 10% of the group size otherwise. It is also rather inefficient for larger group sizes, since NEWID() is evaluated for every row, and the entire set must be sorted on that value. Here is my attempt, which selects the required rows from 100,000 random records (split into 3 groups) in 93ms on my old laptop.  Drop our test table if it exists IF OBJECT_ID(N'tempdb..#Source', N'U') IS NOT NULL DROP TABLE #Source; GO  Test table CREATE TABLE #Source ( row_id INTEGER IDENTITY PRIMARY KEY, vehicle_type CHAR(1) NOT NULL, vehicle_id INTEGER NOT NULL ); GO  Add 100,000 random rows, roughly evenly distributed  across vehicle types INSERT #Source ( vehicle_type, vehicle_id ) SELECT TOP (100000) CASE  Chance of being a car WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'C'  Chance of being a truck WHEN RAND(CHECKSUM(NEWID(), C1.[object_id])) <= 0.3333 THEN 'T'  Chance of being an RV ELSE 'R' END,  Random vehicle id ABS(CHECKSUM(NEWID())) FROM master.sys.columns C1, master.sys.columns C2, master.sys.columns C3; GO  A helpful index CREATE INDEX nc1 ON #Source (vehicle_type ASC, row_id ASC) INCLUDE (vehicle_id) WITH (FILLFACTOR = 100, MAXDOP = 1); GO  Show the number of vehicles of each type SELECT vehicle_type, row_count = COUNT_BIG(*) FROM #Source GROUP BY vehicle_type ORDER BY vehicle_type; GO SET STATISTICS TIME ON;
 Select a random sample of rows from each group  Minimum 3 rows, maximum 25, 10% of the group size othewise WITH SampleSize AS (  Find the total number of vehicles of each type  and the sample size to use for each group  (3 rows, 25 rows, or 10%) SELECT vehicle_type, group_size = COUNT_BIG(*), sample_size = CASE WHEN 0.1 * COUNT_BIG(*) > 25 THEN 25 WHEN 0.1 * COUNT_BIG(*) < 03 THEN 03 ELSE CONVERT(BIGINT, 0.1 * COUNT_BIG(*)) END FROM #Source GROUP BY vehicle_type ), RandomRows AS (  Generate a random row number for each row required from  each group. Maximum rows returned = 25 * group_count   The TOP (9223372036854775807 (= BIGINT.Max)) is just to  force the query plan to calculate the random row number  before the final join SELECT TOP (9223372036854775807) SS.vehicle_type, Random.rn FROM SampleSize SS CROSS APPLY ( SELECT TOP (SS.sample_size) rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), C1.[object_id])) * group_size + 1) FROM master.sys.columns C1, master.sys.columns C2 ) Random ORDER BY SS.vehicle_type, Random.rn ), NumberedSourceRows AS (  Number the source rows, restarting the numbering for each group SELECT S.row_id, S.vehicle_type, S.vehicle_id, rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC) FROM #Source S )  Fetch the numbered rows that match the random row number, per group SELECT NSR.row_id, NSR.vehicle_type, NSR.vehicle_id FROM RandomRows RR JOIN NumberedSourceRows NSR ON NSR.vehicle_type = RR.vehicle_type AND NSR.rn = RR.rn;
SET STATISTICS TIME OFF; GO IF OBJECT_ID(N'tempdb..#Source', N'U') IS NOT NULL DROP TABLE #Source; GO
Paul
Nicely done but...
without exceeding 25 records for all.
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems




SSCForever
Group: General Forum Members
Last Login: Yesterday @ 9:30 PM
Points: 42,060,
Visits: 39,443


They key to simplicity (and, therefor, future maintainability) for this problem is that you have to know how many of each vehicle type to return BEFORE you try to select them. Using Paul's test data, the following meets the spirit of the problem definition...
SET STATISTICS TIME ON ; ===== Declare some variables to hold the counts for each vehicle type. DECLARE @CountC INT, @CountR INT, @CountT INT ; ===== Randomly assign the counts to be returned with the constraint that  no count must be less than 3 and the total count must be 25. SELECT @CountC = ABS(CHECKSUM(NEWID()))%16+3, @CountR = ABS(CHECKSUM(NEWID()))%(25@CountC6)+3, @CountT = 25@CountC@CountR ; ===== Make the random vehicle selection based on the counts and return everything  as a single randomly ordered result set. WITH cteC AS (SELECT TOP (@CountC) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'C' ORDER BY NEWID()), cteR AS (SELECT TOP (@CountR) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'R' ORDER BY NEWID()), cteT AS (SELECT TOP (@CountT) Vehicle_Type, Vehicle_ID FROM #Source WHERE vehicle_type = 'T' ORDER BY NEWID()), cteALL AS (SELECT Vehicle_Type, Vehicle_ID FROM cteC UNION ALL SELECT Vehicle_Type, Vehicle_ID FROM cteR UNION ALL SELECT Vehicle_Type, Vehicle_ID FROM cteT) SELECT Vehicle_Type, Vehicle_ID FROM cteAll ORDER BY NEWID() ; SET STATISTICS TIME OFF ;
In the formulas... The number "25" is the maximum number of rows to return for all categories combined. The number "16" is the number 25 minus 9 (the minimum number of rows (3) to return for each category times the number of categories (3)). The number "3" is the minimum number of rows to return for each category. The number "6" is minimum number of rows (3) you have to return for the 2 remaining categores.
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems




SSCrazy Eights
Group: General Forum Members
Last Login: Friday, November 11, 2016 6:39 AM
Points: 9,932,
Visits: 11,346


Jeff Moden (3/13/2010)
Nicely done but... without exceeding 25 records for all. Well that is just embarrassing.
Your solution has some nice features, particularly in the calculation of counts.
It does raise a question, though: in which order should we allocate? You have chosen 'cars', then 'trucks', then 'RVs'  but that seems a bit arbitrary.
My second point concerns performance. The code performs three partial scans of the index, and three sorts. CPU time on my machine: 188ms  averaged over ten runs.
Combining the two approaches reduces the average time to 94ms:
SET STATISTICS TIME ON;
===== Declare some variables to hold the counts for each vehicle type. DECLARE @CountC INT, @CountR INT, @CountT INT; ===== Randomly assign the counts to be returned with the constraint that  no count must be less than 3 and the total count must be 25. SELECT @CountC = ABS(CHECKSUM(NEWID()))%16+3, @CountR = ABS(CHECKSUM(NEWID()))%(25@CountC6)+3, @CountT = 25@CountC@CountR;
WITH SampleSize AS ( Find the total number of vehicles of each type and the sample size to use for each group (3 rows, 25 rows, or 10%) SELECT vehicle_type, group_size = COUNT_BIG(*), sample_size = @CountC FROM #Source WHERE vehicle_type = 'C' GROUP BY vehicle_type UNION ALL SELECT vehicle_type, group_size = COUNT_BIG(*), sample_size = @CountT FROM #Source WHERE vehicle_type = 'T' GROUP BY vehicle_type UNION ALL SELECT vehicle_type, group_size = COUNT_BIG(*), sample_size = @CountR FROM #Source WHERE vehicle_type = 'R' GROUP BY vehicle_type ), RandomRows AS (  Generate a random row number for each row required from  each group. Maximum rows returned = 25 * group_count   The TOP (9223372036854775807 (= BIGINT.Max)) is just to  force the query plan to calculate the random row number  before the final join SELECT TOP (9223372036854775807) SS.vehicle_type, Random.rn FROM SampleSize SS CROSS APPLY ( SELECT TOP (SS.sample_size) rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), N.n)) * SS.group_size + 1) FROM dbo.Number N ORDER BY N.n ASC ) Random ORDER BY SS.vehicle_type, Random.rn ), NumberedSourceRows AS (  Number the source rows, restarting the numbering for each group SELECT S.row_id, S.vehicle_type, S.vehicle_id, rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC) FROM #Source S )  Fetch the numbered rows that match the random row number, per group SELECT NSR.row_id, NSR.vehicle_type, NSR.vehicle_id FROM RandomRows RR JOIN NumberedSourceRows NSR ON NSR.vehicle_type = RR.vehicle_type AND NSR.rn = RR.rn;
SET STATISTICS TIME OFF;
For what it's worth, the minimum3 maximum25 from each group problem was much more fun!
Paul
Paul White SQLPerformance.com SQLblog.com @SQL_Kiwi




SSCForever
Group: General Forum Members
Last Login: Yesterday @ 9:30 PM
Points: 42,060,
Visits: 39,443


Paul White (3/13/2010) It does raise a question, though: in which order should we allocate? You have chosen 'cars', then 'trucks', then 'RVs'  but that seems a bit arbitrary.
I initially thought of that, too. But I believe that there will be a nice balance to the 3 insofar as which will have the largest numbers... the first random selection has just as much a chance of being very low as it does very high. Same goes for the others. During some very minor testing, for example, it did produce 3,3,16 a couple of times. I've not done an actual "distribution" test yet to see if it's more random than not, but mathematically, it would seem so. Guess I'll have to do a test because, heh... one test is worth a thousand "expert opinions" (especially mine ).
My second point concerns performance. The code performs three partial scans of the index, and three sorts. CPU time on my machine: 188ms  averaged over ten runs.
Combining the two approaches reduces the average time to 94ms:
I was thinking the same thing afterwards. Considering that I just did a review on your Cross Apply article, the notion struck me that your method and mine could be combined for the best of both worlds, but you beat me to it. You're much better at the use of Cross Apply and it seems to just come natural while I still have to think about it. It'll become more natural for me as I use it more, but you're still light years ahead of me on its usage.
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems




SSCForever
Group: General Forum Members
Last Login: Yesterday @ 9:30 PM
Points: 42,060,
Visits: 39,443


Holy Moly! The distribution of "R" and "T" suck with my method! I don't quite understand it because the distribution of "C" is nearly flat as you would expect with random numbers. Here's the test code if you'd like to play...
WITH cteC AS ( === Determine "C" SELECT CountC = ABS(CHECKSUM(NEWID()))%16+3 FROM dbo.Tally t WHERE t.N <= 10000 ), cteR AS ( === Determine "R" constrained by "C" SELECT CountC, CountR = ABS(CHECKSUM(NEWID()))%(25CountC6)+3 FROM cteC ), cteT AS ( === Determine "T" constrained by "C" and "R" SELECT CountC, CountR, CountT = 25CountCCountR FROM cteR ) === Unpivot the 3 columns of data and do counts of occurances for each Count"x" value. SELECT unpvt.CountName, unpvt.CountX, COUNT(*) AS Occurances FROM ( SELECT CountC,CountR,CountT FROM cteT ) pvt UNPIVOT ( CountX FOR CountName IN (CountC,CountR,CountT) )AS unpvt GROUP BY unpvt.CountName, unpvt.CountX ORDER BY unpvt.CountName, unpvt.CountX
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems



