Recent PostsRecent Posts Popular TopicsPopular Topics
 Home Search Members Calendar Who's On

 select random records but for each category Rate Topic Display Mode Topic Options
Author
 Message
 Posted Sunday, March 14, 2010 1:04 PM
 SSC-Dedicated Group: General Forum Members Last Login: Yesterday @ 5:22 PM Points: 36,016, Visits: 30,308
 Ok.... so far, I'm at a total loss as to why the formulas I wrote don't have a more even distibution. Sooooo.... I'll use the power of the computer to overcome my apparent lack of mathematical prowess...If we precalculate all possible combinations of numbers that have a value of at least "3" and save only the ones that add up to precisely "25" in a table, then a random selection on that table will give a very nice, very even distibution of the 153 different combinations that add up to "25". It's no different than having any other "helper" table (like a Tally or Calendar table) where you don't have to keep recalcalculating the same thing over and over...Here's the code for the "helper" table as above...`--===== Create a new permanent table with precalculated -- vehicle counts with distributions that always -- have at least a count of 3 per vehicle type and -- have a sum across the vehicle types in each row -- that add up to exactly 25. SELECT IDENTITY(INT,0,1) AS VehicleTypeCountID, tc.N AS CarCount, tr.N AS RVCount, tt.N AS TruckCount INTO dbo.VehicleTypeCount FROM dbo.Tally tc CROSS JOIN dbo.Tally tr CROSS JOIN dbo.Tally tt WHERE tc.N BETWEEN 3 AND 19 AND tr.N BETWEEN 3 AND 19 AND tt.N BETWEEN 3 AND 19 AND tc.N + tr.N + tt.N = 25 ORDER BY CarCount, RVCount, TruckCount --Just for "sanity" checks--===== Add the very necessary clustered index (PK in this case) -- which will make finding a random selection very fast. ALTER TABLE dbo.VehicleTypeCount ADD CONSTRAINT PK_VehicleTypeCount PRIMARY KEY CLUSTERED (VehicleTypeCountID`... and here's the code to make the assignments to the 3 variables we've been using...`--===== Demonstrate the random selection of distributed countsDECLARE @CountC INT, @CountR INT, @CountT INT SELECT @CountC = CarCount, @CountR = RVCount, @CountT = TruckCount FROM dbo.VehicleTypeCount WHERE VehicleTypeCountID = ABS(CHECKSUM(NEWID())) % 153 --Number of rows in table--===== Display the content of the variables for the "warm fuzzies" SELECT @CountC, @CountR, @CountT` --Jeff Moden"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column." "Change is inevitable. Change for the better is not." -- 04 August 2013(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013Helpful Links:How to post code problemsHow to post performance problems
Post #882612
 Posted Sunday, March 14, 2010 8:16 PM
 SSChampion Group: General Forum Members Last Login: Yesterday @ 7:35 PM Points: 11,168, Visits: 10,940
 Jeff Moden (3/14/2010)Holy Moly! The distribution of "R" and "T" suck with my method! I don't quite understand it because the distribution of "C" is nearly flat as you would expect with random numbers.Nice UNPIVOT there Jeff About the distribution:C takes a random integer value between 3 and 19. The average value of C is Σ(3...19) / 16 = 11R can now be any random integer value between 3 and (19-C).Using the average value of 11 for C, the range of R is (3...11) on average. Σ(3...11) / 9 = 7T ranges from 3 to (25 - R - C) => (25 - 7 - 11) => 7. The range of T is therefore (3...7) on average, and it has an average value of 5.Running your test rig over the first million integers (1...1000000), the averages, as reported by the AVG aggregate are:C = 10R = 6T = 7Paul Paul WhiteSQL Server MVPSQLblog.com@SQL_Kiwi
Post #882661
 Posted Sunday, March 14, 2010 9:29 PM
 SSC-Dedicated Group: General Forum Members Last Login: Yesterday @ 5:22 PM Points: 36,016, Visits: 30,308
 Yeah, but the MODE is flat for C... average shouldn't even come into play on this. --Jeff Moden"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column." "Change is inevitable. Change for the better is not." -- 04 August 2013(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013Helpful Links:How to post code problemsHow to post performance problems
Post #882674
 Posted Sunday, March 14, 2010 9:30 PM
 SSChampion Group: General Forum Members Last Login: Yesterday @ 7:35 PM Points: 11,168, Visits: 10,940
 Jeff Moden (3/14/2010)If we precalculate all possible combinations of numbers that have a value of at least "3" and save only the ones that add up to precisely "25" in a table, then a random selection on that table will give a very nice, very even distibution of the 153 different combinations that add up to "25". It's no different than having any other "helper" table (like a Tally or Calendar table) where you don't have to keep recalcalculating the same thing over and over...I like this idea a lot. Keeping it all in one query (though it is a shame that NEWID is illegal in a function):`SET STATISTICS TIME ON;WITH SelectionCountsAS ( -- Number of rows to select from each group -- Minimum 3, total for all groups = 25 -- Choose one of the possible combinations at random SELECT TOP (1) nc = C.n, nt = T.n, nr = R.n FROM dbo.Number C, dbo.Number T, dbo.Number R WHERE C.n BETWEEN 3 AND 19 AND T.n BETWEEN 3 AND 19 AND R.n BETWEEN 3 AND 19 AND C.n + T.n + R.n = 25 ORDER BY NEWID() ), GroupCountsAS ( --Find the total number of vehicles of each type SELECT vehicle_type, group_size = COUNT_BIG(*) FROM dbo.Source WHERE vehicle_type = 'C' GROUP BY vehicle_type UNION ALL SELECT vehicle_type, group_size = COUNT_BIG(*) FROM dbo.Source WHERE vehicle_type = 'T' GROUP BY vehicle_type UNION ALL SELECT vehicle_type, group_size = COUNT_BIG(*) FROM dbo.Source WHERE vehicle_type = 'R' GROUP BY vehicle_type ), ParametersAS ( SELECT GC.group_size, GC.vehicle_type, sample_size = CASE GC.vehicle_type WHEN 'C' THEN SC.nc WHEN 'T' THEN SC.nt WHEN 'R' THEN SC.nr ELSE 0 END FROM GroupCounts GC, SelectionCounts SC ), RandomRowsAS ( -- Generate a random row number for each row required from -- each group. Maximum rows returned = 25 * group_count -- -- The TOP (9223372036854775807 (= BIGINT.Max)) is just to -- force the query plan to calculate the random row number -- before the final join SELECT TOP (9223372036854775807) P.vehicle_type, Random.rn FROM Parameters P CROSS APPLY ( SELECT TOP (P.sample_size) rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), N.n)) * P.group_size + 1) FROM dbo.Number N ORDER BY N.n ASC ) Random ORDER BY P.vehicle_type, Random.rn ), NumberedSourceRowsAS ( -- Number the source rows, restarting the numbering for each group SELECT S.row_id, S.vehicle_type, S.vehicle_id, rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC) FROM dbo.Source S )-- Fetch the numbered rows that match the random row number, per groupSELECT NSR.row_id, NSR.vehicle_type, NSR.vehicle_idFROM RandomRows RRJOIN NumberedSourceRows NSR ON NSR.vehicle_type = RR.vehicle_type AND NSR.rn = RR.rn;SET STATISTICS TIME OFF;`The above code returns the expected even distribution, and runs in an average of 94ms - exactly the same as the previous method.Using a persisted table for one of the 153 possible combinations results in a slightly cleaner plan, but the CPU time remains at 94ms.Paul Paul WhiteSQL Server MVPSQLblog.com@SQL_Kiwi
Post #882675
 Posted Sunday, March 14, 2010 9:33 PM
 SSChampion Group: General Forum Members Last Login: Yesterday @ 7:35 PM Points: 11,168, Visits: 10,940
 Jeff Moden (3/14/2010)Yeah, but the MODE is flat for C... average shouldn't even come into play on this.I do not count math among my stronger skills The idea was that the average for C determines the average range for R...and so on. It might be complete nonsense... Paul WhiteSQL Server MVPSQLblog.com@SQL_Kiwi
Post #882676
 Posted Sunday, March 14, 2010 9:56 PM
 SSC-Dedicated Group: General Forum Members Last Login: Yesterday @ 5:22 PM Points: 36,016, Visits: 30,308
 Paul White (3/14/2010)(though it is a shame that NEWID is illegal in a function):Old gaming advice... if you can't move forward, backward, left, right, up, down, then change "color". You can access a VIEW from a UDF. One like the following allows for the use of NEWID() in a function... Same goes for GETDATE() and a couple of other things you supposedly can't use in a UDF.` CREATE VIEW MyNewID AS SELECT NEWID() AS ANewID` --Jeff Moden"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column." "Change is inevitable. Change for the better is not." -- 04 August 2013(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013Helpful Links:How to post code problemsHow to post performance problems
Post #882687
 Posted Sunday, March 14, 2010 10:24 PM
 SSChampion Group: General Forum Members Last Login: Yesterday @ 7:35 PM Points: 11,168, Visits: 10,940
 Jeff Moden (3/14/2010)You can access a VIEW from a UDF. One like the following allows for the use of NEWID() in a function... Same goes for GETDATE() and a couple of other things you supposedly can't use in a UDF.Thanks Jeff! I had read that before, but did not make the connection. It will complicate the SELECT a bit, but I will have a go...The function will be non-deterministic of course, but that's kinda the point I suppose.Shame though, because the optimizer really hates non-deterministic things. Paul WhiteSQL Server MVPSQLblog.com@SQL_Kiwi
Post #882702
 Posted Sunday, March 14, 2010 10:58 PM
 SSC-Dedicated Group: General Forum Members Last Login: Yesterday @ 5:22 PM Points: 36,016, Visits: 30,308
 But not too complicated...` SELECT TOP (1) nc = C.n, nt = T.n, nr = R.n FROM dbo.Number C, dbo.Number T, dbo.Number R WHERE C.n BETWEEN 3 AND 19 AND T.n BETWEEN 3 AND 19 AND R.n BETWEEN 3 AND 19 AND C.n + T.n + R.n = 25 ORDER BY (SELECT ANewID FROM dbo.MyNewID) --This used to be NEWID()` --Jeff Moden"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column." "Change is inevitable. Change for the better is not." -- 04 August 2013(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013Helpful Links:How to post code problemsHow to post performance problems
Post #882714
 Posted Sunday, March 14, 2010 11:09 PM
 SSChampion Group: General Forum Members Last Login: Yesterday @ 7:35 PM Points: 11,168, Visits: 10,940
 Jeff Moden (3/14/2010)But not too complicated...There is another one inside the CROSS APPLY. Not difficult either, I am just hoping the query plan holds together. Paul WhiteSQL Server MVPSQLblog.com@SQL_Kiwi
Post #882718
 Posted Sunday, March 14, 2010 11:55 PM
 SSChampion Group: General Forum Members Last Login: Yesterday @ 7:35 PM Points: 11,168, Visits: 10,940
 OK, all done. I had to include the RAND in the wrapper view too, for the same reason.Here is the full script:`USE tempdb;-- Drop the in-line functionIF OBJECT_ID(N'dbo.GetSourceSample', N'IF') IS NOT NULL DROP FUNCTION dbo.GetSourceSampleGO -- Drop the view wrapping NEWIDIF OBJECT_ID(N'dbo.Random', N'V') IS NOT NULL DROP VIEW dbo.Random;GO -- Drop the combinations tableIF OBJECT_ID(N'dbo.Combinations', N'U') IS NOT NULL DROP TABLE dbo.Combinations;GO-- Drop our test tableIF OBJECT_ID(N'dbo.Source', N'U') IS NOT NULL DROP TABLE dbo.Source;GO-- Create numbers table if requiredIF OBJECT_ID(N'dbo.Number', N'U') IS NULLBEGIN CREATE TABLE dbo.Number (n SMALLINT PRIMARY KEY); -- 736 SMALLINTs exactly fill one 8K page INSERT dbo.Number (n) SELECT TOP (736) ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM master.sys.columns C1, master.sys.columns C2, master.sys.columns C3;END;GO-- Create the test tableCREATE TABLE dbo.Source ( row_id INTEGER IDENTITY PRIMARY KEY, vehicle_type CHAR(1) NOT NULL, vehicle_id INTEGER NOT NULL );GO-- View wrapping NEWIDCREATE VIEW dbo.RandomWITH SCHEMABINDING AS SELECT n = RAND(CHECKSUM(NEWID()));GO-- All possible combinations of values [3...19]-- that sum to 25SELECT row_id = IDENTITY(INT, 0, 1), nc = C.n, nt = T.n, nr = R.nINTO dbo.CombinationsFROM dbo.Number C, dbo.Number T, dbo.Number RWHERE C.n BETWEEN 3 AND 19AND T.n BETWEEN 3 AND 19AND R.n BETWEEN 3 AND 19AND C.n + T.n + R.n = 25-- Create clustered primary keyALTER TABLE dbo.CombinationsADD PRIMARY KEY (row_id)WITH (FILLFACTOR = 100);GO-- Add 100,000 random rows, roughly evenly distributed-- across vehicle typesINSERT dbo.Source ( vehicle_type, vehicle_id )SELECT TOP (100000) CASE -- Chance of being a car WHEN RAND(CHECKSUM(NEWID(), N1.n)) <= 0.3333 THEN 'C' -- Chance of being a truck WHEN RAND(CHECKSUM(NEWID(), N2.n)) <= 0.3333 THEN 'T' -- Chance of being an RV ELSE 'R' END, -- Random vehicle id ABS(CHECKSUM(NEWID()))FROM dbo.Number N1, dbo.Number N2, dbo.Number N3;GO-- A helpful indexCREATE UNIQUE INDEX [UQ dbo.Source vehicle_type, row_id (vehicle_id)]ON dbo.Source (vehicle_type ASC, row_id ASC) INCLUDE (vehicle_id) WITH (FILLFACTOR = 100);GO-- Show the number of vehicles of each typeSELECT vehicle_type, row_count = COUNT_BIG(*)FROM dbo.Source GROUP BY vehicle_typeORDER BY vehicle_type;GO-- Create the functionCREATE FUNCTION dbo.GetSourceSample ()RETURNS TABLEWITH SCHEMABINDINGAS RETURN WITH SelectionCounts AS ( -- Number of rows to select from each group -- Minimum 3, total for all groups = 25 -- Choose one of the possible combinations at random SELECT TOP (1) nc = C.n, nt = T.n, nr = R.n FROM dbo.Number C, dbo.Number T, dbo.Number R WHERE C.n BETWEEN 3 AND 19 AND T.n BETWEEN 3 AND 19 AND R.n BETWEEN 3 AND 19 AND C.n + T.n + R.n = 25 ORDER BY (SELECT R.n FROM dbo.Random R) ), GroupCounts AS ( --Find the total number of vehicles of each type SELECT vehicle_type, group_size = COUNT_BIG(*) FROM dbo.Source WHERE vehicle_type = 'C' GROUP BY vehicle_type UNION ALL SELECT vehicle_type, group_size = COUNT_BIG(*) FROM dbo.Source WHERE vehicle_type = 'T' GROUP BY vehicle_type UNION ALL SELECT vehicle_type, group_size = COUNT_BIG(*) FROM dbo.Source WHERE vehicle_type = 'R' GROUP BY vehicle_type ), Parameters AS ( SELECT GC.group_size, GC.vehicle_type, sample_size = CASE GC.vehicle_type WHEN 'C' THEN SC.nc WHEN 'T' THEN SC.nt WHEN 'R' THEN SC.nr ELSE 0 END FROM GroupCounts GC, SelectionCounts SC ), RandomRows AS ( -- Generate a random row number for each row required from -- each group. Maximum rows returned = 25 * group_count -- -- The TOP (9223372036854775807 (= BIGINT.Max)) is just to -- force the query plan to calculate the random row number -- before the final join SELECT TOP (9223372036854775807) P.vehicle_type, Random.rn FROM Parameters P CROSS APPLY ( SELECT TOP (P.sample_size) rn = CONVERT(BIGINT, R.n * P.group_size + 1) FROM dbo.Number N, dbo.Random R ORDER BY N.n ASC ) Random ORDER BY P.vehicle_type, Random.rn ), NumberedSourceRows AS ( -- Number the source rows, restarting the numbering for each group SELECT S.row_id, S.vehicle_type, S.vehicle_id, rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC) FROM dbo.Source S ) -- Fetch the numbered rows that match the random row number, per group SELECT NSR.row_id, NSR.vehicle_type, NSR.vehicle_id FROM RandomRows RR JOIN NumberedSourceRows NSR ON NSR.vehicle_type = RR.vehicle_type AND NSR.rn = RR.rn;GOSET STATISTICS TIME ON;SELECT GSS.row_id, GSS.vehicle_type, GSS.vehicle_idFROM dbo.GetSourceSample() GSS;SET STATISTICS TIME OFF;GO-- Drop the in-line functionIF OBJECT_ID(N'dbo.GetSourceSample', N'IF') IS NOT NULL DROP FUNCTION dbo.GetSourceSample -- Drop the view wrapping NEWIDIF OBJECT_ID(N'dbo.Random', N'V') IS NOT NULL DROP VIEW dbo.Random; -- Drop the combinations tableIF OBJECT_ID(N'dbo.Combinations', N'U') IS NOT NULL DROP TABLE dbo.Combinations;-- Drop our test tableIF OBJECT_ID(N'dbo.Source', N'U') IS NOT NULL DROP TABLE dbo.Source;`Plan stayed exactly the same, 94ms average CPU time. Paul WhiteSQL Server MVPSQLblog.com@SQL_Kiwi
Post #882723

 Permissions