

SSCForever
Group: General Forum Members
Last Login: Yesterday @ 8:48 PM
Points: 40,994,
Visits: 38,293


Ok.... so far, I'm at a total loss as to why the formulas I wrote don't have a more even distibution. Sooooo.... I'll use the power of the computer to overcome my apparent lack of mathematical prowess...
If we precalculate all possible combinations of numbers that have a value of at least "3" and save only the ones that add up to precisely "25" in a table, then a random selection on that table will give a very nice, very even distibution of the 153 different combinations that add up to "25". It's no different than having any other "helper" table (like a Tally or Calendar table) where you don't have to keep recalcalculating the same thing over and over...
Here's the code for the "helper" table as above...
===== Create a new permanent table with precalculated  vehicle counts with distributions that always  have at least a count of 3 per vehicle type and  have a sum across the vehicle types in each row  that add up to exactly 25. SELECT IDENTITY(INT,0,1) AS VehicleTypeCountID, tc.N AS CarCount, tr.N AS RVCount, tt.N AS TruckCount INTO dbo.VehicleTypeCount FROM dbo.Tally tc CROSS JOIN dbo.Tally tr CROSS JOIN dbo.Tally tt WHERE tc.N BETWEEN 3 AND 19 AND tr.N BETWEEN 3 AND 19 AND tt.N BETWEEN 3 AND 19 AND tc.N + tr.N + tt.N = 25 ORDER BY CarCount, RVCount, TruckCount Just for "sanity" checks
===== Add the very necessary clustered index (PK in this case)  which will make finding a random selection very fast. ALTER TABLE dbo.VehicleTypeCount ADD CONSTRAINT PK_VehicleTypeCount PRIMARY KEY CLUSTERED (VehicleTypeCountID
... and here's the code to make the assignments to the 3 variables we've been using...
===== Demonstrate the random selection of distributed counts DECLARE @CountC INT, @CountR INT, @CountT INT
SELECT @CountC = CarCount, @CountR = RVCount, @CountT = TruckCount FROM dbo.VehicleTypeCount WHERE VehicleTypeCountID = ABS(CHECKSUM(NEWID())) % 153 Number of rows in table
===== Display the content of the variables for the "warm fuzzies" SELECT @CountC, @CountR, @CountT
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems




SSCrazy Eights
Group: General Forum Members
Last Login: Friday, July 22, 2016 6:18 PM
Points: 9,932,
Visits: 11,342


Jeff Moden (3/14/2010) Holy Moly! The distribution of "R" and "T" suck with my method! I don't quite understand it because the distribution of "C" is nearly flat as you would expect with random numbers. Nice UNPIVOT there Jeff
About the distribution:
C takes a random integer value between 3 and 19. The average value of C is Σ(3...19) / 16 = 11
R can now be any random integer value between 3 and (19C). Using the average value of 11 for C, the range of R is (3...11) on average. Σ(3...11) / 9 = 7
T ranges from 3 to (25  R  C) => (25  7  11) => 7. The range of T is therefore (3...7) on average, and it has an average value of 5.
Running your test rig over the first million integers (1...1000000), the averages, as reported by the AVG aggregate are:
C = 10 R = 6 T = 7
Paul
Paul White SQLPerformance.com SQLblog.com @SQL_Kiwi




SSCForever
Group: General Forum Members
Last Login: Yesterday @ 8:48 PM
Points: 40,994,
Visits: 38,293


Yeah, but the MODE is flat for C... average shouldn't even come into play on this.
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems




SSCrazy Eights
Group: General Forum Members
Last Login: Friday, July 22, 2016 6:18 PM
Points: 9,932,
Visits: 11,342


Jeff Moden (3/14/2010) If we precalculate all possible combinations of numbers that have a value of at least "3" and save only the ones that add up to precisely "25" in a table, then a random selection on that table will give a very nice, very even distibution of the 153 different combinations that add up to "25". It's no different than having any other "helper" table (like a Tally or Calendar table) where you don't have to keep recalcalculating the same thing over and over... I like this idea a lot. Keeping it all in one query (though it is a shame that NEWID is illegal in a function):
SET STATISTICS TIME ON;
WITH SelectionCounts AS (  Number of rows to select from each group  Minimum 3, total for all groups = 25  Choose one of the possible combinations at random SELECT TOP (1) nc = C.n, nt = T.n, nr = R.n FROM dbo.Number C, dbo.Number T, dbo.Number R WHERE C.n BETWEEN 3 AND 19 AND T.n BETWEEN 3 AND 19 AND R.n BETWEEN 3 AND 19 AND C.n + T.n + R.n = 25 ORDER BY NEWID() ), GroupCounts AS ( Find the total number of vehicles of each type SELECT vehicle_type, group_size = COUNT_BIG(*) FROM dbo.Source WHERE vehicle_type = 'C' GROUP BY vehicle_type UNION ALL SELECT vehicle_type, group_size = COUNT_BIG(*) FROM dbo.Source WHERE vehicle_type = 'T' GROUP BY vehicle_type UNION ALL SELECT vehicle_type, group_size = COUNT_BIG(*) FROM dbo.Source WHERE vehicle_type = 'R' GROUP BY vehicle_type ), Parameters AS ( SELECT GC.group_size, GC.vehicle_type, sample_size = CASE GC.vehicle_type WHEN 'C' THEN SC.nc WHEN 'T' THEN SC.nt WHEN 'R' THEN SC.nr ELSE 0 END FROM GroupCounts GC, SelectionCounts SC ), RandomRows AS (  Generate a random row number for each row required from  each group. Maximum rows returned = 25 * group_count   The TOP (9223372036854775807 (= BIGINT.Max)) is just to  force the query plan to calculate the random row number  before the final join SELECT TOP (9223372036854775807) P.vehicle_type, Random.rn FROM Parameters P CROSS APPLY ( SELECT TOP (P.sample_size) rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), N.n)) * P.group_size + 1) FROM dbo.Number N ORDER BY N.n ASC ) Random ORDER BY P.vehicle_type, Random.rn ), NumberedSourceRows AS (  Number the source rows, restarting the numbering for each group SELECT S.row_id, S.vehicle_type, S.vehicle_id, rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC) FROM dbo.Source S )  Fetch the numbered rows that match the random row number, per group SELECT NSR.row_id, NSR.vehicle_type, NSR.vehicle_id FROM RandomRows RR JOIN NumberedSourceRows NSR ON NSR.vehicle_type = RR.vehicle_type AND NSR.rn = RR.rn;
SET STATISTICS TIME OFF;
The above code returns the expected even distribution, and runs in an average of 94ms  exactly the same as the previous method. Using a persisted table for one of the 153 possible combinations results in a slightly cleaner plan, but the CPU time remains at 94ms.
Paul
Paul White SQLPerformance.com SQLblog.com @SQL_Kiwi




SSCrazy Eights
Group: General Forum Members
Last Login: Friday, July 22, 2016 6:18 PM
Points: 9,932,
Visits: 11,342


Jeff Moden (3/14/2010) Yeah, but the MODE is flat for C... average shouldn't even come into play on this. I do not count math among my stronger skills The idea was that the average for C determines the average range for R...and so on. It might be complete nonsense...
Paul White SQLPerformance.com SQLblog.com @SQL_Kiwi




SSCForever
Group: General Forum Members
Last Login: Yesterday @ 8:48 PM
Points: 40,994,
Visits: 38,293


Paul White (3/14/2010) (though it is a shame that NEWID is illegal in a function):
Old gaming advice... if you can't move forward, backward, left, right, up, down, then change "color".
You can access a VIEW from a UDF. One like the following allows for the use of NEWID() in a function... Same goes for GETDATE() and a couple of other things you supposedly can't use in a UDF.
CREATE VIEW MyNewID AS SELECT NEWID() AS ANewID
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems




SSCrazy Eights
Group: General Forum Members
Last Login: Friday, July 22, 2016 6:18 PM
Points: 9,932,
Visits: 11,342


Jeff Moden (3/14/2010) You can access a VIEW from a UDF. One like the following allows for the use of NEWID() in a function... Same goes for GETDATE() and a couple of other things you supposedly can't use in a UDF. Thanks Jeff! I had read that before, but did not make the connection. It will complicate the SELECT a bit, but I will have a go... The function will be nondeterministic of course, but that's kinda the point I suppose. Shame though, because the optimizer really hates nondeterministic things.
Paul White SQLPerformance.com SQLblog.com @SQL_Kiwi




SSCForever
Group: General Forum Members
Last Login: Yesterday @ 8:48 PM
Points: 40,994,
Visits: 38,293


But not too complicated...
SELECT TOP (1) nc = C.n, nt = T.n, nr = R.n FROM dbo.Number C, dbo.Number T, dbo.Number R WHERE C.n BETWEEN 3 AND 19 AND T.n BETWEEN 3 AND 19 AND R.n BETWEEN 3 AND 19 AND C.n + T.n + R.n = 25 ORDER BY (SELECT ANewID FROM dbo.MyNewID) This used to be NEWID()
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems




SSCrazy Eights
Group: General Forum Members
Last Login: Friday, July 22, 2016 6:18 PM
Points: 9,932,
Visits: 11,342


Jeff Moden (3/14/2010) But not too complicated... There is another one inside the CROSS APPLY. Not difficult either, I am just hoping the query plan holds together.
Paul White SQLPerformance.com SQLblog.com @SQL_Kiwi




SSCrazy Eights
Group: General Forum Members
Last Login: Friday, July 22, 2016 6:18 PM
Points: 9,932,
Visits: 11,342


OK, all done. I had to include the RAND in the wrapper view too, for the same reason.
Here is the full script:
USE tempdb;
 Drop the inline function IF OBJECT_ID(N'dbo.GetSourceSample', N'IF') IS NOT NULL DROP FUNCTION dbo.GetSourceSample GO  Drop the view wrapping NEWID IF OBJECT_ID(N'dbo.Random', N'V') IS NOT NULL DROP VIEW dbo.Random; GO  Drop the combinations table IF OBJECT_ID(N'dbo.Combinations', N'U') IS NOT NULL DROP TABLE dbo.Combinations; GO  Drop our test table IF OBJECT_ID(N'dbo.Source', N'U') IS NOT NULL DROP TABLE dbo.Source; GO  Create numbers table if required IF OBJECT_ID(N'dbo.Number', N'U') IS NULL BEGIN CREATE TABLE dbo.Number (n SMALLINT PRIMARY KEY);
 736 SMALLINTs exactly fill one 8K page INSERT dbo.Number (n) SELECT TOP (736) ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM master.sys.columns C1, master.sys.columns C2, master.sys.columns C3; END; GO  Create the test table CREATE TABLE dbo.Source ( row_id INTEGER IDENTITY PRIMARY KEY, vehicle_type CHAR(1) NOT NULL, vehicle_id INTEGER NOT NULL ); GO  View wrapping NEWID CREATE VIEW dbo.Random WITH SCHEMABINDING AS SELECT n = RAND(CHECKSUM(NEWID())); GO  All possible combinations of values [3...19]  that sum to 25 SELECT row_id = IDENTITY(INT, 0, 1), nc = C.n, nt = T.n, nr = R.n INTO dbo.Combinations FROM dbo.Number C, dbo.Number T, dbo.Number R WHERE C.n BETWEEN 3 AND 19 AND T.n BETWEEN 3 AND 19 AND R.n BETWEEN 3 AND 19 AND C.n + T.n + R.n = 25
 Create clustered primary key ALTER TABLE dbo.Combinations ADD PRIMARY KEY (row_id) WITH (FILLFACTOR = 100); GO  Add 100,000 random rows, roughly evenly distributed  across vehicle types INSERT dbo.Source ( vehicle_type, vehicle_id ) SELECT TOP (100000) CASE  Chance of being a car WHEN RAND(CHECKSUM(NEWID(), N1.n)) <= 0.3333 THEN 'C'  Chance of being a truck WHEN RAND(CHECKSUM(NEWID(), N2.n)) <= 0.3333 THEN 'T'  Chance of being an RV ELSE 'R' END,  Random vehicle id ABS(CHECKSUM(NEWID())) FROM dbo.Number N1, dbo.Number N2, dbo.Number N3; GO  A helpful index CREATE UNIQUE INDEX [UQ dbo.Source vehicle_type, row_id (vehicle_id)] ON dbo.Source (vehicle_type ASC, row_id ASC) INCLUDE (vehicle_id) WITH (FILLFACTOR = 100); GO  Show the number of vehicles of each type SELECT vehicle_type, row_count = COUNT_BIG(*) FROM dbo.Source GROUP BY vehicle_type ORDER BY vehicle_type; GO  Create the function CREATE FUNCTION dbo.GetSourceSample () RETURNS TABLE WITH SCHEMABINDING AS RETURN WITH SelectionCounts AS (  Number of rows to select from each group  Minimum 3, total for all groups = 25  Choose one of the possible combinations at random SELECT TOP (1) nc = C.n, nt = T.n, nr = R.n FROM dbo.Number C, dbo.Number T, dbo.Number R WHERE C.n BETWEEN 3 AND 19 AND T.n BETWEEN 3 AND 19 AND R.n BETWEEN 3 AND 19 AND C.n + T.n + R.n = 25 ORDER BY (SELECT R.n FROM dbo.Random R) ), GroupCounts AS ( Find the total number of vehicles of each type SELECT vehicle_type, group_size = COUNT_BIG(*) FROM dbo.Source WHERE vehicle_type = 'C' GROUP BY vehicle_type UNION ALL SELECT vehicle_type, group_size = COUNT_BIG(*) FROM dbo.Source WHERE vehicle_type = 'T' GROUP BY vehicle_type UNION ALL SELECT vehicle_type, group_size = COUNT_BIG(*) FROM dbo.Source WHERE vehicle_type = 'R' GROUP BY vehicle_type ), Parameters AS ( SELECT GC.group_size, GC.vehicle_type, sample_size = CASE GC.vehicle_type WHEN 'C' THEN SC.nc WHEN 'T' THEN SC.nt WHEN 'R' THEN SC.nr ELSE 0 END FROM GroupCounts GC, SelectionCounts SC ), RandomRows AS (  Generate a random row number for each row required from  each group. Maximum rows returned = 25 * group_count   The TOP (9223372036854775807 (= BIGINT.Max)) is just to  force the query plan to calculate the random row number  before the final join SELECT TOP (9223372036854775807) P.vehicle_type, Random.rn FROM Parameters P CROSS APPLY ( SELECT TOP (P.sample_size) rn = CONVERT(BIGINT, R.n * P.group_size + 1) FROM dbo.Number N, dbo.Random R ORDER BY N.n ASC ) Random ORDER BY P.vehicle_type, Random.rn ), NumberedSourceRows AS (  Number the source rows, restarting the numbering for each group SELECT S.row_id, S.vehicle_type, S.vehicle_id, rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC) FROM dbo.Source S )  Fetch the numbered rows that match the random row number, per group SELECT NSR.row_id, NSR.vehicle_type, NSR.vehicle_id FROM RandomRows RR JOIN NumberedSourceRows NSR ON NSR.vehicle_type = RR.vehicle_type AND NSR.rn = RR.rn; GO SET STATISTICS TIME ON;
SELECT GSS.row_id, GSS.vehicle_type, GSS.vehicle_id FROM dbo.GetSourceSample() GSS;
SET STATISTICS TIME OFF; GO  Drop the inline function IF OBJECT_ID(N'dbo.GetSourceSample', N'IF') IS NOT NULL DROP FUNCTION dbo.GetSourceSample  Drop the view wrapping NEWID IF OBJECT_ID(N'dbo.Random', N'V') IS NOT NULL DROP VIEW dbo.Random;  Drop the combinations table IF OBJECT_ID(N'dbo.Combinations', N'U') IS NOT NULL DROP TABLE dbo.Combinations;
 Drop our test table IF OBJECT_ID(N'dbo.Source', N'U') IS NOT NULL DROP TABLE dbo.Source;
Plan stayed exactly the same, 94ms average CPU time.
Paul White SQLPerformance.com SQLblog.com @SQL_Kiwi



