Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««1234»»»

select random records but for each category Expand / Collapse
Author
Message
Posted Sunday, March 14, 2010 1:04 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 5:22 PM
Points: 37,061, Visits: 31,624
Ok.... so far, I'm at a total loss as to why the formulas I wrote don't have a more even distibution. Sooooo.... I'll use the power of the computer to overcome my apparent lack of mathematical prowess...

If we precalculate all possible combinations of numbers that have a value of at least "3" and save only the ones that add up to precisely "25" in a table, then a random selection on that table will give a very nice, very even distibution of the 153 different combinations that add up to "25". It's no different than having any other "helper" table (like a Tally or Calendar table) where you don't have to keep recalcalculating the same thing over and over...

Here's the code for the "helper" table as above...
--===== Create a new permanent table with precalculated 
-- vehicle counts with distributions that always
-- have at least a count of 3 per vehicle type and
-- have a sum across the vehicle types in each row
-- that add up to exactly 25.
SELECT IDENTITY(INT,0,1) AS VehicleTypeCountID,
tc.N AS CarCount,
tr.N AS RVCount,
tt.N AS TruckCount
INTO dbo.VehicleTypeCount
FROM dbo.Tally tc
CROSS JOIN dbo.Tally tr
CROSS JOIN dbo.Tally tt
WHERE tc.N BETWEEN 3 AND 19
AND tr.N BETWEEN 3 AND 19
AND tt.N BETWEEN 3 AND 19
AND tc.N + tr.N + tt.N = 25
ORDER BY CarCount, RVCount, TruckCount --Just for "sanity" checks

--===== Add the very necessary clustered index (PK in this case)
-- which will make finding a random selection very fast.
ALTER TABLE dbo.VehicleTypeCount
ADD CONSTRAINT PK_VehicleTypeCount
PRIMARY KEY CLUSTERED (VehicleTypeCountID


... and here's the code to make the assignments to the 3 variables we've been using...
--===== Demonstrate the random selection of distributed counts
DECLARE @CountC INT,
@CountR INT,
@CountT INT

SELECT @CountC = CarCount,
@CountR = RVCount,
@CountT = TruckCount
FROM dbo.VehicleTypeCount
WHERE VehicleTypeCountID = ABS(CHECKSUM(NEWID())) % 153 --Number of rows in table

--===== Display the content of the variables for the "warm fuzzies"
SELECT @CountC, @CountR, @CountT




--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #882612
Posted Sunday, March 14, 2010 8:16 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: 2 days ago @ 8:01 AM
Points: 11,194, Visits: 11,165
Jeff Moden (3/14/2010)
Holy Moly! The distribution of "R" and "T" suck with my method! I don't quite understand it because the distribution of "C" is nearly flat as you would expect with random numbers.

Nice UNPIVOT there Jeff

About the distribution:

C takes a random integer value between 3 and 19. The average value of C is Σ(3...19) / 16 = 11

R can now be any random integer value between 3 and (19-C).
Using the average value of 11 for C, the range of R is (3...11) on average. Σ(3...11) / 9 = 7

T ranges from 3 to (25 - R - C) => (25 - 7 - 11) => 7. The range of T is therefore (3...7) on average, and it has an average value of 5.

Running your test rig over the first million integers (1...1000000), the averages, as reported by the AVG aggregate are:

C = 10
R = 6
T = 7

Paul




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #882661
Posted Sunday, March 14, 2010 9:29 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 5:22 PM
Points: 37,061, Visits: 31,624
Yeah, but the MODE is flat for C... average shouldn't even come into play on this.

--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #882674
Posted Sunday, March 14, 2010 9:30 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: 2 days ago @ 8:01 AM
Points: 11,194, Visits: 11,165
Jeff Moden (3/14/2010)
If we precalculate all possible combinations of numbers that have a value of at least "3" and save only the ones that add up to precisely "25" in a table, then a random selection on that table will give a very nice, very even distibution of the 153 different combinations that add up to "25". It's no different than having any other "helper" table (like a Tally or Calendar table) where you don't have to keep recalcalculating the same thing over and over...

I like this idea a lot. Keeping it all in one query (though it is a shame that NEWID is illegal in a function):

SET STATISTICS TIME ON;

WITH SelectionCounts
AS (
-- Number of rows to select from each group
-- Minimum 3, total for all groups = 25
-- Choose one of the possible combinations at random
SELECT TOP (1)
nc = C.n,
nt = T.n,
nr = R.n
FROM dbo.Number C,
dbo.Number T,
dbo.Number R
WHERE C.n BETWEEN 3 AND 19
AND T.n BETWEEN 3 AND 19
AND R.n BETWEEN 3 AND 19
AND C.n + T.n + R.n = 25
ORDER BY
NEWID()
),
GroupCounts
AS (
--Find the total number of vehicles of each type
SELECT vehicle_type,
group_size = COUNT_BIG(*)
FROM dbo.Source
WHERE vehicle_type = 'C'
GROUP BY
vehicle_type
UNION ALL
SELECT vehicle_type,
group_size = COUNT_BIG(*)
FROM dbo.Source
WHERE vehicle_type = 'T'
GROUP BY
vehicle_type
UNION ALL
SELECT vehicle_type,
group_size = COUNT_BIG(*)
FROM dbo.Source
WHERE vehicle_type = 'R'
GROUP BY
vehicle_type
),
Parameters
AS (
SELECT GC.group_size,
GC.vehicle_type,
sample_size =
CASE GC.vehicle_type
WHEN 'C' THEN SC.nc
WHEN 'T' THEN SC.nt
WHEN 'R' THEN SC.nr
ELSE 0
END
FROM GroupCounts GC,
SelectionCounts SC
),
RandomRows
AS (
-- Generate a random row number for each row required from
-- each group. Maximum rows returned = 25 * group_count
--
-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to
-- force the query plan to calculate the random row number
-- before the final join
SELECT TOP (9223372036854775807)
P.vehicle_type,
Random.rn
FROM Parameters P
CROSS
APPLY (
SELECT TOP (P.sample_size)
rn = CONVERT(BIGINT, RAND(CHECKSUM(NEWID(), N.n)) * P.group_size + 1)
FROM dbo.Number N
ORDER BY
N.n ASC
) Random
ORDER BY
P.vehicle_type,
Random.rn
),
NumberedSourceRows
AS (
-- Number the source rows, restarting the numbering for each group
SELECT S.row_id,
S.vehicle_type,
S.vehicle_id,
rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)
FROM dbo.Source S
)
-- Fetch the numbered rows that match the random row number, per group
SELECT NSR.row_id,
NSR.vehicle_type,
NSR.vehicle_id
FROM RandomRows RR
JOIN NumberedSourceRows NSR
ON NSR.vehicle_type = RR.vehicle_type
AND NSR.rn = RR.rn;

SET STATISTICS TIME OFF;

The above code returns the expected even distribution, and runs in an average of 94ms - exactly the same as the previous method.
Using a persisted table for one of the 153 possible combinations results in a slightly cleaner plan, but the CPU time remains at 94ms.

Paul




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #882675
Posted Sunday, March 14, 2010 9:33 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: 2 days ago @ 8:01 AM
Points: 11,194, Visits: 11,165
Jeff Moden (3/14/2010)
Yeah, but the MODE is flat for C... average shouldn't even come into play on this.

I do not count math among my stronger skills
The idea was that the average for C determines the average range for R...and so on. It might be complete nonsense...




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #882676
Posted Sunday, March 14, 2010 9:56 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 5:22 PM
Points: 37,061, Visits: 31,624
Paul White (3/14/2010)
(though it is a shame that NEWID is illegal in a function):


Old gaming advice... if you can't move forward, backward, left, right, up, down, then change "color".

You can access a VIEW from a UDF. One like the following allows for the use of NEWID() in a function... Same goes for GETDATE() and a couple of other things you supposedly can't use in a UDF.

 CREATE VIEW MyNewID
AS
SELECT NEWID() AS ANewID




--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #882687
Posted Sunday, March 14, 2010 10:24 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: 2 days ago @ 8:01 AM
Points: 11,194, Visits: 11,165
Jeff Moden (3/14/2010)
You can access a VIEW from a UDF. One like the following allows for the use of NEWID() in a function... Same goes for GETDATE() and a couple of other things you supposedly can't use in a UDF.

Thanks Jeff! I had read that before, but did not make the connection. It will complicate the SELECT a bit, but I will have a go...
The function will be non-deterministic of course, but that's kinda the point I suppose.
Shame though, because the optimizer really hates non-deterministic things.




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #882702
Posted Sunday, March 14, 2010 10:58 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 5:22 PM
Points: 37,061, Visits: 31,624
But not too complicated...

        SELECT  TOP (1)
nc = C.n,
nt = T.n,
nr = R.n
FROM dbo.Number C,
dbo.Number T,
dbo.Number R
WHERE C.n BETWEEN 3 AND 19
AND T.n BETWEEN 3 AND 19
AND R.n BETWEEN 3 AND 19
AND C.n + T.n + R.n = 25
ORDER BY
(SELECT ANewID FROM dbo.MyNewID) --This used to be NEWID()




--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #882714
Posted Sunday, March 14, 2010 11:09 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: 2 days ago @ 8:01 AM
Points: 11,194, Visits: 11,165
Jeff Moden (3/14/2010)
But not too complicated...

There is another one inside the CROSS APPLY. Not difficult either, I am just hoping the query plan holds together.




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #882718
Posted Sunday, March 14, 2010 11:55 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: 2 days ago @ 8:01 AM
Points: 11,194, Visits: 11,165
OK, all done. I had to include the RAND in the wrapper view too, for the same reason.

Here is the full script:
USE tempdb;

-- Drop the in-line function
IF OBJECT_ID(N'dbo.GetSourceSample', N'IF')
IS NOT NULL
DROP FUNCTION dbo.GetSourceSample
GO
-- Drop the view wrapping NEWID
IF OBJECT_ID(N'dbo.Random', N'V')
IS NOT NULL
DROP VIEW dbo.Random;
GO
-- Drop the combinations table
IF OBJECT_ID(N'dbo.Combinations', N'U')
IS NOT NULL
DROP TABLE dbo.Combinations;
GO
-- Drop our test table
IF OBJECT_ID(N'dbo.Source', N'U')
IS NOT NULL
DROP TABLE dbo.Source;
GO
-- Create numbers table if required
IF OBJECT_ID(N'dbo.Number', N'U')
IS NULL
BEGIN
CREATE TABLE dbo.Number (n SMALLINT PRIMARY KEY);

-- 736 SMALLINTs exactly fill one 8K page
INSERT dbo.Number (n)
SELECT TOP (736)
ROW_NUMBER() OVER (ORDER BY (SELECT 0))
FROM master.sys.columns C1,
master.sys.columns C2,
master.sys.columns C3;
END;
GO
-- Create the test table
CREATE TABLE dbo.Source
(
row_id INTEGER IDENTITY PRIMARY KEY,
vehicle_type CHAR(1) NOT NULL,
vehicle_id INTEGER NOT NULL
);
GO
-- View wrapping NEWID
CREATE VIEW dbo.Random
WITH SCHEMABINDING
AS SELECT n = RAND(CHECKSUM(NEWID()));
GO
-- All possible combinations of values [3...19]
-- that sum to 25
SELECT row_id = IDENTITY(INT, 0, 1),
nc = C.n,
nt = T.n,
nr = R.n
INTO dbo.Combinations
FROM dbo.Number C,
dbo.Number T,
dbo.Number R
WHERE C.n BETWEEN 3 AND 19
AND T.n BETWEEN 3 AND 19
AND R.n BETWEEN 3 AND 19
AND C.n + T.n + R.n = 25

-- Create clustered primary key
ALTER TABLE dbo.Combinations
ADD PRIMARY KEY (row_id)
WITH (FILLFACTOR = 100);
GO
-- Add 100,000 random rows, roughly evenly distributed
-- across vehicle types
INSERT dbo.Source
(
vehicle_type,
vehicle_id
)
SELECT TOP (100000)
CASE
-- Chance of being a car
WHEN RAND(CHECKSUM(NEWID(), N1.n)) <= 0.3333 THEN 'C'
-- Chance of being a truck
WHEN RAND(CHECKSUM(NEWID(), N2.n)) <= 0.3333 THEN 'T'
-- Chance of being an RV
ELSE 'R'
END,
-- Random vehicle id
ABS(CHECKSUM(NEWID()))
FROM dbo.Number N1,
dbo.Number N2,
dbo.Number N3;
GO
-- A helpful index
CREATE UNIQUE INDEX [UQ dbo.Source vehicle_type, row_id (vehicle_id)]
ON dbo.Source
(vehicle_type ASC, row_id ASC)
INCLUDE (vehicle_id)
WITH (FILLFACTOR = 100);
GO
-- Show the number of vehicles of each type
SELECT vehicle_type,
row_count = COUNT_BIG(*)
FROM dbo.Source
GROUP BY
vehicle_type
ORDER BY
vehicle_type;
GO
-- Create the function
CREATE FUNCTION dbo.GetSourceSample ()
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
WITH SelectionCounts
AS (
-- Number of rows to select from each group
-- Minimum 3, total for all groups = 25
-- Choose one of the possible combinations at random
SELECT TOP (1)
nc = C.n,
nt = T.n,
nr = R.n
FROM dbo.Number C,
dbo.Number T,
dbo.Number R
WHERE C.n BETWEEN 3 AND 19
AND T.n BETWEEN 3 AND 19
AND R.n BETWEEN 3 AND 19
AND C.n + T.n + R.n = 25
ORDER BY
(SELECT R.n FROM dbo.Random R)
),
GroupCounts
AS (
--Find the total number of vehicles of each type
SELECT vehicle_type,
group_size = COUNT_BIG(*)
FROM dbo.Source
WHERE vehicle_type = 'C'
GROUP BY
vehicle_type
UNION ALL
SELECT vehicle_type,
group_size = COUNT_BIG(*)
FROM dbo.Source
WHERE vehicle_type = 'T'
GROUP BY
vehicle_type
UNION ALL
SELECT vehicle_type,
group_size = COUNT_BIG(*)
FROM dbo.Source
WHERE vehicle_type = 'R'
GROUP BY
vehicle_type
),
Parameters
AS (
SELECT GC.group_size,
GC.vehicle_type,
sample_size =
CASE GC.vehicle_type
WHEN 'C' THEN SC.nc
WHEN 'T' THEN SC.nt
WHEN 'R' THEN SC.nr
ELSE 0
END
FROM GroupCounts GC,
SelectionCounts SC
),
RandomRows
AS (
-- Generate a random row number for each row required from
-- each group. Maximum rows returned = 25 * group_count
--
-- The TOP (9223372036854775807 (= BIGINT.Max)) is just to
-- force the query plan to calculate the random row number
-- before the final join
SELECT TOP (9223372036854775807)
P.vehicle_type,
Random.rn
FROM Parameters P
CROSS
APPLY (
SELECT TOP (P.sample_size)
rn = CONVERT(BIGINT, R.n * P.group_size + 1)
FROM dbo.Number N,
dbo.Random R
ORDER BY
N.n ASC
) Random
ORDER BY
P.vehicle_type,
Random.rn
),
NumberedSourceRows
AS (
-- Number the source rows, restarting the numbering for each group
SELECT S.row_id,
S.vehicle_type,
S.vehicle_id,
rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)
FROM dbo.Source S
)
-- Fetch the numbered rows that match the random row number, per group
SELECT NSR.row_id,
NSR.vehicle_type,
NSR.vehicle_id
FROM RandomRows RR
JOIN NumberedSourceRows NSR
ON NSR.vehicle_type = RR.vehicle_type
AND NSR.rn = RR.rn;
GO
SET STATISTICS TIME ON;

SELECT GSS.row_id,
GSS.vehicle_type,
GSS.vehicle_id
FROM dbo.GetSourceSample() GSS;

SET STATISTICS TIME OFF;
GO
-- Drop the in-line function
IF OBJECT_ID(N'dbo.GetSourceSample', N'IF')
IS NOT NULL
DROP FUNCTION dbo.GetSourceSample

-- Drop the view wrapping NEWID
IF OBJECT_ID(N'dbo.Random', N'V')
IS NOT NULL
DROP VIEW dbo.Random;

-- Drop the combinations table
IF OBJECT_ID(N'dbo.Combinations', N'U')
IS NOT NULL
DROP TABLE dbo.Combinations;

-- Drop our test table
IF OBJECT_ID(N'dbo.Source', N'U')
IS NOT NULL
DROP TABLE dbo.Source;

Plan stayed exactly the same, 94ms average CPU time.




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #882723
« Prev Topic | Next Topic »

Add to briefcase ««1234»»»

Permissions Expand / Collapse