select random records but for each category

  • Paul White (3/14/2010)


    (though it is a shame that NEWID is illegal in a function):

    Old gaming advice... if you can't move forward, backward, left, right, up, down, then change "color". 😉

    You can access a VIEW from a UDF. One like the following allows for the use of NEWID() in a function... Same goes for GETDATE() and a couple of other things you supposedly can't use in a UDF.

    CREATE VIEW MyNewID

    AS

    SELECT NEWID() AS ANewID

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/14/2010)


    You can access a VIEW from a UDF. One like the following allows for the use of NEWID() in a function... Same goes for GETDATE() and a couple of other things you supposedly can't use in a UDF.

    Thanks Jeff! I had read that before, but did not make the connection. It will complicate the SELECT a bit, but I will have a go...

    The function will be non-deterministic of course, but that's kinda the point I suppose.

    Shame though, because the optimizer really hates non-deterministic things.

  • But not too complicated...

    SELECT TOP (1)

    nc = C.n,

    nt = T.n,

    nr = R.n

    FROM dbo.Number C,

    dbo.Number T,

    dbo.Number R

    WHERE C.n BETWEEN 3 AND 19

    AND T.n BETWEEN 3 AND 19

    AND R.n BETWEEN 3 AND 19

    AND C.n + T.n + R.n = 25

    ORDER BY

    (SELECT ANewID FROM dbo.MyNewID) --This used to be NEWID()

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/14/2010)


    But not too complicated...

    There is another one inside the CROSS APPLY. Not difficult either, I am just hoping the query plan holds together.

  • OK, all done. I had to include the RAND in the wrapper view too, for the same reason.

    Here is the full script:

    USE tempdb;

    -- Drop the in-line function

    IF OBJECT_ID(N'dbo.GetSourceSample', N'IF')

    IS NOT NULL

    DROP FUNCTION dbo.GetSourceSample

    GO

    -- Drop the view wrapping NEWID

    IF OBJECT_ID(N'dbo.Random', N'V')

    IS NOT NULL

    DROP VIEW dbo.Random;

    GO

    -- Drop the combinations table

    IF OBJECT_ID(N'dbo.Combinations', N'U')

    IS NOT NULL

    DROP TABLE dbo.Combinations;

    GO

    -- Drop our test table

    IF OBJECT_ID(N'dbo.Source', N'U')

    IS NOT NULL

    DROP TABLE dbo.Source;

    GO

    -- Create numbers table if required

    IF OBJECT_ID(N'dbo.Number', N'U')

    IS NULL

    BEGIN

    CREATE TABLE dbo.Number (n SMALLINT PRIMARY KEY);

    -- 736 SMALLINTs exactly fill one 8K page

    INSERT dbo.Number (n)

    SELECT TOP (736)

    ROW_NUMBER() OVER (ORDER BY (SELECT 0))

    FROM master.sys.columns C1,

    master.sys.columns C2,

    master.sys.columns C3;

    END;

    GO

    -- Create the test table

    CREATE TABLE dbo.Source

    (

    row_id INTEGER IDENTITY PRIMARY KEY,

    vehicle_type CHAR(1) NOT NULL,

    vehicle_id INTEGER NOT NULL

    );

    GO

    -- View wrapping NEWID

    CREATE VIEW dbo.Random

    WITH SCHEMABINDING

    AS SELECT n = RAND(CHECKSUM(NEWID()));

    GO

    -- All possible combinations of values [3...19]

    -- that sum to 25

    SELECT row_id = IDENTITY(INT, 0, 1),

    nc = C.n,

    nt = T.n,

    nr = R.n

    INTO dbo.Combinations

    FROM dbo.Number C,

    dbo.Number T,

    dbo.Number R

    WHERE C.n BETWEEN 3 AND 19

    AND T.n BETWEEN 3 AND 19

    AND R.n BETWEEN 3 AND 19

    AND C.n + T.n + R.n = 25

    -- Create clustered primary key

    ALTER TABLE dbo.Combinations

    ADD PRIMARY KEY (row_id)

    WITH (FILLFACTOR = 100);

    GO

    -- Add 100,000 random rows, roughly evenly distributed

    -- across vehicle types

    INSERT dbo.Source

    (

    vehicle_type,

    vehicle_id

    )

    SELECT TOP (100000)

    CASE

    -- Chance of being a car

    WHEN RAND(CHECKSUM(NEWID(), N1.n)) <= 0.3333 THEN 'C'

    -- Chance of being a truck

    WHEN RAND(CHECKSUM(NEWID(), N2.n)) <= 0.3333 THEN 'T'

    -- Chance of being an RV

    ELSE 'R'

    END,

    -- Random vehicle id

    ABS(CHECKSUM(NEWID()))

    FROM dbo.Number N1,

    dbo.Number N2,

    dbo.Number N3;

    GO

    -- A helpful index

    CREATE UNIQUE INDEX [UQ dbo.Source vehicle_type, row_id (vehicle_id)]

    ON dbo.Source

    (vehicle_type ASC, row_id ASC)

    INCLUDE (vehicle_id)

    WITH (FILLFACTOR = 100);

    GO

    -- Show the number of vehicles of each type

    SELECT vehicle_type,

    row_count = COUNT_BIG(*)

    FROM dbo.Source

    GROUP BY

    vehicle_type

    ORDER BY

    vehicle_type;

    GO

    -- Create the function

    CREATE FUNCTION dbo.GetSourceSample ()

    RETURNS TABLE

    WITH SCHEMABINDING

    AS RETURN

    WITH SelectionCounts

    AS (

    -- Number of rows to select from each group

    -- Minimum 3, total for all groups = 25

    -- Choose one of the possible combinations at random

    SELECT TOP (1)

    nc = C.n,

    nt = T.n,

    nr = R.n

    FROM dbo.Number C,

    dbo.Number T,

    dbo.Number R

    WHERE C.n BETWEEN 3 AND 19

    AND T.n BETWEEN 3 AND 19

    AND R.n BETWEEN 3 AND 19

    AND C.n + T.n + R.n = 25

    ORDER BY

    (SELECT R.n FROM dbo.Random R)

    ),

    GroupCounts

    AS (

    --Find the total number of vehicles of each type

    SELECT vehicle_type,

    group_size = COUNT_BIG(*)

    FROM dbo.Source

    WHERE vehicle_type = 'C'

    GROUP BY

    vehicle_type

    UNION ALL

    SELECT vehicle_type,

    group_size = COUNT_BIG(*)

    FROM dbo.Source

    WHERE vehicle_type = 'T'

    GROUP BY

    vehicle_type

    UNION ALL

    SELECT vehicle_type,

    group_size = COUNT_BIG(*)

    FROM dbo.Source

    WHERE vehicle_type = 'R'

    GROUP BY

    vehicle_type

    ),

    Parameters

    AS (

    SELECT GC.group_size,

    GC.vehicle_type,

    sample_size =

    CASE GC.vehicle_type

    WHEN 'C' THEN SC.nc

    WHEN 'T' THEN SC.nt

    WHEN 'R' THEN SC.nr

    ELSE 0

    END

    FROM GroupCounts GC,

    SelectionCounts SC

    ),

    RandomRows

    AS (

    -- Generate a random row number for each row required from

    -- each group. Maximum rows returned = 25 * group_count

    --

    -- The TOP (9223372036854775807 (= BIGINT.Max)) is just to

    -- force the query plan to calculate the random row number

    -- before the final join

    SELECT TOP (9223372036854775807)

    P.vehicle_type,

    Random.rn

    FROM Parameters P

    CROSS

    APPLY (

    SELECT TOP (P.sample_size)

    rn = CONVERT(BIGINT, R.n * P.group_size + 1)

    FROM dbo.Number N,

    dbo.Random R

    ORDER BY

    N.n ASC

    ) Random

    ORDER BY

    P.vehicle_type,

    Random.rn

    ),

    NumberedSourceRows

    AS (

    -- Number the source rows, restarting the numbering for each group

    SELECT S.row_id,

    S.vehicle_type,

    S.vehicle_id,

    rn = ROW_NUMBER() OVER (PARTITION BY S.vehicle_type ORDER BY S.row_id ASC)

    FROM dbo.Source S

    )

    -- Fetch the numbered rows that match the random row number, per group

    SELECT NSR.row_id,

    NSR.vehicle_type,

    NSR.vehicle_id

    FROM RandomRows RR

    JOIN NumberedSourceRows NSR

    ON NSR.vehicle_type = RR.vehicle_type

    AND NSR.rn = RR.rn;

    GO

    SET STATISTICS TIME ON;

    SELECT GSS.row_id,

    GSS.vehicle_type,

    GSS.vehicle_id

    FROM dbo.GetSourceSample() GSS;

    SET STATISTICS TIME OFF;

    GO

    -- Drop the in-line function

    IF OBJECT_ID(N'dbo.GetSourceSample', N'IF')

    IS NOT NULL

    DROP FUNCTION dbo.GetSourceSample

    -- Drop the view wrapping NEWID

    IF OBJECT_ID(N'dbo.Random', N'V')

    IS NOT NULL

    DROP VIEW dbo.Random;

    -- Drop the combinations table

    IF OBJECT_ID(N'dbo.Combinations', N'U')

    IS NOT NULL

    DROP TABLE dbo.Combinations;

    -- Drop our test table

    IF OBJECT_ID(N'dbo.Source', N'U')

    IS NOT NULL

    DROP TABLE dbo.Source;

    Plan stayed exactly the same, 94ms average CPU time.

  • Thanks for the input, its greatly appreciated. However, I'm trying some of the suggested queries and it keeps kicking me out an error:

    SELECT TOP (1)

    nc = C.n,

    nt = T.n,

    nr = R.n

    FROM dbo.Feed C,

    dbo.Feed T,

    dbo.Feed R

    WHERE C.n BETWEEN 3 AND 19

    AND T.n BETWEEN 3 AND 19

    AND R.n BETWEEN 3 AND 19

    AND C.n + T.n + R.n = 25

    ORDER BY

    (SELECT ANewID FROM dbo.MyNewID)

    I get:

    Msg 207, Level 16, State 1, Line 8

    Invalid column name 'n'.

    what does that mean?

  • SQL_NuB (3/15/2010)


    Thanks for the input, its greatly appreciated. However, I'm trying some of the suggested queries and it keeps kicking me out an error: Invalid column name 'n'.

    That was 'concept code' from Jeff to me to illustrate a point, it is not meant to run.

    You should be able to run my last contribution though - the one marked 'full script'. That includes Jeff's idea.

    Paul

  • I got an error on that was well referring to "N"

  • SQL_NuB (3/15/2010)


    I got an error on that was well referring to "N"

    Oh Lord. You're not using SQL Server 2000 are you?

    In case it is a copy/paste error, the full script is attached to this post as a text file.

    edit: fixed 2005-only bug!

  • LOL,

    No, I'm using 2005, I tried it in 08 to see if was an SQL version issue, but my 08 crashed. (good times)

  • SQL_NuB (3/15/2010)


    No, I'm using 2005, I tried it in 08 to see if was an SQL version issue, but my 08 crashed. (good times)

    :laugh:

    Actually, I just tried it on my 2005 instance and there is a bug! This line:

    SELECT row_id = IDENTITY(INTEGER, 0, 1), needs to read:

    SELECT row_id = IDENTITY(INT, 0, 1),

    I have changed my post and the text file attachment.

    Both forms work in 2008 - but not in 2005! :w00t:

  • I realise this has been answered, but I liked the challenge....

    Here's mine : not as quick or elegant as Paul/Jeff but I think the results are to my mind "more random" and the code is compact.

    ;WITH DATA AS

    (

    --here is the 10 percent part

    SELECT TOP 10 PERCENT

    row_id,

    source.vehicle_type,

    vehicle_id,

    --here is where a decision is made about which rows are part of the "three" per vehicle type

    CASE

    WHEN ROW_NUMBER() OVER(PARTITION BY source.vehicle_type ORDER BY NEW.ID) <4

    THEN 0

    ELSE 1

    END AS primary_rows

    FROM source

    -- generate a new guid for each row and use it to sort

    CROSS APPLY(SELECT NEWID()) AS NEW(ID)

    -- sorting using a column number !ugh! but how else to do it?

    ORDER BY 4, NEW.ID

    )

    -- now restrict to max 25 rows

    SELECT TOP 25 row_id,vehicle_type,vehicle_id

    FROM DATA

    ORDER BY primary_rows

    -- ensuring the "three" per type are always selected first

    Results of Paul/Jeff on 201 runs:

    vehicle_type count

    T1739

    R1653

    C1633

    Note the even distrubution...

    Results of MM on 201 runs:

    vehicle_type count

    T1137

    R2226

    C1662

    Note the resulting distribution reflects the source data distribution more closely...

    Source data distribution:

    vehicle_type count

    T22128

    R44485

    C33387

    There is an issue with the previous solution (Paul's) in that it goes wrong when the source table does not have enough records. If for example, you only have 2 type C records, the output looks like this:

    [font="Courier New"]row_idvehicle_typevehicle_id

    99999C1004507871

    99999C1004507871

    99999C1004507871

    100000C1320721006

    100000C1320721006

    100000C1320721006

    100000C1320721006

    100000C1320721006

    100000C1320721006

    100000C1320721006

    9512R116494338

    10254R1472793309

    22707R326756140

    34558R446140356

    38903R583366585

    79984R971756239

    88665R881544120

    93146R1713443659

    97315R1165209366

    98760R2042364618

    16998T134639693

    31935T519661065

    50739T551201165

    73209T1276413656

    89283T40684550[/font]

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • mister.magoo (3/15/2010)


    There is another problem with the previous solution

    Which previous solution? 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • mister.magoo (3/15/2010)


    I think the results are to my mind "more random"

    :blink: Really? 😉

    There are some good points in your post, but I have to disagree with that one.

    Your code is compact (though arguably who cares once it is in a function?) but slower, as you say.

    I will give some more thought to this later.

    I don't want to come across as just being critical - I enjoyed your post. It is an interesting challenge isn't it?

    Side note: I had an email from the OP last night clarifying the real requirements. He should be posting them here soon, just to add an interesting twist.

    Paul

  • Jeff Moden (3/15/2010)


    mister.magoo (3/15/2010)


    There is another problem with the previous solution

    Which previous solution? 🙂

    The one Paul provided http://www.sqlservercentral.com/Forums/Attachment5424.aspx

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

    Viewing 15 posts - 16 through 30 (of 33 total)

    You must be logged in to reply to this topic. Login to reply