RANDom traps for the unwary

  • Comments posted to this topic are about the item RANDom traps for the unwary

  • Did you consider to create a view with Rand() column?

    CREATE VIEW [Common].[vwRandomView]

    AS

    SELECT RAND() Result

    and use CROSS APPLY with this view?

    It worked for me ๐Ÿ˜‰

  • SELECT *

    FROM Person.StateProvince

    WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS INT)) % 100) < 10

    What happens if CAST((BINARY_CHECKSUM(*) * RAND()) AS INT = - 2^31 ?

    Int Range is <-2 147 483 648, +2 147 483 647>

    If BINARY_CHECKSUM(*) * RAND() = -2 147 483 648

    then ABS() function doesn't work properly and raises error: Arithmetic overflow.

    It happens only once per 4 billions (2^32),

    but this mean it will happen day after deploy on production.

  • Most built-in random number generators are "cheap" Linear Congruential Generators:

    Xn+1 = (a Xn + c) Mod m

    They produce a fixed output sequence. The sequence is started at a point determined by the seed. The seed is similar to saying "start the sequence at index n". Re-seeding the generator starts it at the SAME point.

    In SQL:

    -- Seed

    SELECT RAND(12345)

    GO

    SELECT RAND()

    GO 50

    Will produce 50 reasonable random numbers. Running the whole batch again will produce the SAME sequence, because the seed is the same.

    SELECT RAND(12345)

    GO 50

    will produce the SAME value 50 times because the generator was re-seeded (set back to the same initial index on every run).

    The correct usage is seed ONCE, use MANY times.

    However beware, when used in a select, it is only evaluated once, eg

    SELECT RAND(),*

    FROM sys.objects

    GO

    will output the SAME value.

    -- Seed ONCE (based on the current millisecond)

    SELECTRAND(DATEPART(MILLISECOND,GETDATE()))

    GO

    -- Use MANY

    SELECTRAND()

    GO50

    will produce 50 reasonable random numbers, and every run of the whole batch will be different.

    Ultimately, it is the quality of the INITIAL seed that governs the output sequence. There are numerous strategies for a good seed: current date, GUID, CHECKSUM etc. Develop one that works for your situation, but remember: SAME seed = SAME sequence.

  • ...It happends only once per 4 bilions (2^32),

    but this mean it will happen day after deploy on production.

    This made my day! ๐Ÿ˜€

  • Thank you! I was planning on writing a similar response. The problem here is not with the Random function, but with the author's confusion about the proper use and limitations of the Random function.

  • Thanks for the article.

  • I believe I saw Jeff Moden of these forums post a simple, truly random generator.

    SELECT (ABS(CHECKSUM(NEWID())) %20) + 1

    This generates random numbers between 1 and 20. To have a different range, change the Modulo number.

    Very elegant.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    Itโ€™s unpleasantly like being drunk.
    Whatโ€™s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Sioban Krzywicki (5/9/2016)


    I believe I saw Jeff Moden of these forums post a simple, truly random generator.

    SELECT (ABS(CHECKSUM(NEWID())) %20) + 1

    This generates random numbers between 1 and 20. To have a different range, change the Modulo number.

    Very elegant.

    CHECKSUM(NEWID()) can be -2 147 483 648.

    Select ABS( -2147483648) -> Msg 8115, LEVEL 16, State 2, Line 2 Arithmetic overflow error converting expression to data type int.

    Better version:

    SELECT (ABS(CHECKSUM(NEWID())/10) %20) + 1

    louie1487 78804 (5/9/2016)


    ...It happens only once per 4 billions (2^32),

    but this mean it will happen day after deploy on production.

    This made my day! ๐Ÿ˜€

    It's happend!

  • wojciech.muszynski (5/9/2016)


    Sioban Krzywicki (5/9/2016)


    I believe I saw Jeff Moden of these forums post a simple, truly random generator.

    SELECT (ABS(CHECKSUM(NEWID())) %20) + 1

    This generates random numbers between 1 and 20. To have a different range, change the Modulo number.

    Very elegant.

    CHECKSUM(NEWID()) can be -2 147 483 648.

    Select ABS( -2147483648) -> Msg 8115, LEVEL 16, State 2, Line 2 Arithmetic overflow error converting expression to data type int.

    Better version:

    SELECT (ABS(CHECKSUM(NEWID())/10) %20) + 1

    louie1487 78804 (5/9/2016)


    ...It happens only once per 4 billions (2^32),

    but this mean it will happen day after deploy on production.

    This made my day! ๐Ÿ˜€

    It's happend!

    I'd rather do this

    SELECT (ABS(CAST(CHECKSUM(NewID()) AS bigint)) %20) + 1

    than divide by 10. Matter of preference, really.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    Itโ€™s unpleasantly like being drunk.
    Whatโ€™s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • I believe scientists rely on hardware solutions such as radioactive decay detection devices to get true randomness.

    I stay away from pseudo random data generation and try to approach real-world distributions. There are likely 100,000 times more Bakers, Cooks, Millers, Smiths, and Tailors than there are of my last name. Most things in the world follow a power law distribution, natural distribution, or other mathematical deviation from linear. Scientists looking for life on other planets are guided by the rule that: God doesn't use straight lines. And the most common error made is in thinking that a bell curve will look anything like the standard one taught in statistics classes.

  • I use the

    ABS(CONVERT(BIGINT, CHECKSUM(NEWID())))

    method whenever I need random numbers. I've just tested with the method described by the author and found

    1)

      OutOfOrder=500049

      InOrder=499951

    2)

      Low occurrence=9907

      High occurrence=10139

    Both look pretty random to me.

  • Given the ascendency of data science I think there is scope for some articles utilising the different statiscal techniques.

    I did enjoy the data mining series on this site.. I've learned more about statistics in the past 5 years than I did at school. Now I have a real world context a dry subject has come to life

  • robert.bruhin (5/9/2016)


    Thank you! I was planning on writing a similar response. The problem here is not with the Random function, but with the author's confusion about the proper use and limitations of the Random function.

    What confusion and what proper use?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Andrew Morgan (5/9/2016)


    However beware, when used in a select, it is only evaluated once, eg

    SELECT RAND(),*

    FROM sys.objects

    GO

    will output the SAME value.

    That and a workaround substantial enough for the generation of simple, non-critical test data was covered in the article.

    -- Seed ONCE (based on the current millisecond)

    SELECTRAND(DATEPART(MILLISECOND,GETDATE()))

    GO

    -- Use MANY

    SELECTRAND()

    GO50

    will produce 50 reasonable random numbers, and every run of the whole batch will be different.

    Use that for generating a million pseudo random values for simple, non-critical test data and see how interested you'll be in doing multiple repeated tests. ๐Ÿ˜‰

    Ultimately, it is the quality of the INITIAL seed that governs the output sequence. There are numerous strategies for a good seed: current date, GUID, CHECKSUM etc. Develop one that works for your situation, but remember: SAME seed = SAME sequence.

    Agreed... that and the scale, or rather, the lack of scale of the values. The 15 decimal places offered by FLOAT in SQL Server and similar functions in other languages is quick to form "pockets" of data simply due to the lack of scale, as also demonstrated in the article. Of course, a lot of calculations done by even Statisticians have the same problem but the calculations are frequently "good enough".

    For the generation of simple, non-critical test data and non-critical pseudo random selection, it's sufficient and I believe that's what the article was aimed at. If we're talking true randomness, obviously neither the methods in the article nor most programming-level functions will suffice.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 26 total)

You must be logged in to reply to this topic. Login to reply