RANDom traps for the unwary

  • For those interested in generating non-uniform random numbers (again, still pseudo-random), the late, great Dwain Camps wrote a very nice article on the subject at http://www.sqlservercentral.com/articles/SQL+Uniform+Random+Numbers/91103/ .

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @david-2 Poole,

    Awesome job on demonstrating the RAND() function of SQL Server and the "binning" and not so random nature of sequential seeds. Thanks for taking the time to put it together.

    --Jeff Moden



  • Sioban Krzywicki (5/9/2016)

    I'd rather do this

    SELECT (ABS(CAST(CHECKSUM(NewID()) AS bigint)) %20) + 1

    than divide by 10. Matter of preference, really.

    Also correct, but to get a value in the range <1,20> this way you have to cast to bigint.

    Matter of preference...

    (From the CPU's point of view, the cheapest divisors are 2, 4, 8 and other powers of two, because the division can be done by shifting bits.)
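
    To make the two ideas above concrete, here is a minimal sketch. The variable @Worst is a hypothetical stand-in for the one value CHECKSUM can return that ABS cannot handle as an INT, and the bit mask is my assumption of what "divide by shifting bits" would look like in T-SQL for a power-of-two range.

    ```sql
    -- Sketch only: @Worst is a hypothetical stand-in for an unlucky CHECKSUM(NEWID()) result.
    DECLARE @Worst INT = -2147483648;

    -- Widening to BIGINT before ABS leaves room for +2147483648 (returns 9 here).
    SELECT CastThenAbs = (ABS(CAST(@Worst AS BIGINT)) % 20) + 1;

    -- A power-of-two range needs neither ABS nor a cast: the mask keeps the low 4 bits (returns 1 here).
    SELECT MaskedValue = (@Worst & 15) + 1;

    -- The same two techniques applied to fresh random values.
    SELECT OneTo20 = (ABS(CAST(CHECKSUM(NEWID()) AS BIGINT)) % 20) + 1,
           OneTo16 = (CHECKSUM(NEWID()) & 15) + 1;
    ```

    The mask only works when the range is a power of two, so it complements the modulo approach rather than replacing it.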

  • wojciech.muszynski (5/9/2016)


    SELECT *

    FROM Person.StateProvince

    WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS INT)) % 100) < 10

    What happens if CAST((BINARY_CHECKSUM(*) * RAND()) AS INT) = -2^31?

    The INT range is <-2 147 483 648, +2 147 483 647>.

    If BINARY_CHECKSUM(*) * RAND() = -2 147 483 648,

    then the ABS() function cannot return a valid INT and raises an error: Arithmetic overflow.

    It happens only once in about 4 billion (2^32) values,

    but that means it will happen the day after you deploy to production.

    Heh... absolutely correct. It doesn't happen often but it does happen. I ran the following twice last night. The first time, it hit the failing number in the first 2 minutes. The second time took almost 30 minutes. It's part of the reason I only use the method for the generation of "small" million row test tables where all I have to do is hit the {f5} key to quickly rerun.

    DECLARE @Bitbucket BIGINT;

     SELECT @Bitbucket = ABS(CHECKSUM(NEWID()))%20+1
       FROM dbo.fnTally(1,10000000000)
    ;
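
    Rather than waiting for NEWID() to land on the bad value, the failure can also be forced. This is only a sketch: @Worst is a hypothetical variable holding the lowest INT, and I'd expect the CATCH block to report an arithmetic overflow (error 8115, as far as I know).

    ```sql
    -- Sketch only: force the failing input instead of waiting for a random hit.
    DECLARE @Worst INT = -2147483648;   -- lowest INT; hypothetical stand-in for an unlucky CHECKSUM

    BEGIN TRY
        -- ABS on an INT has no room for +2147483648, so this statement should fail.
        SELECT Result = ABS(@Worst) % 20 + 1;
    END TRY
    BEGIN CATCH
        -- Report what went wrong instead of aborting the batch.
        SELECT ErrorNumber  = ERROR_NUMBER(),
               ErrorMessage = ERROR_MESSAGE();
    END CATCH;
    ```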

    fnTally looks like the following...

    CREATE FUNCTION [dbo].[fnTally]
    /**********************************************************************************************************************
     Purpose:
     Return a column of BIGINTs from @ZeroOrOne up to and including @MaxN with a max value of 1 Trillion.

     As a performance note, it takes about 00:02:10 (hh:mm:ss) to generate 1 Billion numbers to a throw-away variable.

     Usage:
    --===== Syntax example (Returns BIGINT)
     SELECT t.N
       FROM dbo.fnTally(@ZeroOrOne,@MaxN) t
    ;

     Notes:
     1. Based on Itzik Ben-Gan's cascading CTE (cCTE) method for creating a "readless" Tally Table source of BIGINTs.
        Refer to the following URLs for how it works and an introduction to how it replaces certain loops.
        http://www.sqlservercentral.com/articles/T-SQL/62867/
        http://sqlmag.com/sql-server/virtual-auxiliary-table-numbers
     2. To start a sequence at 0, @ZeroOrOne must be 0 or NULL. Any other value that's convertible to the BIT data-type
        will cause the sequence to start at 1.
     3. If @ZeroOrOne = 1 and @MaxN = 0, no rows will be returned.
     5. If @MaxN is negative or NULL, a "TOP" error will be returned.
     6. @MaxN must be a positive number >= the value of @ZeroOrOne, up to and including 1 Trillion. If a larger
        number is used, the function will silently truncate after 1 Trillion. If you actually need a sequence with
        that many values, you should consider using a different tool. ;-)
     7. There will be a substantial reduction in performance if "N" is sorted in descending order. If a descending
        sort is required, use code similar to the following. Performance will decrease by about 27% but it's still
        very fast, especially compared with just doing a simple descending sort on "N", which is about 20 times slower.
        If @ZeroOrOne is a 0, in this case, remove the "+1" from the code.

        DECLARE @MaxN BIGINT;
         SELECT @MaxN = 1000;
         SELECT DescendingN = @MaxN-N+1
           FROM dbo.fnTally(1,@MaxN);

     8. There is no performance penalty for sorting "N" in ascending order because the output is explicitly sorted by
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

     Revision History:
     Rev 00 - Unknown     - Jeff Moden
            - Initial creation with error handling for @MaxN.
     Rev 01 - 09 Feb 2013 - Jeff Moden
            - Modified to start at 0 or 1.
     Rev 02 - 16 May 2013 - Jeff Moden
            - Removed error handling for @MaxN because of exceptional cases.
     Rev 03 - 22 Apr 2015 - Jeff Moden
            - Modify to handle 1 Trillion rows for experimental purposes.
    **********************************************************************************************************************/

            (@ZeroOrOne BIT, @MaxN BIGINT)
    RETURNS TABLE WITH SCHEMABINDING AS
     RETURN WITH
      E1(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                SELECT 1)                                  --10^1 or 10 rows
    , E4(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d)      --10^4 or 10 Thousand rows
    ,E12(N) AS (SELECT 1 FROM E4 a, E4 b, E4 c)            --10^12 or 1 Trillion rows
                SELECT N = 0 WHERE ISNULL(@ZeroOrOne,0)= 0 --Conditionally start at 0.
                 UNION ALL
                SELECT TOP(@MaxN) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E12 -- Values from 1 to @MaxN
    ;
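
    For anyone who wants to try the function for the kind of "small" million row test table mentioned above, something like the following should do. It's a sketch only: #TestData is a hypothetical temp table name, and it assumes dbo.fnTally has already been created in the current database.

    ```sql
    -- Sketch only: assumes dbo.fnTally (above) exists; #TestData is a hypothetical name.
    IF OBJECT_ID('tempdb..#TestData') IS NOT NULL DROP TABLE #TestData;

    -- Build one million rows, each with its own random integer from 1 to 20.
    -- The BIGINT cast keeps ABS away from the INT overflow discussed earlier.
     SELECT t.N,
            RandomInt = (ABS(CAST(CHECKSUM(NEWID()) AS BIGINT)) % 20) + 1
       INTO #TestData
       FROM dbo.fnTally(1,1000000) t
    ;

    -- Quick sanity check: each value from 1 to 20 should appear roughly 50,000 times.
     SELECT RandomInt, Occurrences = COUNT(*)
       FROM #TestData
      GROUP BY RandomInt
      ORDER BY RandomInt
    ;
    ```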

    --Jeff Moden



  • If you want INTEGER values that are guaranteed not to run into a bitter end, then you might do something like the following because NEWID() is guaranteed to never have the value of FFFFFFFF-FFFF-FFFF-FFFF-FFFFFFFFFFFF.

    ABS(CONVERT(BIGINT,CONVERT(BINARY(16),NEWID())))%20+1

    If you believe in portable code, you can use CAST instead of CONVERT. Either way is almost as fast as the simple CHECKSUM method and certainly faster than using GO, a WHILE loop, or other form of RBAR.
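
    As a quick sketch of the CAST alternative, the following converts a single GUID both ways so the two results can be compared side by side. The variable @Guid is hypothetical and only there so that both expressions see the same value.

    ```sql
    -- Sketch only: @Guid is a hypothetical variable holding one NEWID() value.
    DECLARE @Guid UNIQUEIDENTIFIER = NEWID();

    -- Both columns apply the same conversion chain, so they should always agree.
     SELECT ViaConvert = ABS(CONVERT(BIGINT,CONVERT(BINARY(16),@Guid)))%20+1,
            ViaCast    = ABS(CAST(CAST(@Guid AS BINARY(16)) AS BIGINT))%20+1
    ;
    ```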

    --Jeff Moden



  • If you bothered to read the documentation you would know that for a given seed, RAND will produce the same sequence. This is by design. And as for .NET - yes it works as intended. Given the same seed, you get the same result set. Again - this is by design.
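
    The documented behaviour is easy enough to check for yourself. The following is a minimal sketch, with an arbitrary seed of 12345 and hypothetical variable names. If the documentation holds, both columns should come back as 'yes'.

    ```sql
    -- Sketch only: the seed 12345 and the variable names are arbitrary, hypothetical choices.
    DECLARE @Seeded1 FLOAT, @Next1 FLOAT, @Seeded2 FLOAT, @Next2 FLOAT;

    SELECT @Seeded1 = RAND(12345);   -- seeded call
    SELECT @Next1   = RAND();        -- unseeded call that follows it on this connection
    SELECT @Seeded2 = RAND(12345);   -- reseed with the same value
    SELECT @Next2   = RAND();        -- unseeded call after the reseed

    -- Compare the two runs.
     SELECT SeededCallsMatch = CASE WHEN @Seeded1 = @Seeded2 THEN 'yes' ELSE 'no' END,
            NextCallsMatch   = CASE WHEN @Next1   = @Next2   THEN 'yes' ELSE 'no' END
    ;
    ```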

  • Bill Talada (5/9/2016)


    I believe scientists rely on hardware solutions such as radioactive decay detection devices to get true randomness.

    I stay away from pseudo random data generation and try to approach real-world distributions. There are likely 100,000 times more Bakers, Cooks, Millers, Smiths, and Tailors than there are of my last name. Most things in the world follow a power law distribution, natural distribution, or other mathematical deviation from linear. Scientists looking for life on other planets are guided by the rule that: God doesn't use straight lines. And the most common error made is in thinking that a bell curve will look anything like the standard one taught in statistics classes.

    Unless you have a professor who forces everyone to fit in the bell curve. Really, I got a B+ in class with a 93 overall.

  • kiwood (5/10/2016)


    If you bothered to read the documentation you would know that for a given seed, RAND will produce the same sequence. This is by design. And as for .NET - yes it works as intended. Given the same seed, you get the same result set. Again - this is by design.

    To whom are you "speaking"? The author or someone else?

    --Jeff Moden



  • kiwood (5/10/2016)


    If you bothered to read the documentation you would know that for a given seed, RAND will produce the same sequence. This is by design. And as for .NET - yes it works as intended. Given the same seed, you get the same result set. Again - this is by design.

    The brief for the article was to produce broad-brush coverage of the RAND function, aimed at an audience with a range of abilities.

    Of course I read the documentation and quite a few things besides.

    That RAND produces the same value for the same seed is not the issue. The issue is that there is a simple linear equation applied to the seed to produce the random number. I have not yet investigated predicting the output of RAND for post-seed selections but I suspect that it may indeed be predictable.
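
    A rough way to see the linear relationship for yourself is to compare RAND for neighbouring seeds. This is only a sketch with an arbitrary list of seeds. If the relationship really is linear, the StepFromPrevious column should be close to constant from row to row.

    ```sql
    -- Sketch only: the seed list is arbitrary; the derived table alias s(Seed) is hypothetical.
     SELECT s.Seed,
            RandValue        = RAND(s.Seed),
            StepFromPrevious = RAND(s.Seed) - RAND(s.Seed - 1)   -- gap to the neighbouring seed
       FROM (VALUES (1),(2),(3),(4),(5),(1000),(1001)) AS s(Seed)
      ORDER BY s.Seed
    ;
    ```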

    If RAND can be predicted then that tells you that certain uses should be avoided because you don't want people being able to predetermine the outcome.

    As to RAND working as intended, my issue is that it was not designed with set-based processing in mind, even though it lives in a system that is a set-based processor.

    Yes, you can CROSS APPLY RAND in a view, seed it with the CHECKSUM of NEWID, and various other things. My contention is that a RAND function designed for use in set-based processing should not need workarounds.
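
    For readers who haven't run into this, a minimal sketch of the problem and the usual workaround follows. The five-row derived table is just a hypothetical stand-in for any row source.

    ```sql
    -- Sketch only: the VALUES list is a hypothetical stand-in for a real table.
     SELECT t.N,
            PlainRand  = RAND(),                    -- evaluated once, so every row gets the same value
            SeededRand = RAND(CHECKSUM(NEWID()))    -- reseeded per row, so the values differ
       FROM (VALUES (1),(2),(3),(4),(5)) AS t(N)
    ;
    ```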

    Finally, the simple equation used for RAND will generate the same value for the same seed even if run on separate machines. Again this opens up the predictability issue.

    My contention is that a given seed on two separate machines should produce different results. Whatever seed you supply should have some interaction with a machine property or the SQL Server licence key.

    Again, I know that is not the way that the function is written or documented. I'm saying that RAND could and should be written to be non-predictable and unique to the installation of SQL Server.

  • Jeff Moden (5/10/2016)


    Heh... absolutely correct. It doesn't happen often but it does happen.

    Scientists have calculated that the chances of anything so patently absurd actually existing are millions to one.

    But magicians have calculated that million-to-one chances crop up nine times out of ten.

    --Terry Pratchett--

  • wojciech.muszynski (5/11/2016)


    Jeff Moden (5/10/2016)


    Heh... absolutely correct. It doesn't happen often but it does happen.

    Scientists have calculated that the chances of anything so patently absurd actually existing are millions to one.

    But magicians have calculated that million-to-one chances crop up nine times out of ten.

    --Terry Pratchett--

    Heh... and like previously said, nature has proven that 1 in 4 billion chances will occur the day after the release to production. 😀

    --Jeff Moden



  • Very interesting article. I'm also enjoying seeing the methods that various people have used to get random results.
