Gathering Random Data

  • Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/bknight/randomvalues.asp

  • Another method would be to use a combination of the RAND() and NEWID() functions with the Checksum function. You could use the following function to seed the rand function with a different number for each row:

    Checksum(NewID())

    So the randomizer would look like this:

    Rand(Checksum(NewID()))

    the Checksum function returns an integer (hash value) across a value, or series of columnns.

  • Hi Brain,

    Thats a good tip. Thanks!

    Cheers!

    Abhijit

  • SELECT TOP 3 ProductID, ProductName

    FROM products

    ORDER BY NEWID()

    Tends to return

    17 Alice Mutton

    3 Aniseed Syrup

    40 Boston Crab Meat

    About 50% of the time!

    I suspect this is due to the fact that whilst a GUID is unique, it's make-up is not random.

  • Brian,

    Thanks!

    I have just been trying to generate random data, and wondering why rand() doesn't behave as expected. I thought I would wrap it up in a single function, but that didn't work either (something about "error 443 invalid use of rand within a function").

    The wailing and gnashing of teeth was just about to start when I thought to check here. Hurrah! Result! {:)} Much hassle saved.

    Thomas Rushton
    blog: https://thelonedba.wordpress.com

  • This is how the rand() function in C/C++ works as well. It's psuedorandom, meaning, once you seed it, you will get the same sequence of numbers over and over until you reseed it. You seed/reseed in C/C++ using srand, typically the seed number is the current time. I think I read that the reason this was done was so that scientific experiments could be repeated over and over with the set of initially randomly generated numbers by using the same seed (that and there was not hardware at the time to make truly random numbers like intel incorporates in the newer processors).

    I don't have an SQL 7 system handy, but on SQL 2000 RAND takes a seed value. If you want to generate random numbers then you'll need to use a seed value everytime you call RAND - SQL 2000 BOL has a topic for using RAND where they illustrate using DATEPART and GETDATE to see the RAND function so you will get different values each time. I found it by going opening BOL, clicking the Index tab, and entering RAND. That should get you RAND functions, and the second topic when you double-click it is 'Using RAND.'

    Here is an excerpt from that topic the explains it:

    The RAND function is a pseudorandom number generator that operates in a manner similar to the C run-time library rand function. If no seed is provided, the system generates its own variable seed numbers. If you call RAND with a seed value, you must use variable seed values to generate random numbers. If you call RAND multiple times with the same seed value, it returns the same generated value. This script returns the same value for the calls to RAND because they all use the same seed value:

    SELECT RAND(159784)SELECT RAND(159784)SELECT RAND(159784)

    A common way to generate random numbers from RAND is to include something relatively variable as the seed value, such as adding several parts of a GETDATE:

    SELECT RAND( (DATEPART(mm, GETDATE()) * 100000 )           + (DATEPART(ss, GETDATE()) * 1000 )           + DATEPART(ms, GETDATE()) )

    When you use an algorithm based on GETDATE to generate seed values, RAND can still generate duplicate values if the calls to RAND are made within the interval of the smallest datepart used in the algorithm. This is especially likely if the calls to RAND are included in a single batch. Multiple calls to RAND in a single batch can be executed within the same millisecond, which is the smallest increment of DATEPART. In this case, incorporate a value based on something other than time to generate the seed values.

  • The view+function trick for using RAND() is very interesting, but not so efficient. On my system, it takes 330ms to select a random row from a table with about 8300 rows using the UDF, and only 30ms using the NEWID() function. Both query plans look the same, but profiler shows the difference is not in "Logical Reads", but in "CPU Time".

    Razvan

  • For thoes who may be interesed, I did a little analysis to prove to my self, and others, that the Rand(Checksum(@GUID)) algorithem for generating random numbers will provide a good random number dispersal. But dont take my word for it, I've included the test code I used so you can try it too!

     

    /*****************************************************

    **  Provide some statistical analysis or the random value

    **  generation script:

    **            ** Rand(Checksum(@GUID)) **

    **

    **  This is my attempt to provide some statistical analysis

    **  of the output generated by this algorithem. I wanted to

    **  prove to my self, and any who are interested, that the

    **  algorithem produceds an even distribution of the numbers

    **  it generates.

    ** 

    **  To facilitat this approach I created a temp table #rand_gen_analysis,

    **  which is currently commented out, to hold the statistacly

    **  generated data. (Uncomment to crate the table, then re-comment

    **  for the data generation runs)

    ** 

    **          The Analysis

    ** 

    **  I wrote some analysis scripts to try and break down the

    **  random number distribution. I broke the numbers into

    **  tenth of a whole segments. This I called the Dispersion.

    **  Idealy we will want the same number of data points in each

    **  dispersion range.

    **  Ne next value, the number of Rands is the number of

    **  data points in the dispersion segment. Idealy this number

    **  would be the same in all dispersion points.

    **  Next is Mean_Rands. This is the average of number of Rands

    **  over the 10 dispersion points.

    **  Mean_deveation is how far off the average each dispersion

    **  point has drifted.

    **  Lastly is the AVG_Rands. This is the average distrabution of

    **  the data points in the dispersion segment. Idealy we will want

    **  this value to fall in the m iddle of the dispersion range.

    **  I.e ofr Dispersion range 0.2 to 0.3 we would want to see an

    **  avg_rands of 0.25.

    ** 

    **  My test showed that there was at most the Mean deveation drift

    **  0.4% to 0.3%. This value decreased as the sample size   increased.

    **  This is to be expeced and is desirable. As the sample size increased

    **  the Avg rands drew closer and closer to the Dispersion Mean. This

    **  is also desirable and expectd. Therefore, I can conclude that

    **  the algorithem is a good random generator.

    ** 

    *****************************************************/

    -- CREATE TABLE #RAND_GEN_ANALYSIS

    -- ( RECORD_COUNT        INT IDENTITY,

    --   GUID                UNIQUEIDENTIFIER,

    --   RAND_VAL            DECIMAL(8,8)

    -- )

    -- DROP TABLE #RAND_GEN_ANALYSIS

    DECLARE @RECORD_COUNT INT,

            @GUID         UNIQUEIDENTIFIER

    SET @RECORD_COUNT = 0

    WHILE @RECORD_COUNT < 10000

    BEGIN

     

      SELECT @GUID = NEWID(),

             @RECORD_COUNT = @RECORD_COUNT + 1

      INSERT INTO #RAND_GEN_ANALYSIS

      (GUID, RAND_VAL)

      VALUES (@GUID,

              Rand(Checksum(@GUID))

             )

    END

    /***************************************************

    **  Finished data generating Now ready to start

    **  statistacal analysis

    **

    ****************************************************/

    DECLARE @MN_RANDS    DECIMAL(18,3)

    DECLARE @Dna TABLE

    ( DISPERSION     VARCHAR(4),

      NBR_RANDS      INT,

      AVG_RANDS      DECIMAL(8,8)

    )

    INSERT INTO @Dna

    SELECT DISPERSION,

           COUNT(RAND_VAL) AS NBR_RANDS,

           AVG(RAND_VAL) AS AVG_RANDS

     FROM (

      SELECT CASE

               WHEN RAND_VAL BETWEEN 0   AND 0.1 THEN '0'

               WHEN RAND_VAL BETWEEN 0.1 AND 0.2 THEN '0.1'

               WHEN RAND_VAL BETWEEN 0.2 AND 0.3 THEN '0.2'

               WHEN RAND_VAL BETWEEN 0.3 AND 0.4 THEN '0.3'

               WHEN RAND_VAL BETWEEN 0.4 AND 0.5 THEN '0.4'

               WHEN RAND_VAL BETWEEN 0.5 AND 0.6 THEN '0.5'

               WHEN RAND_VAL BETWEEN 0.6 AND 0.7 THEN '0.6'

               WHEN RAND_VAL BETWEEN 0.7 AND 0.8 THEN '0.7'

               WHEN RAND_VAL BETWEEN 0.8 AND 0.9 THEN '0.8'

               WHEN RAND_VAL BETWEEN 0.9 AND 1 THEN '0.9'

               ELSE 'Not yet categorized'

            END as dispersion,

            rand_val

      FROM  #RAND_GEN_ANALYSIS) RGA

    GROUP BY DISPERSION

    ORDER BY DISPERSION

    SELECT @MN_RANDS = AVG(NBR_RANDS) FROM @Dna

    SELECT DISPERSION,

           NBR_RANDS,

           @MN_RANDS AS Mean_rands,

           CAST((NBR_RANDS / @MN_RANDS) AS DECIMAL(9,8)) AS Mean_deveation,

           AVG_RANDS

     FROM @Dna

    order by Mean_deveation desc

     

    Have fun!

  • I've read a lot about how to generate random real-world data in SQl, but yet I haven't found what I'm needing.

    In my database there are many dimensions. If I use a Rand() function to generate each dimension ID to populate the Fact table, the distribution or dispersion is even. Wht I need is something to generate random data, but without an even distribution. This is, some of the dimension members must not have any record in the fact table, and the amount of records in the fact for each ID must be different (with Rand() these amounts where very similar)

    Thanks

  • If u just want to return n rows from a table (or view, etc) at random u can use the newid() approach wich a temp table.

    set nocount on

    select newid() as rand, tt.*

    into #temp

    from target_table tt

    set nocount off

    select top n * from #temp order by rand

    You can easily put it in a stored procedure for use with any table

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply