Gathering Random Data

Question

Gathering Random Data

Brian Knight

One Orange Chip

Points: 27259
More actions
March 21, 2003 at 12:00 am

#159702

Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/bknight/randomvalues.asp
Brian Knight
Free SQL Server Training Webinars

Viewing 10 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Mat Culpepper SSC Enthusiast Points: 138 More actions · Answer 1

Another method would be to use a combination of the RAND() and NEWID() functions with the Checksum function. You could use the following function to seed the rand function with a different number for each row:

Checksum(NewID())

So the randomizer would look like this:

Rand(Checksum(NewID()))

the Checksum function returns an integer (hash value) across a value, or series of columnns.

abhi_develops Say Hey Kid Points: 676 More actions · Answer 2

abhi_develops

Say Hey Kid

Points: 676

April 7, 2003 at 4:05 am

#453312

Hi Brain,

Thats a good tip. Thanks!

Cheers!

Abhijit

Hairy Hat Man SSC Enthusiast Points: 168 More actions · Answer 3

SELECT TOP 3 ProductID, ProductName

FROM products

ORDER BY NEWID()

Tends to return

17 Alice Mutton

3 Aniseed Syrup

40 Boston Crab Meat

About 50% of the time!

I suspect this is due to the fact that whilst a GUID is unique, it's make-up is not random.

Thomas Rushton SSC-Insane Points: 22649 More actions · Answer 4

Brian,

Thanks!

I have just been trying to generate random data, and wondering why rand() doesn't behave as expected. I thought I would wrap it up in a single function, but that didn't work either (something about "error 443 invalid use of rand within a function").

The wailing and gnashing of teeth was just about to start when I thought to check here. Hurrah! Result! {:)} Much hassle saved.

Thomas Rushton
blog: https://thelonedba.wordpress.com

lemosley01 Grasshopper Points: 24 More actions · Answer 5

This is how the rand() function in C/C++ works as well. It's psuedorandom, meaning, once you seed it, you will get the same sequence of numbers over and over until you reseed it. You seed/reseed in C/C++ using srand, typically the seed number is the current time. I think I read that the reason this was done was so that scientific experiments could be repeated over and over with the set of initially randomly generated numbers by using the same seed (that and there was not hardware at the time to make truly random numbers like intel incorporates in the newer processors).

I don't have an SQL 7 system handy, but on SQL 2000 RAND takes a seed value. If you want to generate random numbers then you'll need to use a seed value everytime you call RAND - SQL 2000 BOL has a topic for using RAND where they illustrate using DATEPART and GETDATE to see the RAND function so you will get different values each time. I found it by going opening BOL, clicking the Index tab, and entering RAND. That should get you RAND functions, and the second topic when you double-click it is 'Using RAND.'

Here is an excerpt from that topic the explains it:

The RAND function is a pseudorandom number generator that operates in a manner similar to the C run-time library rand function. If no seed is provided, the system generates its own variable seed numbers. If you call RAND with a seed value, you must use variable seed values to generate random numbers. If you call RAND multiple times with the same seed value, it returns the same generated value. This script returns the same value for the calls to RAND because they all use the same seed value:

SELECT RAND(159784)SELECT RAND(159784)SELECT RAND(159784)

A common way to generate random numbers from RAND is to include something relatively variable as the seed value, such as adding several parts of a GETDATE:

SELECT RAND( (DATEPART(mm, GETDATE()) * 100000 )           + (DATEPART(ss, GETDATE()) * 1000 )           + DATEPART(ms, GETDATE()) )

When you use an algorithm based on GETDATE to generate seed values, RAND can still generate duplicate values if the calls to RAND are made within the interval of the smallest datepart used in the algorithm. This is especially likely if the calls to RAND are included in a single batch. Multiple calls to RAND in a single batch can be executed within the same millisecond, which is the smallest increment of DATEPART. In this case, incorporate a value based on something other than time to generate the seed values.

Razvan Socol SSCarpal Tunnel Points: 4962 More actions · Answer 6

The view+function trick for using RAND() is very interesting, but not so efficient. On my system, it takes 330ms to select a random row from a table with about 8300 rows using the UDF, and only 30ms using the NEWID() function. Both query plans look the same, but profiler shows the difference is not in "Logical Reads", but in "CPU Time".

Razvan

Chris Cathers SSC Veteran Points: 218 More actions · Answer 7

For thoes who may be interesed, I did a little analysis to prove to my self, and others, that the Rand(Checksum(@GUID)) algorithem for generating random numbers will provide a good random number dispersal. But dont take my word for it, I've included the test code I used so you can try it too!

/*****************************************************

** Provide some statistical analysis or the random value

** generation script:

** ** Rand(Checksum(@GUID)) **

**

** This is my attempt to provide some statistical analysis

** of the output generated by this algorithem. I wanted to

** prove to my self, and any who are interested, that the

** algorithem produceds an even distribution of the numbers

** it generates.

**

** To facilitat this approach I created a temp table #rand_gen_analysis,

** which is currently commented out, to hold the statistacly

** generated data. (Uncomment to crate the table, then re-comment

** for the data generation runs)

**

** The Analysis

**

** I wrote some analysis scripts to try and break down the

** random number distribution. I broke the numbers into

** tenth of a whole segments. This I called the Dispersion.

** Idealy we will want the same number of data points in each

** dispersion range.

** Ne next value, the number of Rands is the number of

** data points in the dispersion segment. Idealy this number

** would be the same in all dispersion points.

** Next is Mean_Rands. This is the average of number of Rands

** over the 10 dispersion points.

** Mean_deveation is how far off the average each dispersion

** point has drifted.

** Lastly is the AVG_Rands. This is the average distrabution of

** the data points in the dispersion segment. Idealy we will want

** this value to fall in the m iddle of the dispersion range.

** I.e ofr Dispersion range 0.2 to 0.3 we would want to see an

** avg_rands of 0.25.

**

** My test showed that there was at most the Mean deveation drift

** 0.4% to 0.3%. This value decreased as the sample size increased.

** This is to be expeced and is desirable. As the sample size increased

** the Avg rands drew closer and closer to the Dispersion Mean. This

** is also desirable and expectd. Therefore, I can conclude that

** the algorithem is a good random generator.

**

*****************************************************/

-- CREATE TABLE #RAND_GEN_ANALYSIS

-- ( RECORD_COUNT INT IDENTITY,

-- GUID UNIQUEIDENTIFIER,

-- RAND_VAL DECIMAL(8,8)

-- )

-- DROP TABLE #RAND_GEN_ANALYSIS

DECLARE @RECORD_COUNT INT,

@GUID UNIQUEIDENTIFIER

SET @RECORD_COUNT = 0

WHILE @RECORD_COUNT < 10000

BEGIN

SELECT @GUID = NEWID(),

@RECORD_COUNT = @RECORD_COUNT + 1

INSERT INTO #RAND_GEN_ANALYSIS

(GUID, RAND_VAL)

VALUES (@GUID,

Rand(Checksum(@GUID))

)

END

/***************************************************

** Finished data generating Now ready to start

** statistacal analysis

**

****************************************************/

DECLARE @MN_RANDS DECIMAL(18,3)

DECLARE @DNA TABLE

( DISPERSION VARCHAR(4),

NBR_RANDS INT,

AVG_RANDS DECIMAL(8,8)

)

INSERT INTO @DNA

SELECT DISPERSION,

COUNT(RAND_VAL) AS NBR_RANDS,

AVG(RAND_VAL) AS AVG_RANDS

FROM (

SELECT CASE

WHEN RAND_VAL BETWEEN 0 AND 0.1 THEN '0'

WHEN RAND_VAL BETWEEN 0.1 AND 0.2 THEN '0.1'

WHEN RAND_VAL BETWEEN 0.2 AND 0.3 THEN '0.2'

WHEN RAND_VAL BETWEEN 0.3 AND 0.4 THEN '0.3'

WHEN RAND_VAL BETWEEN 0.4 AND 0.5 THEN '0.4'

WHEN RAND_VAL BETWEEN 0.5 AND 0.6 THEN '0.5'

WHEN RAND_VAL BETWEEN 0.6 AND 0.7 THEN '0.6'

WHEN RAND_VAL BETWEEN 0.7 AND 0.8 THEN '0.7'

WHEN RAND_VAL BETWEEN 0.8 AND 0.9 THEN '0.8'

WHEN RAND_VAL BETWEEN 0.9 AND 1 THEN '0.9'

ELSE 'Not yet categorized'

END as dispersion,

rand_val

FROM #RAND_GEN_ANALYSIS) RGA

GROUP BY DISPERSION

ORDER BY DISPERSION

SELECT @MN_RANDS = AVG(NBR_RANDS) FROM @DNA

SELECT DISPERSION,

NBR_RANDS,

@MN_RANDS AS Mean_rands,

CAST((NBR_RANDS / @MN_RANDS) AS DECIMAL(9,8)) AS Mean_deveation,

AVG_RANDS

FROM @DNA

order by Mean_deveation desc

Have fun!

German E. Casaretto Grasshopper Points: 17 More actions · Answer 8

I've read a lot about how to generate random real-world data in SQl, but yet I haven't found what I'm needing.

In my database there are many dimensions. If I use a Rand() function to generate each dimension ID to populate the Fact table, the distribution or dispersion is even. Wht I need is something to generate random data, but without an even distribution. This is, some of the dimension members must not have any record in the fact table, and the amount of records in the fact for each ID must be different (with Rand() these amounts where very similar)

Thanks

jcb SSCertifiable Points: 6574 More actions · Answer 9

If u just want to return n rows from a table (or view, etc) at random u can use the newid() approach wich a temp table.

set nocount on

select newid() as rand, tt.*

into #temp

from target_table tt

set nocount off

select top n * from #temp order by rand

You can easily put it in a stored procedure for use with any table

Gathering Random Data

Cookies on SQLServerCentral