March 21, 2003 at 12:00 am
Comments posted to this topic are about the content posted at http://www.sqlservercentral.com/columnists/bknight/randomvalues.asp
Brian Knight
Free SQL Server Training Webinars
March 24, 2003 at 8:19 am
Another method would be to use a combination of the RAND() and NEWID() functions with the Checksum function. You could use the following function to seed the rand function with a different number for each row:
Checksum(NewID())
So the randomizer would look like this:
Rand(Checksum(NewID()))
the Checksum function returns an integer (hash value) across a value, or series of columnns.
April 7, 2003 at 4:05 am
Hi Brain,
Thats a good tip. Thanks!
Cheers!
Abhijit
June 26, 2003 at 4:02 am
SELECT TOP 3 ProductID, ProductName
FROM products
ORDER BY NEWID()
Tends to return
17 Alice Mutton
3 Aniseed Syrup
40 Boston Crab Meat
About 50% of the time!
I suspect this is due to the fact that whilst a GUID is unique, it's make-up is not random.
November 6, 2003 at 5:00 am
Brian,
Thanks!
I have just been trying to generate random data, and wondering why rand() doesn't behave as expected. I thought I would wrap it up in a single function, but that didn't work either (something about "error 443 invalid use of rand within a function").
The wailing and gnashing of teeth was just about to start when I thought to check here. Hurrah! Result! {:)} Much hassle saved.
Thomas Rushton
blog: https://thelonedba.wordpress.com
March 26, 2004 at 6:13 am
This is how the rand() function in C/C++ works as well. It's psuedorandom, meaning, once you seed it, you will get the same sequence of numbers over and over until you reseed it. You seed/reseed in C/C++ using srand, typically the seed number is the current time. I think I read that the reason this was done was so that scientific experiments could be repeated over and over with the set of initially randomly generated numbers by using the same seed (that and there was not hardware at the time to make truly random numbers like intel incorporates in the newer processors).
I don't have an SQL 7 system handy, but on SQL 2000 RAND takes a seed value. If you want to generate random numbers then you'll need to use a seed value everytime you call RAND - SQL 2000 BOL has a topic for using RAND where they illustrate using DATEPART and GETDATE to see the RAND function so you will get different values each time. I found it by going opening BOL, clicking the Index tab, and entering RAND. That should get you RAND functions, and the second topic when you double-click it is 'Using RAND.'
Here is an excerpt from that topic the explains it:
The RAND function is a pseudorandom number generator that operates in a manner similar to the C run-time library rand function. If no seed is provided, the system generates its own variable seed numbers. If you call RAND with a seed value, you must use variable seed values to generate random numbers. If you call RAND multiple times with the same seed value, it returns the same generated value. This script returns the same value for the calls to RAND because they all use the same seed value:
SELECT RAND(159784)SELECT RAND(159784)SELECT RAND(159784)
A common way to generate random numbers from RAND is to include something relatively variable as the seed value, such as adding several parts of a GETDATE:
SELECT RAND( (DATEPART(mm, GETDATE()) * 100000 ) + (DATEPART(ss, GETDATE()) * 1000 ) + DATEPART(ms, GETDATE()) )
When you use an algorithm based on GETDATE to generate seed values, RAND can still generate duplicate values if the calls to RAND are made within the interval of the smallest datepart used in the algorithm. This is especially likely if the calls to RAND are included in a single batch. Multiple calls to RAND in a single batch can be executed within the same millisecond, which is the smallest increment of DATEPART. In this case, incorporate a value based on something other than time to generate the seed values.
March 29, 2004 at 3:11 am
The view+function trick for using RAND() is very interesting, but not so efficient. On my system, it takes 330ms to select a random row from a table with about 8300 rows using the UDF, and only 30ms using the NEWID() function. Both query plans look the same, but profiler shows the difference is not in "Logical Reads", but in "CPU Time".
Razvan
July 25, 2005 at 10:53 am
For thoes who may be interesed, I did a little analysis to prove to my self, and others, that the Rand(Checksum(@GUID)) algorithem for generating random numbers will provide a good random number dispersal. But dont take my word for it, I've included the test code I used so you can try it too!
/*****************************************************
** Provide some statistical analysis or the random value
** generation script:
** ** Rand(Checksum(@GUID)) **
**
** This is my attempt to provide some statistical analysis
** of the output generated by this algorithem. I wanted to
** prove to my self, and any who are interested, that the
** algorithem produceds an even distribution of the numbers
** it generates.
**
** To facilitat this approach I created a temp table #rand_gen_analysis,
** which is currently commented out, to hold the statistacly
** generated data. (Uncomment to crate the table, then re-comment
** for the data generation runs)
**
** The Analysis
**
** I wrote some analysis scripts to try and break down the
** random number distribution. I broke the numbers into
** tenth of a whole segments. This I called the Dispersion.
** Idealy we will want the same number of data points in each
** dispersion range.
** Ne next value, the number of Rands is the number of
** data points in the dispersion segment. Idealy this number
** would be the same in all dispersion points.
** Next is Mean_Rands. This is the average of number of Rands
** over the 10 dispersion points.
** Mean_deveation is how far off the average each dispersion
** point has drifted.
** Lastly is the AVG_Rands. This is the average distrabution of
** the data points in the dispersion segment. Idealy we will want
** this value to fall in the m iddle of the dispersion range.
** I.e ofr Dispersion range 0.2 to 0.3 we would want to see an
** avg_rands of 0.25.
**
** My test showed that there was at most the Mean deveation drift
** 0.4% to 0.3%. This value decreased as the sample size increased.
** This is to be expeced and is desirable. As the sample size increased
** the Avg rands drew closer and closer to the Dispersion Mean. This
** is also desirable and expectd. Therefore, I can conclude that
** the algorithem is a good random generator.
**
*****************************************************/
-- CREATE TABLE #RAND_GEN_ANALYSIS
-- ( RECORD_COUNT INT IDENTITY,
-- GUID UNIQUEIDENTIFIER,
-- RAND_VAL DECIMAL(8,8)
-- )
-- DROP TABLE #RAND_GEN_ANALYSIS
DECLARE @RECORD_COUNT INT,
@GUID UNIQUEIDENTIFIER
SET @RECORD_COUNT = 0
WHILE @RECORD_COUNT < 10000
BEGIN
SELECT @GUID = NEWID(),
@RECORD_COUNT = @RECORD_COUNT + 1
INSERT INTO #RAND_GEN_ANALYSIS
(GUID, RAND_VAL)
VALUES (@GUID,
Rand(Checksum(@GUID))
)
END
/***************************************************
** Finished data generating Now ready to start
** statistacal analysis
**
****************************************************/
DECLARE @MN_RANDS DECIMAL(18,3)
DECLARE @Dna TABLE
( DISPERSION VARCHAR(4),
NBR_RANDS INT,
AVG_RANDS DECIMAL(8,8)
)
INSERT INTO @Dna
SELECT DISPERSION,
COUNT(RAND_VAL) AS NBR_RANDS,
AVG(RAND_VAL) AS AVG_RANDS
FROM (
SELECT CASE
WHEN RAND_VAL BETWEEN 0 AND 0.1 THEN '0'
WHEN RAND_VAL BETWEEN 0.1 AND 0.2 THEN '0.1'
WHEN RAND_VAL BETWEEN 0.2 AND 0.3 THEN '0.2'
WHEN RAND_VAL BETWEEN 0.3 AND 0.4 THEN '0.3'
WHEN RAND_VAL BETWEEN 0.4 AND 0.5 THEN '0.4'
WHEN RAND_VAL BETWEEN 0.5 AND 0.6 THEN '0.5'
WHEN RAND_VAL BETWEEN 0.6 AND 0.7 THEN '0.6'
WHEN RAND_VAL BETWEEN 0.7 AND 0.8 THEN '0.7'
WHEN RAND_VAL BETWEEN 0.8 AND 0.9 THEN '0.8'
WHEN RAND_VAL BETWEEN 0.9 AND 1 THEN '0.9'
ELSE 'Not yet categorized'
END as dispersion,
rand_val
FROM #RAND_GEN_ANALYSIS) RGA
GROUP BY DISPERSION
ORDER BY DISPERSION
SELECT @MN_RANDS = AVG(NBR_RANDS) FROM @Dna
SELECT DISPERSION,
NBR_RANDS,
@MN_RANDS AS Mean_rands,
CAST((NBR_RANDS / @MN_RANDS) AS DECIMAL(9,8)) AS Mean_deveation,
AVG_RANDS
FROM @Dna
order by Mean_deveation desc
Have fun!
April 21, 2006 at 10:15 am
I've read a lot about how to generate random real-world data in SQl, but yet I haven't found what I'm needing.
In my database there are many dimensions. If I use a Rand() function to generate each dimension ID to populate the Fact table, the distribution or dispersion is even. Wht I need is something to generate random data, but without an even distribution. This is, some of the dimension members must not have any record in the fact table, and the amount of records in the fact for each ID must be different (with Rand() these amounts where very similar)
Thanks
February 22, 2007 at 7:46 am
If u just want to return n rows from a table (or view, etc) at random u can use the newid() approach wich a temp table.
set nocount on
select newid() as rand, tt.*
into #temp
from target_table tt
set nocount off
select top n * from #temp order by rand
You can easily put it in a stored procedure for use with any table
Viewing 10 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply