RE: Generating Test Data: Part 1 - Generating Random Integers and Floats

SSChampion

Points: 12388

March 26, 2012 at 9:15 am

Jeff Moden (3/26/2012)
dwain.c (3/26/2012)
Outstanding article Jeff! Just what the doctor ordered for something I'm working on at this instant.
I can't wait for the purists to berate you for using "pseudo" random numbers though. :w00t:
I might be safe for the next 10 minutes or so. Although the "next" random value is certainly predictable, you'd have to know a fair bit about how NEWID() is generated to predict the next value 😀

For testing purposes (both scientific and software) pseudo-random numbers are preferable to truly random numbers*, because you want to see how the system responds to the entire range of possible inputs. A truly random number source cannot be trusted to give you a representative sample.

* This is, of course, assuming that the pseudo-random number generator produces uniformly-distributed data. More on that in a bit.

Edit: more on that: --

So the question becomes: does abs(checksum(newid())) produce a relatively uniform distribution of values?

To test that, I created a dataset with the following code: (NOTE -- this generated 57 million rows on my test machine -- use with caution!)

select abs(checksum(newid())) as RandValue

into RandIDTesting

from sys.all_columns ac1

cross join sys.all_columns ac2

I then wrote the following code to see how the data is distributed:

declare @RangeMin int = 0

declare @RangeMax int = 2147483647

declare @RangeInc int = 65536

declare @RangeCount int = @RangeMax/@RangeInc

select @RangeMin, @RangeMax, @RangeInc, @RangeCount;

with Ranges as (

select top (@RangeCount+1)

@RangeMin + @RangeInc * (row_number() over (order by (select null))-1) as RangeStart,

@RangeMin + @RangeInc * (row_number() over (order by (select null)))-1 as RangeEnd

from sys.all_columns ac1

cross join sys.all_columns ac2

)

select RangeStart, RangeEnd, (select count(*) from RandIDTesting where RandValue between RangeStart and RangeEnd) as RangeSize

from Ranges

group by RangeStart, RangeEnd

order by RangeStart

This produced a list of ranges and how many of our pseudo-random numbers fell into that range. In my testing, all of the ranges had between roughly 1500 to roughly 1700 numbers in it.

So in this case, this method did produce a relatively uniform sample set. This is not conclusive, but you can methods similar to the above to test for yourself.