Generating Test Data: Part 1 - Generating Random Integers and Floats

  • Great extrapolation of the KISS principle, Jeff.

    Need it to be said ... I LOVE IT :w00t:

    Johan

    Learn to play, play to learn !

    Don't drive faster than your guardian angel can fly ...
    but keeping both feet on the ground won't get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • Excellent post! Great examples, great code, and easy to follow!

    Nice job Jeff!

  • Krtyknm (3/26/2012)


    Hi Jeff,

    Nice Article, Keep going.

    I have a question on system tables: most of the examples use sys.tables for generating random numbers. Suppose developers don't have access to the system tables; how can they get random numbers then?

    Thanks,

    Karthik

    In addition to what has already been said about using other tables, or creating a tally table, it is also possible to create a dynamic tally table using CTEs in SQL Server 2005 and newer. You can find numerous examples of these in the forums and articles on SSC.
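    A minimal sketch of such a CTE-driven tally, assuming SQL Server 2005 or newer; it needs no table access at all, because the base rows come from constant SELECTs (the names E1/E2/E4 are just illustrative conventions):

    ```sql
    -- build 10 rows from constants (no system-table access needed),
    -- square the row count twice with cross joins, then number the rows
    with E1(N) as (
            select 1 union all select 1 union all select 1 union all select 1 union all select 1
            union all
            select 1 union all select 1 union all select 1 union all select 1 union all select 1
         ),                                              -- 10 rows
         E2(N) as (select 1 from E1 a cross join E1 b),  -- 100 rows
         E4(N) as (select 1 from E2 a cross join E2 b),  -- 10,000 rows
         Tally(N) as (select row_number() over (order by (select null)) from E4)
    select top (1000)
           abs(checksum(newid())) % 100 + 1 as SomeRandomInteger  -- random ints 1 to 100
    from Tally;
    ```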

  • Jeff Moden (3/26/2012)


    dwain.c (3/26/2012)


    Outstanding article Jeff! Just what the doctor ordered for something I'm working on at this instant.

    I can't wait for the purists to berate you for using "pseudo" random numbers though. :w00t:

    I might be safe for the next 10 minutes or so. Although the "next" random value is certainly predictable, you'd have to know a fair bit about how NEWID() is generated to predict the next value 😀

    For testing purposes (both scientific and software) pseudo-random numbers are preferable to truly random numbers*, because you want to see how the system responds to the entire range of possible inputs. A truly random number source cannot be trusted to give you a representative sample.

    * This is, of course, assuming that the pseudo-random number generator produces uniformly-distributed data. More on that in a bit.

    Edit -- more on that:

    So the question becomes: does abs(checksum(newid())) produce a relatively uniform distribution of values?

    To test that, I created a dataset with the following code: (NOTE -- this generated 57 million rows on my test machine -- use with caution!)

    select abs(checksum(newid())) as RandValue
    into RandIDTesting
    from sys.all_columns ac1
    cross join sys.all_columns ac2

    I then wrote the following code to see how the data is distributed:

    declare @RangeMin int = 0
    declare @RangeMax int = 2147483647
    declare @RangeInc int = 65536
    declare @RangeCount int = @RangeMax/@RangeInc

    select @RangeMin, @RangeMax, @RangeInc, @RangeCount;

    with Ranges as (
        select top (@RangeCount+1)
            @RangeMin + @RangeInc * (row_number() over (order by (select null))-1) as RangeStart,
            @RangeMin + @RangeInc * (row_number() over (order by (select null)))-1 as RangeEnd
        from sys.all_columns ac1
        cross join sys.all_columns ac2
    )
    select RangeStart, RangeEnd,
        (select count(*) from RandIDTesting where RandValue between RangeStart and RangeEnd) as RangeSize
    from Ranges
    group by RangeStart, RangeEnd
    order by RangeStart

    This produced a list of ranges and how many of our pseudo-random numbers fell into each one. In my testing, every range held between roughly 1,500 and roughly 1,700 numbers.

    So in this case, the method did produce a relatively uniform sample set. This is not conclusive, but you can use methods similar to the above to test for yourself.
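    As a side note, the correlated subquery above runs once per range; the same distribution check can be done in a single scan by bucketing with integer division. A sketch against the RandIDTesting table created above:

    ```sql
    -- one pass over the table: integer division assigns each value to a
    -- 65536-wide bucket, and a GROUP BY counts the bucket sizes
    declare @RangeInc int = 65536;

    select Bucket * @RangeInc                  as RangeStart,
           Bucket * @RangeInc + @RangeInc - 1  as RangeEnd,
           count(*)                            as RangeSize
    from (select RandValue / @RangeInc as Bucket from RandIDTesting) d
    group by Bucket
    order by Bucket;
    ```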

  • sknox (3/26/2012)


    For testing purposes (both scientific and software) pseudo-random numbers are preferable to truly random numbers*, because you want to see how the system responds to the entire range of possible inputs. A truly random number source cannot be trusted to give you a representative sample.

    * This is, of course, assuming that the pseudo-random number generator produces uniformly-distributed data. More on that in a bit.

    That's a good point to bring up. A random distribution will create a uniform distribution across a range of data, but cannot on its own replicate any non-uniform data patterns. So if you're looking to find out if there's a normal distribution in your data (or any number of other patterns across the set), using random data may not be a good option.

    This would be one of the big caveats in the "why would you need random data" discussion. The random set will allow you to test the behavior of a variety of inputs at the detail level, but won't help with testing the set as a whole.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (#4) (3/26/2012)


    sknox (3/26/2012)


    For testing purposes (both scientific and software) pseudo-random numbers are preferable to truly random numbers*, because you want to see how the system responds to the entire range of possible inputs. A truly random number source cannot be trusted to give you a representative sample.

    * This is, of course, assuming that the pseudo-random number generator produces uniformly-distributed data. More on that in a bit.

    That's a good point to bring up. A random distribution will create a uniform distribution across a range of data, but cannot on its own replicate any non-uniform data patterns. So if you're looking to find out if there's a normal distribution in your data (or any number of other patterns across the set), using random data may not be a good option.

    This would be one of the big caveats in the "why would you need random data" discussion. The random set will allow you to test the behavior of a variety of inputs at the detail level, but won't help with testing the set as a whole.

    Hmmmm... the constraints on range and domain aren't enough to satisfy this problem? Such constraints could actually form a "bell curve" (or whatever) using a CASE statement to "weight" the outcome of the constrained random generator.
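    One hedged sketch of that CASE-weighting idea (the slice boundaries below are illustrative, not from the article): map unequal slices of a uniform 0-99 value onto the output values so that the middle values occur most often.

    ```sql
    -- fold a uniform 0-99 value into a crude bell shape by giving the
    -- middle output values wider slices of the uniform range
    select case
             when u < 5  then 1   --  5% of rows
             when u < 25 then 2   -- 20%
             when u < 75 then 3   -- 50% (the "hump")
             when u < 95 then 4   -- 20%
             else 5               --  5%
           end as WeightedValue
    from (select abs(checksum(newid())) % 100 as u
          from sys.all_columns) s;
    ```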

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    Here is an alternate method that I use to generate pseudo-random numbers. The basic method is to take the right 7 bytes from the NEWID function, convert them to a BIGINT, and then apply the modulus operator. No need for the ABS function, since 7 bytes can only produce a positive BIGINT number.

    if object_id('tempdb..#t','U') is not null begin drop table #t end

    -- Generate 20,000,000 rows
    select top 20000000
        NUMBER = identity(int,1,1)
    into #t
    from
        (select top 4473 * from master.dbo.syscolumns) a
        cross join
        (select top 4473 * from master.dbo.syscolumns) b

    -- Show distribution of rowcount around average of 40000
    select
        a.RandomNo,
        Rows = count(*)
    from (
        select RandomNo = (convert(bigint,convert(varbinary(7),newid()))%500)+1
        from #t aa
    ) a
    group by a.RandomNo
    order by count(*), a.RandomNo

    RandomNo             Rows
    -------------------- -----------
    335                  39455
    3                    39457
    76                   39481
    426                  39489
    494                  39535
    242                  39539
    278                  39539
    490                  39548
    445                  39553
    244                  39566
    ...
    124                  40400
    228                  40402
    425                  40410
    286                  40434
    45                   40458
    463                  40463
    373                  40531
    152                  40586

    (500 row(s) affected)

  • Jeff Moden (3/26/2012)


    Matt Miller (#4) (3/26/2012)


    sknox (3/26/2012)


    For testing purposes (both scientific and software) pseudo-random numbers are preferable to truly random numbers*, because you want to see how the system responds to the entire range of possible inputs. A truly random number source cannot be trusted to give you a representative sample.

    * This is, of course, assuming that the pseudo-random number generator produces uniformly-distributed data. More on that in a bit.

    That's a good point to bring up. A random distribution will create a uniform distribution across a range of data, but cannot on its own replicate any non-uniform data patterns. So if you're looking to find out if there's a normal distribution in your data (or any number of other patterns across the set), using random data may not be a good option.

    This would be one of the big caveats in the "why would you need random data" discussion. The random set will allow you to test the behavior of a variety of inputs at the detail level, but won't help with testing the set as a whole.

    Hmmmm... the constraints on range and domain aren't enough to satisfy this problem? Such constraints could actually form a "bell curve" (or whatever) using a CASE statement to "weight" the outcome of the constrained random generator.

    That's kind of what I meant by the "on its own" comment. You can use the random data generator to pull in representative data in all allowed ranges, but you would need to play with the frequency or weight based on how far away from the mean you happen to be. Assuming you have some knowledge of your data, you can shape your test data to match, using the random set as a base.


  • Excellent article Jeff.

    Nice coincidence today... I went to the site to find how you did this, and here's the article explaining it all.

    Thanks for taking the time for this really great article that explains the how and why.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • That's a good point to bring up. A random distribution will create a uniform distribution across a range of data

    I heartily agree. It's been a long time since I studied statistical distributions but a basic understanding of them is hugely useful. It would be great to have a method of generating random data that approximated a distribution, whether it be Gaussian or exponential decay, or an F distribution or whatever.

    A common phenomenon is a column that might validly accept one of, say, 30 integers. The vast majority of the time people record one of five values, and the frequency of the others tapers off, with a few being used exceedingly rarely. If you were testing things like index cardinality and column statistics generation, I wonder whether you'd get more representative results if your test data could mimic the distribution you expect to occur in production.

    ...One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important... Bertrand Russell

  • Michael Valentine Jones (3/26/2012)


    Here is an alternate method that I use to generate pseudo-random numbers. The basic method is to take the right 7 bytes from the NEWID function, convert them to a BIGINT, and then apply the modulus operator. No need for the ABS function, since 7 bytes can only produce a positive BIGINT number.

    if object_id('tempdb..#t','U') is not null begin drop table #t end

    -- Generate 20,000,000 rows
    select top 20000000
        NUMBER = identity(int,1,1)
    into #t
    from
        (select top 4473 * from master.dbo.syscolumns) a
        cross join
        (select top 4473 * from master.dbo.syscolumns) b

    -- Show distribution of rowcount around average of 40000
    select
        a.RandomNo,
        Rows = count(*)
    from (
        select RandomNo = (convert(bigint,convert(varbinary(7),newid()))%500)+1
        from #t aa
    ) a
    group by a.RandomNo
    order by count(*), a.RandomNo

    RandomNo             Rows
    -------------------- -----------
    335                  39455
    3                    39457
    76                   39481
    426                  39489
    494                  39535
    242                  39539
    278                  39539
    490                  39548
    445                  39553
    244                  39566
    ...
    124                  40400
    228                  40402
    425                  40410
    286                  40434
    45                   40458
    463                  40463
    373                  40531
    152                  40586

    (500 row(s) affected)

    Like I said in the article, the conversion to VARBINARY will slow things down and to no good end if you don't really need BIGINT for the random integer. If you really want BIGINT capability (and I realize that wasn't one of your goals in your example), I believe you'd also have to convert the whole NEWID() to VARBINARY.

    I also thought you were involved in some testing that showed the use of the square root of the final number of desired rows as a TOP for the self joined table in the Cross Join really wasn't worth it.

    The main point that I'm trying to make is that if it's too complicated, folks won't use it.

    --Jeff Moden



  • Matt Miller (#4) (3/26/2012)


    Jeff Moden (3/26/2012)


    Matt Miller (#4) (3/26/2012)


    sknox (3/26/2012)


    For testing purposes (both scientific and software) pseudo-random numbers are preferable to truly random numbers*, because you want to see how the system responds to the entire range of possible inputs. A truly random number source cannot be trusted to give you a representative sample.

    * This is, of course, assuming that the pseudo-random number generator produces uniformly-distributed data. More on that in a bit.

    That's a good point to bring up. A random distribution will create a uniform distribution across a range of data, but cannot on its own replicate any non-uniform data patterns. So if you're looking to find out if there's a normal distribution in your data (or any number of other patterns across the set), using random data may not be a good option.

    This would be one of the big caveats in the "why would you need random data" discussion. The random set will allow you to test the behavior of a variety of inputs at the detail level, but won't help with testing the set as a whole.

    Hmmmm... the constraints on range and domain aren't enough to satisfy this problem? Such constraints could actually form a "bell curve" (or whatever) using a CASE statement to "weight" the outcome of the constrained random generator.

    That's kind of what I meant by the "on its own" comment. You can use the random data generator to pull in representative data in all allowed ranges, but you would need to play with the frequency or weight based on how far away from the mean you happen to be. Assuming you have some knowledge of your data, you can shape your test data to match, using the random set as a base.

    Ah... understood. Thanks, Matt.

    --Jeff Moden



  • GPO (3/26/2012)


    That's a good point to bring up. A random distribution will create a uniform distribution across a range of data

    I heartily agree. It's been a long time since I studied statistical distributions but a basic understanding of them is hugely useful. It would be great to have a method of generating random data that approximated a distribution, whether it be Gaussian or exponential decay, or an F distribution or whatever.

    A common phenomenon is a column that might validly accept one of, say, 30 integers. The vast majority of the time people record one of five values, and the frequency of the others tapers off, with a few being used exceedingly rarely. If you were testing things like index cardinality and column statistics generation, I wonder whether you'd get more representative results if your test data could mimic the distribution you expect to occur in production.

    Hmmmm... maybe there needs to be a Part 4 to this series.

    --Jeff Moden



  • WayneS (3/26/2012)


    Excellent article Jeff.

    Nice coincidence today... I went to the site to find how you did this, and here's the article explaining it all.

    Thanks for taking the time for this really great article that explains the how and why.

    I know I said it before but thank you for the time you spent helping with the review.

    --Jeff Moden



  • GPO said:

    It would be great to have a method of generating random data that approximated a distribution, whether it be Gaussian or exponential decay, or an F distribution or whatever.

    The usual approach is to push the numbers from the uniform distribution through the inverse of the target distribution's cumulative distribution function (inverse transform sampling). This is not for the faint of heart. I've done it before (not in SQL) for a Weibull distribution.

    This article shows how it can be done for a Gaussian distribution:

    http://murison.alpheratz.net/Maple/GaussianDistribution/GaussianDistribution.pdf
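    For the Gaussian case specifically, the Box-Muller transform is a common alternative to inverting the CDF and is short enough to sketch in T-SQL (the mean and standard deviation values below are illustrative):

    ```sql
    -- Box-Muller: two independent uniforms in (0,1] become one normally
    -- distributed value.  LOG() is the natural log in T-SQL.
    declare @Mean float = 100.0, @StdDev float = 15.0;

    select @Mean + @StdDev *
           sqrt(-2.0 * log(u1)) * cos(2.0 * pi() * u2) as GaussianValue
    from (select (abs(checksum(newid())) % 1000000 + 1) / 1000000.0 as u1,
                 (abs(checksum(newid())) % 1000000 + 1) / 1000000.0 as u2) r;
    ```

    The uniforms are offset into (0, 1] so that log(u1) never sees zero; cross-apply the derived table against a tally to generate a whole column of such values.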


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.

Viewing 15 posts - 16 through 30 (of 60 total)
