

SSCForever
Group: General Forum Members
Last Login: Yesterday @ 11:34 PM
Points: 41,633,
Visits: 38,988


Matt Miller (#4) (3/26/2012)
sknox (3/26/2012) For testing purposes (both scientific and software) pseudorandom numbers are preferable to truly random numbers*, because you want to see how the system responds to the entire range of possible inputs. A truly random number source cannot be trusted to give you a representative sample.
* This is, of course, assuming that the pseudorandom number generator produces uniformlydistributed data. More on that in a bit.
That's a good point to bring up. A random distribution will create a uniform distribution across a range of data, but cannot on its own replicate any nonuniform data patterns. So if you're looking to find out if there's a normal distribution in your data (or any number of other patterns across the set), using random data may not be a good option. This would be one of those big caveats in the "why would you need random data". The random set will allow you to test for behavior of a varity of inputs at the detail level, but won't help with test the set as a whole.
Hmmmm... the constraints on range and domain aren't enough to satisfy this problem? Such constraints could actually form a "bell curve" (or whatever) using a CASE statement to "weight" the outcome of the constrained random generator.
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems




Hall of Fame
Group: General Forum Members
Last Login: Saturday, October 24, 2015 2:31 AM
Points: 3,158,
Visits: 11,771


Here is an alternate method that I use to generate the pseudo random numbers. The basic method is to take the right 7 bytes from the NEWID function and convert that to a BIGINT before applying the MODULUS operator. No need for the ABS function, since 7 bytes can only produce a positive BIGINT number.
if object_id('tempdb..#t','U') is not null begin drop table #t end
 Generate 20,000,000 rows select top 20000000 NUMBER = identity(int,1,1) into #t from (select top 4473 * from master.dbo.syscolumns) a cross join (select top 4473 * from master.dbo.syscolumns) b
 Show distribution of rowcount around average of 40000 select a.RandomNo, Rows = count(*) from ( select RandomNo = (convert(bigint,convert(varbinary(7),newid()))%500)+1 from #t aa ) a group by a.RandomNo order by count(*), a.RandomNo
RandomNo Rows   335 39455 3 39457 76 39481 426 39489 494 39535 242 39539 278 39539 490 39548 445 39553 244 39566 ... ... ... 124 40400 228 40402 425 40410 286 40434 45 40458 463 40463 373 40531 152 40586
(500 row(s) affected)




SSCertifiable
Group: General Forum Members
Last Login: Friday, October 21, 2016 10:16 PM
Points: 7,491,
Visits: 17,826


Jeff Moden (3/26/2012)
Matt Miller (#4) (3/26/2012)
sknox (3/26/2012) For testing purposes (both scientific and software) pseudorandom numbers are preferable to truly random numbers*, because you want to see how the system responds to the entire range of possible inputs. A truly random number source cannot be trusted to give you a representative sample.
* This is, of course, assuming that the pseudorandom number generator produces uniformlydistributed data. More on that in a bit.
That's a good point to bring up. A random distribution will create a uniform distribution across a range of data, but cannot on its own replicate any nonuniform data patterns. So if you're looking to find out if there's a normal distribution in your data (or any number of other patterns across the set), using random data may not be a good option. This would be one of those big caveats in the "why would you need random data". The random set will allow you to test for behavior of a varity of inputs at the detail level, but won't help with test the set as a whole. Hmmmm... the constraints on range and domain aren't enough to satisfy this problem? Such constraints could actually form a "bell curve" (or whatever) using a CASE statement to "weight" the outcome of the constrained random generator.
That's kind of what I meant by the "on its own" comment. You can use the random data generator to pull in representative data in all allowed ranges, but you would need to play with the frequency or weight based on how far away from the mean you happen to be. Assuming you have some knowledge of your data, you can shape your test data to match, using the random set as a base.
 Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loudspoken enduser..All right  what was my emergency again?




SSCertifiable
Group: General Forum Members
Last Login: Friday, October 21, 2016 8:31 PM
Points: 5,907,
Visits: 10,320





SSC Eights!
Group: General Forum Members
Last Login: Thursday, October 13, 2016 2:33 PM
Points: 993,
Visits: 1,892


That's a good point to bring up. A random distribution will create a uniform distribution across a range of data
I heartily agree. It's been a long time since I studied statistical distributions but a basic understanding of them is hugely useful. It would be great to have a method of generating random data that approximated a distribution, whether it be Gaussian or exponential decay, or an F distribution or whatever.
A common phenomenon is where a column might validly accept one of say 30 integers. The vast majority of the time people will record one of five values and the frequency of recording of the others tapers off, with a few being used exceeding rarely. If you were testing things like index cardinality and column statistics generation, I wonder whether you'd get more representative testing results if your test data could mimic the distribution of what you expected to occur in production.
One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important. Bertrand Russell




SSCForever
Group: General Forum Members
Last Login: Yesterday @ 11:34 PM
Points: 41,633,
Visits: 38,988


Michael Valentine Jones (3/26/2012)
Here is an alternate method that I use to generate the pseudo random numbers. The basic method is to take the right 7 bytes from the NEWID function and convert that to a BIGINT before applying the MODULUS operator. No need for the ABS function, since 7 bytes can only produce a positive BIGINT number. if object_id('tempdb..#t','U') is not null begin drop table #t end
 Generate 20,000,000 rows select top 20000000 NUMBER = identity(int,1,1) into #t from (select top 4473 * from master.dbo.syscolumns) a cross join (select top 4473 * from master.dbo.syscolumns) b
 Show distribution of rowcount around average of 40000 select a.RandomNo, Rows = count(*) from ( select RandomNo = (convert(bigint,convert(varbinary(7),newid()))%500)+1 from #t aa ) a group by a.RandomNo order by count(*), a.RandomNo
RandomNo Rows   335 39455 3 39457 76 39481 426 39489 494 39535 242 39539 278 39539 490 39548 445 39553 244 39566 ... ... ... 124 40400 228 40402 425 40410 286 40434 45 40458 463 40463 373 40531 152 40586
(500 row(s) affected)
Like I said in the article, the conversion to VARBINARY will slow things down and to no good end if you don't really need BIGINT for the random integer. If you really want BIGINT capability (and I realize that wasn't one of your goals in your example), I believe you'd also have to convert the whole NEWID() to VARBINARY.
I also thought you were involved in some testing that showed the use of the square root of the final number of desired rows as a TOP for the self joined table in the Cross Join really wasn't worth it.
The main point that I'm trying to make is that if it's too complicated, folks won't use it.
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems




SSCForever
Group: General Forum Members
Last Login: Yesterday @ 11:34 PM
Points: 41,633,
Visits: 38,988


Matt Miller (#4) (3/26/2012)
Jeff Moden (3/26/2012)
Matt Miller (#4) (3/26/2012)
sknox (3/26/2012) For testing purposes (both scientific and software) pseudorandom numbers are preferable to truly random numbers*, because you want to see how the system responds to the entire range of possible inputs. A truly random number source cannot be trusted to give you a representative sample.
* This is, of course, assuming that the pseudorandom number generator produces uniformlydistributed data. More on that in a bit.
That's a good point to bring up. A random distribution will create a uniform distribution across a range of data, but cannot on its own replicate any nonuniform data patterns. So if you're looking to find out if there's a normal distribution in your data (or any number of other patterns across the set), using random data may not be a good option. This would be one of those big caveats in the "why would you need random data". The random set will allow you to test for behavior of a varity of inputs at the detail level, but won't help with test the set as a whole. Hmmmm... the constraints on range and domain aren't enough to satisfy this problem? Such constraints could actually form a "bell curve" (or whatever) using a CASE statement to "weight" the outcome of the constrained random generator. That's kind of what I meant by the "on its own" comment. You can use the random data generator to pull in representative data in all allowed ranges, but you would need to play with the frequency or weight based on how far away from the mean you happen to be. Assuming you have some knowledge of your data, you can shape your test data to match, using the random set as a base. Ah... understood. Thanks, Matt.
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems




SSCForever
Group: General Forum Members
Last Login: Yesterday @ 11:34 PM
Points: 41,633,
Visits: 38,988


GPO (3/26/2012)
That's a good point to bring up. A random distribution will create a uniform distribution across a range of data I heartily agree. It's been a long time since I studied statistical distributions but a basic understanding of them is hugely useful. It would be great to have a method of generating random data that approximated a distribution, whether it be Gaussian or exponential decay, or an F distribution or whatever. A common phenomenon is where a column might validly accept one of say 30 integers. The vast majority of the time people will record one of five values and the frequency of recording of the others tapers off, with a few being used exceeding rarely. If you were testing things like index cardinality and column statistics generation, I wonder whether you'd get more representative testing results if your test data could mimic the distribution of what you expected to occur in production.
Hmmmm... maybe there needs to be a Part 4 to this series.
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems




SSCForever
Group: General Forum Members
Last Login: Yesterday @ 11:34 PM
Points: 41,633,
Visits: 38,988


WayneS (3/26/2012) Excellent article Jeff.
Nice coincedence today... I went to the site to find how you did this, and here's the article explaining it all.
Thanks for taking the time for this really great article that explains the how and why.
I know I said it before but thank you for the time you spent helping with the review.
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
Helpful Links: How to post code problems How to post performance problems




Hall of Fame
Group: General Forum Members
Last Login: Wednesday, February 24, 2016 6:28 AM
Points: 3,977,
Visits: 6,431




