Generating Non-uniform Random Numbers with SQL

  • Very nice and informative Dwain Camps πŸ™‚


    Madhivanan

    Failing to plan is Planning to fail

  • Sometimes I forget how low I am on the scale of intellect. Thanks for reminding me where I stand.

  • CELKO (7/4/2012)


    Beautiful piece of work! I have needed thius for along time and shodul drop it into one of my books.

    A million years ago, DEC was a conpany that made early mini-computers and they had a laboratory/scientific group. One of the optional pieces of hardware for the PDP-10 was a circuit card with a speck of something radioactive used to generate random numbers.

    Of course it had the yelliow & black trefoil warning symbol on it. That was not a good marketing feature πŸ™‚

    <CRASH>The sound you just heard was me falling out of my chair!</CRASH>

    Now that I've picked myself up and taken a moment to compose myself, I truly appreciate the compliment. Having read many of your posts to this forum, I'd say it most certainly is... different!

    If you do happen to use it in one of your books, a credit would be nice. 😎

    Thanks again!


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Madhivanan and Jason,

    Thanks for your interest and I hope that you find something in the work that will serve you well.

    And GPO - thanks for stopping by again. The fish is just a fish. Poisson would probably be relieved.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • In reviewing my own work, I found a few corrections that should be considered:

    Errata

    --------

    1. The reported PopulationMean for the Uniform Distribution is shown as 4. The correct value of 0.5 is specified in the preceding text. Must have been my fat fingers or something that caused that but that error may also occur in the attached SQL.

    2. There are a couple of variables commented out in the test harness, which I was using to test a POISSON distribution. Since that distribution couldn't be made to work in a FUNCTION context, these can be ignored.

    3. In the RN_MULTINOMIAL FUNCTION, the end-point (if the URN is 1) may not always generated the desired result. Since RAND() always delivers a URN on the open interval {0,1}, meaning that 1 will never be the URN delivered, this should not pose an issue. However if you use some other means to generate URNs on a closed interval you may need to adjust that.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Impressive article, Dwain!

    Thanks for sharing.

    -- Gianluca Sartori

  • Gianluca Sartori (7/5/2012)


    Impressive article, Dwain!

    Thanks for sharing.

    Gianluca - Thanks for the positive comments and I hope the effort may be useful some day!


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (7/4/2012)


    ... The fish is just a fish. Poisson would probably be relieved.

    That's no ordinary fish Dwain, that's a biggun, it stands out from the crowd - like your most excellent article. I had no idea that random numbers could be shaped like this.

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • GPO (7/3/2012)


    Don't suppose that picture of the fish is a veiled reference to poisson?

    Whatever the case, I'll just bet it scales well. πŸ˜›

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/6/2012)


    GPO (7/3/2012)


    Don't suppose that picture of the fish is a veiled reference to poisson?

    Whatever the case, I'll just bet it scales well. πŸ˜›

    NOOOOOooooooooo!!!

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • NICE WORK DWAIN! This is more about random numbers that I thought I'd ever learn. I'm going to refer folks to it in Part 3 of the "Generating test data" series.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/6/2012)


    GPO (7/3/2012)


    Don't suppose that picture of the fish is a veiled reference to poisson?

    Whatever the case, I'll just bet it scales well. πŸ˜›

    If you guys are going to make fun of my fish, you'll force me to tell you the story about the one that got away! πŸ˜›

    Seriously, thanks Jeff and Chris for the kind reviews.

    Jeff- Of course I'd be honored for you to give this article a mention in your upcoming Generating Test Data article. Who knows? If I get mentioned by you and Mr. Celko often enough, perhaps some of your fame and glory might rub off on me. πŸ˜€


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Nice article. It's great to see some real maths in SQL.

    For most practical cases, there's a reasonable way of getting a Poisson distributed random variable; not if ? is too large (but even the algorithm in Knuth is pretty awful if ? is too large). It's the inverse transform method. The CDF for the Poisson distribution is known analytically, is discrete, and can be calculated efficiently provided a memo function is used for it (ie provided there is a table - which can initially be empty, maybe it's a temp table - recording values up to the highest point calculated so far; for this application that table needs to have the cumulative value as the clustering key, and the corresponding variable value as the other column). That means that only a single URN is needed to deliver a Poisson RN, which is given by taking variable value from the row with the least CDF greater than or equal to the URN parameter to the function (extending the table by computing extra rows if needed).

    For cases where the inverse transform method is unsuitable (excessively large values of lambda) it's probably good enough to approximate the poisson distribution as a normal distribution; if that isn't good enough, I can't imagine what will be - a very high precision real type would increase the range at which inverse transform or Knuth would work well, but most languages don't have such a type.

    Tom

  • L' Eomot InversΓ© (7/16/2012)


    Nice article. It's great to see some real maths in SQL.

    For most practical cases, there's a reasonable way of getting a Poisson distributed random variable; not if ? is too large (but even the algorithm in Knuth is pretty awful if ? is too large). It's the inverse transform method. The CDF for the Poisson distribution is known analytically, is discrete, and can be calculated efficiently provided a memo function is used for it (ie provided there is a table - which can initially be empty, maybe it's a temp table - recording values up to the highest point calculated so far; for this application that table needs to have the cumulative value as the clustering key, and the corresponding variable value as the other column). That means that only a single URN is needed to deliver a Poisson RN, which is given by taking variable value from the row with the least CDF greater than or equal to the URN parameter to the function (extending the table by computing extra rows if needed).

    For cases where the inverse transform method is unsuitable (excessively large values of lambda) it's probably good enough to approximate the poisson distribution as a normal distribution; if that isn't good enough, I can't imagine what will be - a very high precision real type would increase the range at which inverse transform or Knuth would work well, but most languages don't have such a type.

    Tom,

    Thanks for the insight into Poisson's distribution! If I ever get around to a part II, I'll certainly look up the inverse transform method you've described.

    I'm a little surprised at the suggestion to use a Normal distribution as an approximation. I always thought the exponential distribution was closer, however I say this without doing any research into the matter.

    Thanks for dropping by and taking an interest. I agree, that more math-oriented SQL needs to be surfaced to assist engineers and scientists that need to rely on their own devices for the most part.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (7/16/2012)


    I'm a little surprised at the suggestion to use a Normal distribution as an approximation. I always thought the exponential distribution was closer, however I say this without doing any research into the matter.

    Exponential may be a fair approximation for the right hand tail of the distribution (I'm not sure what it's like there, but I suspect it's fairly poor) but it is absolutely hopeless for the left hand part.The exponential approximation grossly overestimates the probability of any value below the median, so an attempt to generate values with a Poisson distribution using the exponential approximation would generate too many values below ? and too few above, and the first differential of the frequency distribution for the generated values below ? would have the wrong sign at every point. And ?, as well as being the mean, is close to the median too - (median-?) lies bewen -0.7 and +1/3 - so the area with the inverted gradient would be about half the target distribution and a lot more than half of teh generated distribution.

    Tom

Viewing 15 posts - 16 through 30 (of 32 total)

You must be logged in to reply to this topic. Login to reply