random records

  • KoldCoffee (3/4/2014)


    Hi Dixie Flatline, I saw this on another post and was trying to understand it, so I appreciate that you explained the code more in this version. I will study it; thanks.

    No problem, KC. Be happy to answer any questions.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (3/4/2014)


    Masterful? Do you need to borrow money or something, Jason?

    Well, since you asked...:-D

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Here is a patch for the original code, which was intended to find a small number of rows (50). It assumes that the key number N in the Million_3 table does not have too many gaps. It gets its speed by avoiding a full table scan and sort.

    Essentially it generates three times the number of randomIDs needed for the result set, then joins that set of randomIDs to the Million_3 table (ensuring that only "hits" are returned), and does a DISTINCT TOP (@rows) to eliminate duplicates. While it's theoretically possible for fewer than @rows rows to be returned, it is so much faster for small values of @rows that a loop could check whether @rows rows were actually returned and rerun the query if necessary.

    The attached code will work with the Million_3 table from yesterday's example.
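
    The attachment itself isn't reproduced here, so the following is only a rough sketch of the approach described above, not the original attached code. It assumes an integer key column N in Million_3, and @rows, @maxN, and the randomIDs CTE are named purely for illustration.

    ---------------------

    DECLARE @rows int = 50;
    DECLARE @maxN int = (SELECT MAX(N) FROM dbo.Million_3);

    -- Generate roughly three times as many candidate IDs as needed,
    -- join them back to the table so only existing keys ("hits") survive,
    -- then take a DISTINCT TOP (@rows) to eliminate duplicates.
    WITH randomIDs AS
    (
        SELECT TOP (@rows * 3)
               ABS(CHECKSUM(NEWID())) % @maxN + 1 AS N
        FROM sys.all_columns   -- any convenient row source with enough rows
    )
    SELECT DISTINCT TOP (@rows) m.*
    FROM randomIDs r
    JOIN dbo.Million_3 m
      ON m.N = r.N;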

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Just a small note: which method you choose may depend on your environment. By that I mean some environments are more disk-bound than CPU-bound, some vice versa, etc.

    If you look at the method I posted back on 12/16/2010 at 11:17:04 PM, you will see that it is slightly slower than the method Dixie posted. However, it is less resource-intensive. Granted, the difference is rather trivial, but perhaps it is of interest to someone. Here are the statistical differences:

    Dixie's:

    Table 'Million_3'. Scan count 5, logical reads 14085, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1980 ms, elapsed time = 822 ms.

    Lamprey's:

    Table 'Million_3'. Scan count 1, logical reads 9418, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 1450 ms, elapsed time = 1476 ms.
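
    (For anyone who wants to reproduce numbers like these: the figures above are the standard output of SQL Server's STATISTICS IO and STATISTICS TIME session settings. A minimal sketch:)

    ---------------------

    SET STATISTICS IO ON;    -- reports scan count and logical/physical reads per table
    SET STATISTICS TIME ON;  -- reports CPU time and elapsed time

    -- run the query being measured here, e.g. either random-row method

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;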

  • Personally, I use a dart board. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Lamprey13 (12/16/2010)


    Here is another method I've seen/used that is pretty similar to the ORDER BY NEWID(), but much faster:

    SELECT TOP 10 * FROM MyTable

    WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SomeColumnName) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

    NOTE: Replace "SomeColumnName" with some column or set of columns from your table; PKs work well.

    I like the CHECKSUM idea because you can use it for non-integer keys. However, there is a caveat.

    This technique would certainly be fast, but it's subject to what I refer to as "frontloading". That is to say, because it takes the first ten rows that match its WHERE condition, there is a tendency to pull rows from the first part of the table in whatever sort order is specified.

    Running the above code against the Million_3 table, it always produced selections taken from the first couple thousand rows. Depending on the data, this may defeat the randomness of the sampling.
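
    (A quick way to see the frontloading for yourself, as a sketch that assumes the integer key column N in Million_3: capture the sampled rows and look at the range of keys they cover.)

    ---------------------

    -- Run the TOP 10 sample into a temp table, then check the key range it spans.
    SELECT TOP 10 *
    INTO   #sample
    FROM   dbo.Million_3
    WHERE  0.01 >= CAST(CHECKSUM(NEWID(), N) & 0x7fffffff AS float) / CAST(0x7fffffff AS int);

    SELECT MIN(N) AS lowest_key, MAX(N) AS highest_key
    FROM   #sample;

    DROP TABLE #sample;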

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Jeff Moden (3/4/2014)


    Personally, I use a dart board. 😛

    How hard do you have to fling a pork chop to make it stick, Jeff? 😀

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline (3/4/2014)


    Lamprey13 (12/16/2010)


    Here is another method I've seen/used that is pretty similar to the ORDER BY NEWID(), but much faster:

    SELECT TOP 10 * FROM MyTable

    WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SomeColumnName) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

    NOTE: Replace "SomeColumnName" with some column or set of columns from your table; PKs work well.

    I like the CHECKSUM idea because you can use it for non-integer keys. However, there is a caveat.

    This technique would certainly be fast, but it's subject to what I refer to as "frontloading". That is to say, because it takes the first ten rows that match its WHERE condition, there is a tendency to pull rows from the first part of the table in whatever sort order is specified.

    Running the above code against the Million_3 table, it always produced selections taken from the first couple thousand rows. Depending on the data, this may defeat the randomness of the sampling.

    Fully agreed.

    As you mentioned, it does work on other data types. However, it does skew things a bit in its "randomness." I did get slightly different results in that it always returned rows over 2 million. So, that may not be random enough for some.

    Just food for thought, depending on people's needs. 🙂

  • Okay, I just had to do some testing.

    If frontloading isn't an issue, you can use the following in place of the CHECKSUM test and get similar results with fewer CPU cycles. However, as Jason pointed out, selection of random rows is often a one-off task. Like you said, it depends on the circumstances of each problem.

    ---------------------

    ;WITH cte AS
    (
        SELECT *, ABS(CHECKSUM(NEWID())) % 1000 AS randomNo
        FROM Million_3
    )
    SELECT TOP 50 *
    FROM cte
    WHERE randomNo < 10;

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • The Dixie Flatline - Friday, December 17, 2010 11:53 AM

    Here is an example of pulling 20 random rows from a million-row table. Note to Jeff Moden: It may not be "Nasty Fast" but it's "Wicked Quick" :-D Had to make this an attachment. For some reason, SSC is blocking my posts.

    Add a random seed to it to make it more random. Use the milliseconds part of the current time as an integer seed. The odds of hitting the exact same millisecond twice are pretty slim, so it starts at a random point in the random number sequence. Computers can't generate truly random numbers; they produce a pseudo-random sequence from a generator, and successive random numbers simply follow that sequence (so not really random). If you start the computer and follow the exact same sequence of calls, you will generate the same random numbers each time. RAND(DATEPART(ms, GETDATE())) will make it start out at a pretty random point in the sequence each time. It's hard to hit the Enter key at the exact same millisecond.
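
    (A minimal sketch of that seeding idea; the @seed variable name is just for illustration.)

    ---------------------

    -- Seed the session's pseudo-random sequence from the current millisecond,
    -- so each run starts at a different point in the sequence.
    DECLARE @seed int = DATEPART(ms, GETDATE());   -- 0-999

    SELECT RAND(@seed) AS seeded_value,   -- first value from the newly seeded sequence
           RAND()      AS next_value;     -- subsequent call continues that sequence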
