Getting Random Results

  • How about this?

    http://askgabbar.blogspot.com/2007/03/pick-up-random-records-from-table.html

    I didn't check it for performance, though.

  • I had a need for pseudo-randomness (displaying a few "random" images from a gallery), which I addressed as follows:

    - Assign a random value to each row as a column value, with a lookup index on that column.

    - Each time I want 10 random images, I grab the TOP 10 ORDER BY ABS(@currentRandom - persistedRandom) ASC. This gives me the 10 rows nearest to whatever my random seek point is. The problem is that you will get clusters of results for a given set of persistedRandom values.

    I'm sure there are better ways, but this worked decently for my needs (a sketch of the idea is below).
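
    A minimal sketch of that approach (the table and column names here are my own placeholders, not the poster's):

    -- Hypothetical gallery table with a persisted random value per row.
    -- RAND(CHECKSUM(NEWID())) yields a different value for each row;
    -- bare RAND() would be evaluated only once per statement.
    CREATE TABLE dbo.Images
    (
        ImageId   int IDENTITY(1,1) PRIMARY KEY,
        FileName  varchar(260) NOT NULL,
        RandomKey float NOT NULL DEFAULT (RAND(CHECKSUM(NEWID())))
    );
    CREATE INDEX IX_Images_RandomKey ON dbo.Images (RandomKey);
    GO

    -- Pick a random seek point, then take the 10 rows nearest to it.
    DECLARE @currentRandom float;
    SET @currentRandom = RAND();

    SELECT TOP 10 ImageId, FileName
    FROM dbo.Images
    ORDER BY ABS(@currentRandom - RandomKey) ASC;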

  • I have also not had time to benchmark performance, but here is another technique that gives the illusion of randomisation:

    ORDER BY SUBSTRING(REVERSE(REPLACE(table.textcolumn, ' ', '')), CONVERT(int, RAND()*10) + 1, 1) ASC

    This takes the text in a column from the query, reverses it and removes spaces, and then extracts the character at a randomly selected position to order the results. Because RAND() is evaluated once per query, the same position is used for every row. (The + 1 keeps the position in the 1-based range SUBSTRING expects; without it, RAND()*10 can yield position 0, which returns an empty string for every row and defeats the ordering.)

    Obviously this is limited to situations where you have a text column to work with...
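
    For context, a complete query using this trick might look like the following (the table and column names are placeholders of mine):

    -- Order rows pseudo-randomly by the character at a fixed-but-random
    -- position in a text column. Rows whose stripped text is shorter than
    -- the chosen position return '' from SUBSTRING and sort together.
    SELECT TOP 10 ProductId, ProductName
    FROM dbo.Products
    ORDER BY SUBSTRING(REVERSE(REPLACE(ProductName, ' ', '')),
                       CONVERT(int, RAND() * 10) + 1, 1) ASC;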

  • How do you calculate the reads?

  • Minor point... at the end of the article, where you say you could set the uniqueID column as a pkey, I think you mean to set it as a clustered index, thus allowing rows to be returned in GUID order without any sorting effort. It's common to mix up the pkey and the clustered index of a table, as they tend to be one and the same if the defaults are accepted (a quick illustration below).
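
    To illustrate the distinction (a hypothetical table, not from the article):

    -- The primary key and the clustered index need not coincide.
    CREATE TABLE dbo.Demo
    (
        Id      int IDENTITY(1,1) NOT NULL,
        RowGuid uniqueidentifier NOT NULL DEFAULT NEWID(),
        CONSTRAINT PK_Demo PRIMARY KEY NONCLUSTERED (Id)
    );

    -- Clustering on the GUID stores rows in GUID order, so scanning the
    -- table returns them in that order without a sort operator.
    CREATE UNIQUE CLUSTERED INDEX CX_Demo_RowGuid ON dbo.Demo (RowGuid);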

    It is tricky getting random data in a set-based fashion. Perhaps you could do something along the lines of:

    1. Having a unique row number for each row in your table, without gaps - an identity column would be perfect

    2. Indexing this column, either clustered or with at least a covering index

    3. Using some mechanism to generate row numbers to select - perhaps the pairs of digits in a generated GUID (or triplets, quads, etc., depending on how many rows you have)

    Step 3 might be the lengthy & wasteful bit, though... will give it some more thought. A rough sketch of the idea follows.
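
    A minimal sketch of steps 1-3, assuming a gap-free identity column Id (the table name is a placeholder) and substituting ABS(CHECKSUM(NEWID())) for the GUID digit pairs in step 3:

    -- Steps 1 & 2: a gap-free identity column, indexed (here as the
    -- clustered primary key):
    -- CREATE TABLE dbo.BigTable (Id int IDENTITY(1,1) PRIMARY KEY, ...);

    DECLARE @rows int;
    SELECT @rows = COUNT(*) FROM dbo.BigTable;

    -- Step 3: pick 10 random Ids in one set-based statement. Duplicates
    -- are possible, so this can return fewer than 10 rows; each match is
    -- satisfied by an index seek rather than a full sort.
    SELECT t.*
    FROM dbo.BigTable AS t
    WHERE t.Id IN (SELECT TOP 10 1 + ABS(CHECKSUM(NEWID())) % @rows
                   FROM dbo.BigTable);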

     

  • When I first started to read this, I thought "Jeez... that's a little obvious, Andy.  Hardly worth an article."   Now that I see how some folks are trying to randomize data, I have to admit that such a short and sweet article about it is well overdue.

    I don't believe that anyone will come up with a method that is either faster or more consistently random than just doing the ORDER BY NEWID()... but it is fun to watch folks try.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Using "ORDER BY CHECKSUM(NEWID())" rather than "ORDER BY NEWID()" apparently produces values with a better random distribution.

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Ian,

    I think this is a good approach, and it's actually what some coworkers and I ended up with a while back (more them than me). Our problem statement involved noncontiguous ranges, so using an identity column was not sufficient. The solution was to use the ROW_NUMBER() ranking function and join against a temp table of random values.

    I find that with 3.3 million rows, this approach uses about half the CPU and runs about twice as fast as ORDER BY NEWID() in this contrived test. Subsequent executions use about 20% of the CPU of ORDER BY NEWID(), and are still about twice as fast.

    So when you have many rows and a noncontiguous key to index on, this approach appears to be superior, if not as simple. The test script is below.

    -- BEGIN TEST PREP
    create table test_table (number int primary key clustered, payload varchar(1));

    WITH
    -- first CTE, which returns 10 rows (0-9)
    digits AS (
        SELECT 0 AS Number
        UNION SELECT 1
        UNION SELECT 2
        UNION SELECT 3
        UNION SELECT 4
        UNION SELECT 5
        UNION SELECT 6
        UNION SELECT 7
        UNION SELECT 8
        UNION SELECT 9
    )
    -- second CTE, which returns 10 million rows by using
    -- a CROSS JOIN on the first CTE
    , dig AS (
        SELECT (millions.Number * 1000000)
             + (hThousands.Number * 100000)
             + (tThousands.Number * 10000)
             + (thousands.Number * 1000)
             + (hundreds.Number * 100)
             + (tens.Number * 10)
             + ones.Number AS Number
        FROM digits AS ones
        CROSS JOIN digits AS tens
        CROSS JOIN digits AS hundreds
        CROSS JOIN digits AS thousands
        CROSS JOIN digits AS tThousands
        CROSS JOIN digits AS hThousands
        CROSS JOIN digits AS millions
    )
    INSERT test_table (number, payload)
    SELECT number, 'z'
    FROM dig
    WHERE number % 3 = 0;
    go
    -- END TEST PREP

    set statistics time on
    set statistics io on

    --BEGIN TEST 1
    select top 10 number, payload from test_table order by newid();
    GO 2
    --END TEST 1

    --BEGIN TEST 2
    DECLARE @TV table (rand_num int)
    DECLARE @max int, @num int
    select @num = 10, @max = count(*) from test_table;

    DECLARE @i int
    SET @i = 0
    WHILE (@i < @num)
    BEGIN
        INSERT @TV values (ceiling(rand() * @max))
        SET @i = @i + 1
    END

    SELECT *
    FROM (
        select row_number() over (order by number) row, *
        from test_table
    ) a
    WHERE a.row in (SELECT rand_num from @TV)
    GO 2
    --END TEST 2
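
    As an aside, the WHILE loop that fills @TV could be replaced with a single set-based insert - a sketch of mine, not part of the original test:

    -- Set-based alternative to the loop: draw @num random row numbers in
    -- one statement. Duplicates are still possible, as in the loop version.
    INSERT @TV (rand_num)
    SELECT TOP (@num) 1 + ABS(CHECKSUM(NEWID())) % @max
    FROM test_table;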

  • Jeff, I'll grant it is a simple topic - I figured by now I'd get clobbered by some statistics major for even using the word random! Maybe someone with a stats background could analyze all the options posed here to see which is the most (or best) random? :-)

  • I actually needed to use this on something, so I've just spent an hour playing around with it.

    Here's some testing I've just done on a Messages table I have, which has just over a million rows.

    It compares the NEWID() method, the CHECKSUM(NEWID()) method, and a third method I've just come up with (heavily based on the 'Important' note in this link: http://msdn2.microsoft.com/en-us/library/ms189108.aspx).

    Sadly, I've not had time to play around with the other methods suggested here, and I'm just going to go with Method 3 for the time being (for my data, it's more than twice as fast as NEWID() on average, and up to 6 times as fast for small sample sizes).

    Here's the code I used, with the results underneath...

    --table to hold results
    CREATE TABLE #Results (NumberOfTableRows INT, NumberOfRows INT, ProbabilityOfEachRow FLOAT,
        Method1_TimeInMs INT, Method2_TimeInMs INT, Method3_TimeInMs INT)

    DECLARE @Counter TINYINT
    SET @Counter = 0
    WHILE @Counter <= 10 --perform the comparison this many times
    BEGIN
        DECLARE @NumberOfRows INT
        SET @NumberOfRows = POWER(10, RAND()*5) --random sample size to test with

        DECLARE @d DATETIME, @t1 INT, @t2 INT, @t3 INT
        SET @d = GETDATE();

        --Method 1
        SELECT TOP (@NumberOfRows) MessageId FROM dbo.Messages ORDER BY NEWID()

        SET @t1 = DATEDIFF(ms, @d, GETDATE()); SET @d = GETDATE();

        --Method 2
        SELECT TOP (@NumberOfRows) MessageId FROM dbo.Messages ORDER BY CHECKSUM(NEWID())

        SET @t2 = DATEDIFF(ms, @d, GETDATE()); SET @d = GETDATE();

        --Method 3
        DECLARE @NumberOfTableRows INT, @ProbabilityOfEachRow FLOAT
        SELECT @NumberOfTableRows = COUNT(*) FROM dbo.Messages
        SET @ProbabilityOfEachRow = (1 + SQRT(1.0/@NumberOfRows)*10) * @NumberOfRows / @NumberOfTableRows --weighted overestimate

        --select required number from the overestimated sample
        SELECT TOP (@NumberOfRows) MessageId FROM dbo.Messages
        WHERE @ProbabilityOfEachRow >= CAST(CHECKSUM(NEWID(), MessageId) & 0x7fffffff AS FLOAT) / CAST(0x7fffffff AS INT)

        SET @t3 = DATEDIFF(ms, @d, GETDATE())

        --Insert results
        INSERT #Results SELECT @NumberOfTableRows, @NumberOfRows, @ProbabilityOfEachRow, @t1, @t2, @t3

        SET @Counter = @Counter + 1
    END

    --select results and tidy up
    SELECT *, NumberOfTableRows * ProbabilityOfEachRow AS NumberOfRows_OverEstimate FROM #Results
    SELECT AVG(Method1_TimeInMs) AS 'Method1', AVG(Method2_TimeInMs) AS 'Method2', AVG(Method3_TimeInMs) AS 'Method3' FROM #Results
    --DROP TABLE #Results

     

    /*
    NumberOfTableRows NumberOfRows ProbabilityOfEachRow   Method1_TimeInMs Method2_TimeInMs Method3_TimeInMs NumberOfRows_OverEstimate
    ----------------- ------------ ---------------------- ---------------- ---------------- ---------------- -------------------------
    1079894           7365         0.00761481754808218    2326             1796             1406             8223.19578126865
    1079894           49655        0.0480448452758635     2940             2420             1580             51883.3401443334
    1079894           16593        0.0165582345926519     2373             1860             1513             17881.1381871972
    1079894           63666        0.0612923233001256     3220             2626             1653             66189.2121778658
    1079894           15           4.97547291325635E-05   2143             1546             530              53.7298334618805
    1079894           18703        0.0185857029646503     2486             1843             1530             20070.589117308
    1079894           12255        0.0123734579816154     2266             1763             1406             13362.0230335986
    1079894           8403         0.00863017924625119    2220             1703             1450             9319.67878695118
    1079894           10           3.85433909269649E-05   2283             1576             390              41.6227766016838
    1079894           1            1.01861849403738E-05   2190             1593             376              11
    1079894           6            2.82387877215567E-05   2153             1593             266              30.4948974277828

    (11 row(s) affected)

    --Average time in ms

    Method1     Method2     Method3
    ----------- ----------- -----------
    2418        1847        1100

    (1 row(s) affected)
    */

     

     

    Ryan Randall

    Solutions are easy. Understanding the problem, now, that's the hard part.

  • Ryan,

    Approach #3 is very interesting, but I can't imagine it outperforms the solution I posted on page 1 of this thread. If you end up testing that one as well, I would be very curious to see how it compares.

  • You know me, Ryan... extraordinary claims require extraordinary proof... got code to prove that?

    --Jeff Moden



  • Heh... I'm no stats ninja, either... but it's a heck of a lot more random than the other solutions I've seen, with fairly even and unpredictable results over both small and large ranges of data.

    I've tested it about a gazillion times in the past... I'm happy with it... someone else can write some code to prove otherwise.

    --Jeff Moden



  • Dang... sorry about that... I responded to the first page and didn't look on the 2nd page... sorry Ryan.

    --Jeff Moden


