Getting Random Results

Question

Post reply

Getting Random Results

Andy Warren

SSC Guru

Points: 119902
More actions
August 9, 2007 at 11:53 am

#153734

Comments posted here are about the content posted at http://www.sqlservercentral.com/columnists/awarren/3175.asp
Andy
SQLAndy - My Blog!
Connect with me on LinkedIn
Follow me on Twitter

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply

~Prasad Puranik SSC Veteran Points: 259 More actions · Answer 1

How about this?

http://askgabbar.blogspot.com/2007/03/pick-up-random-records-from-table.html

Didn't check it for performance though.

Adrian Hains Ten Centuries Points: 1322 More actions · Answer 2

I had a need for pseudo randomness (display a few "random" images from a gallery) that I addressed thusly:

-Assign a random value to each row as a column value and have a lookup index on this.

-Each time I want 10 random images, I grab TOP 10 order by abs(@currentRandom - persistedRandom) ASC. So this gives me the nearest 10 rows to whatever my random seek point is. The problem is you will get clusters of results for a given set of persistedRandom values.

I'm sure there are better ways but this worked decent for my needs.

bredins Mr or Mrs. 500 Points: 546 More actions · Answer 3

I have also not had time to benchmark performance, but here is another technique that gives the illusion of randomisation:

ORDER

BY SUBSTRING(REVERSE(REPLACE(table.textcolumn,' ','')),convert(int,rand()*10),1) ASC

This uses the text in a column from the query, reverses it and removes spaces, and then extracts a randomly selected character to order the results.

Obviously this is limited to situations where you have a text column to work with...

james zhang-339284 SSC Rookie Points: 47 More actions · Answer 4

james zhang-339284

SSC Rookie

Points: 47

August 23, 2007 at 6:50 am

#728880

how do you calculate the reads?

Ian Yates SSCoach Points: 19738 More actions · Answer 5

Minor point... At the end of the article where you say you could set the uniqueID column as a pkey I think you mean to set it as a clustered index thus allowing returning of rows in the GUID order without any sorting effort. Common to mix up the pkeys & the clustered index of a table as they often tend to be one and the same if the defaults are accepted

It is tricky getting random data in a set-based fashion. Perhaps you could do something along the lines of

1. Having a unique row number for each row in your table without gaps - an identity column would be perfect

2. Indexing by this column either clustered or using at least a covering index

3. Use some mechanism to generate row numbers to select. Perhaps the pairs of digits in a generated GUID (or triplets, quads, etc depending on how many rows you have)....

Step #3 might be the lengthy & wasteful bit though..... Will give it some more thought

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 6

When I first started to read this, I thought "Jeez... that's a little obvious, Andy. Hardly worth an article." Now that I see how some folks are trying to randomize data, I have to admit that such a short and sweet article about it is well overdue.

I don't believe that anyone will come up with a method that is either faster or more consistently random than just doing the ORDER BY NEWID()... but it is fun to watch folks try.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

RyanRandall SSChampion Points: 13623 More actions · Answer 7

Using "ORDER BY CHECKSUM(NEWID())" rather than "ORDER BY NEWID()" apparently produces values with a better random distribution.

Ryan Randall

Solutions are easy. Understanding the problem, now, that's the hard part.

Adrian Hains Ten Centuries Points: 1322 More actions · Answer 8

Ian,

I think this is a good approach and it is actually what some coworkers and I ended up with awhile back (more them than me ). Our problem statement involved noncontiguous ranges, so using an identity column was not sufficient. The solution was to use row_number() ranking function and join against a temp table of random values.

I find that with 3.3 MM rows, this approach is about half the cpu and twice as fast as doing order by newid() for this contrived test. Subsequent executions take about 20% of the cpu that order by newid(), and still about twice as fast.

So when you have many rows and a noncontiguous key to index off of, this approach appears to be superior if not as simple.

-- BEGIN TEST PREP

create table test_table (number int primary key clustered, payload varchar(1));

WITH

-- first CTE which returns 10 rows (0-9)

digits AS (

SELECT 0 as Number

UNION SELECT 1

UNION SELECT 2

UNION SELECT 3

UNION SELECT 4

UNION SELECT 5

UNION SELECT 6

UNION SELECT 7

UNION SELECT 8

UNION SELECT 9

)

-- second CTE which returns 10 million rows by using

-- a CROSS JOIN on the first CTE

, dig AS (

SELECT

(millions.Number * 1000000)

+ (hThousands.Number * 100000)

+ (tThousands.Number * 10000)

+ (thousands.Number * 1000)

+ (hundreds.Number * 100)

+ (tens.Number * 10)

+ ones.Number AS Number

FROM digits AS ones

CROSS JOIN digits AS tens

CROSS JOIN digits AS hundreds

CROSS JOIN digits AS thousands

CROSS JOIN digits AS tThousands

CROSS JOIN digits AS hThousands

CROSS JOIN digits as millions

)

INSERT test_table(number, payload)

SELECT number, 'z'

FROM dig

WHERE number % 3 = 0

go

--END TEST PREP

set statistics time on

set statistics io on

--BEGIN TEST 1

select top 10 number, payload from test_table order by newid();

GO 2

--END TEST 1

--BEGIN TEST 2

DECLARE @TV table(rand_num int)

DECLARE @max-2 int, @num int

select @num=10, @max-2 = count(*) from test_table;

DECLARE @i int

SET @i = 0

WHILE (@i < @num)

BEGIN

INSERT @TV values (ceiling(rand() * @max-2))

SET @i = @i + 1

END

SELECT *

FROM

(

select

row_number() over (order by number) row,

*

from test_table

) a

WHERE a.row in (SELECT rand_num from @TV)

GO 2

--END TEST 2

Andy Warren SSC Guru Points: 119902 More actions · Answer 9

Jeff, I'll grant it is a simple topic - I figured by now I'd get clobbered by some statistics major for even using the word random! Maybe someone with a stats background could analyze all the options posed to see which is most random/best random?:-)

Andy
SQLAndy - My Blog!
Connect with me on LinkedIn
Follow me on Twitter

RyanRandall SSChampion Points: 13623 More actions · Answer 10

I actually needed to use this on something, so I've just spent an hour playing around with it.

Here's some testing I've just done on a Messages table I have which has just over a millon rows.

It compares the NEWID() method, the CHECKSUM(NEWID()) method, and a 3rd method I've just come up with (which is heavily based on the 'important' note in this link: http://msdn2.microsoft.com/en-us/library/ms189108.aspx).

Sadly, I've not had time to play around with the other methods suggested here, and I'm just going to go for Method 3 for the time being (for my data, it's more than twice as fast as newid() on average, and up to 6 times as fast for small sample sizes).

Here's the code I used, with the results underneath...

--table to hold results

CREATE TABLE #Results (NumberOfTableRows INT, NumberOfRows INT, ProbabilityOfEachRow FLOAT,

Method1_TimeInMs INT, Method2_TimeInMs INT, Method3_TimeInMs INT)

DECLARE @Counter TINYINT

SET @Counter = 0

WHILE @Counter <= 10 --perform the comparison this many times

BEGIN

DECLARE @NumberOfRows INT

SET @NumberOfRows = power(10, rand()*5) --random sample size to test with

--

DECLARE @d DATETIME, @t1 INT, @t2 INT, @t3 INT

SET @d = GETDATE();

--Method 1

SELECT TOP (@NumberOfRows) MessageId FROM dbo.Messages ORDER BY NEWID()

--

SET @t1 = DATEDIFF(ms, @d, GETDATE()); SET @d = GETDATE();

--Method 2

SELECT TOP (@NumberOfRows) MessageId FROM dbo.Messages ORDER BY CHECKSUM(NEWID())

--

SET @t2 = DATEDIFF(ms, @d, GETDATE()); SET @d = GETDATE();

--Method 3

DECLARE @NumberOfTableRows INT, @ProbabilityOfEachRow FLOAT

SELECT @NumberOfTableRows = COUNT(*) from dbo.Messages

SET @ProbabilityOfEachRow = (1+SQRT(1.0/@NumberOfRows)*10) * @NumberOfRows / @NumberOfTableRows --weighted overestimate

SELECT TOP (@NumberOfRows) MessageId FROM Messages --select required number from overestimated sample

WHERE @ProbabilityOfEachRow >= CAST(CHECKSUM(NEWID(), MessageId) & 0x7fffffff AS FLOAT) / CAST (0x7fffffff AS INT)

--

SET @t3 = DATEDIFF(ms, @d, GETDATE())

--Insert results

INSERT #Results SELECT @NumberOfTableRows, @NumberOfRows, @ProbabilityOfEachRow, @t1, @t2, @t3

--

SET @Counter = @Counter + 1

END

--select results and tidy up

SELECT *, NumberOfTableRows * ProbabilityOfEachRow as NumberOfRows_OverEstimate FROM #Results

SELECT AVG(Method1_TimeInMs) AS 'Method1', AVG(Method2_TimeInMs) AS 'Method2', AVG(Method3_TimeInMs) AS 'Method3' FROM #Results

--DROP TABLE #Results

/*

NumberOfTableRows NumberOfRows ProbabilityOfEachRow Method1_TimeInMs Method2_TimeInMs Method3_TimeInMs NumberOfRows_OverEstimate

----------------- ------------ ---------------------- ---------------- ---------------- ---------------- -------------------------

1079894 7365 0.00761481754808218 2326 1796 1406 8223.19578126865

1079894 49655 0.0480448452758635 2940 2420 1580 51883.3401443334

1079894 16593 0.0165582345926519 2373 1860 1513 17881.1381871972

1079894 63666 0.0612923233001256 3220 2626 1653 66189.2121778658

1079894 15 4.97547291325635E-05 2143 1546 530 53.7298334618805

1079894 18703 0.0185857029646503 2486 1843 1530 20070.589117308

1079894 12255 0.0123734579816154 2266 1763 1406 13362.0230335986

1079894 8403 0.00863017924625119 2220 1703 1450 9319.67878695118

1079894 10 3.85433909269649E-05 2283 1576 390 41.6227766016838

1079894 1 1.01861849403738E-05 2190 1593 376 11

1079894 6 2.82387877215567E-05 2153 1593 266 30.4948974277828

(11 row(s) affected)

--Average time in ms

Method1 Method2 Method3

----------- ----------- -----------

2418 1847 1100

(1 row(s) affected)

*/

Ryan Randall

Solutions are easy. Understanding the problem, now, that's the hard part.

Adrian Hains Ten Centuries Points: 1322 More actions · Answer 11

Ryan,

Approach #3 is very interesting, but I can't imagine it outperforms the solution I posted on page 1 of this thread. If you end up testing that one out as well I would be very curious to see how it compares.

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 12

You know me, Ryan... extraordinary claims require extraordinary proof... got code to prove that?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 13

Heh... I'm no stats Ninja, either... but it's a heck of a lot more random with fairly even and unpredictable results over both small and large ranges of data than the other solutions I've seen.

I've tested it about a gazillion times in the past... I'm happy with it.. someone else can write some code to prove otherwise

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003857 More actions · Answer 14

Dang... sorry about that... I responded to the first page and didn't look on the 2nd page... sorry Ryan.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)