Neil Franken (3/1/2010)
Cool article but here is what I have found with newid().
Right lets say the profile(salary) matches 1,5 million rows in the database. SQL server will return all 1,5 million rows then ORDER BY. Now I mentioned that a GUID is very very random. This causes a high cpu load on the server as the poor Server now has to first sort the 1,5 million rows then it can return the top 1000. Think about it. It first has to sort before it can return. I have tested this and it does not matter if I return 1 row or 750 000 out of the 1,5 million rows that matches the query it constantly runs at the same speed. The top can only be applied once the sorting is done. Granted for small tables and non mission critical queries this technique can work well I would not use it on large tables as you will create a bottle neck.
One option, especially if you have an indexed identity column on your source table, is to generate a separate table of random row numbers, create a clustered index on it, and join with the original table.
create table #lookup_table(row_num int)
declare @ctr int, @samplesize int
set @ctr = 0
set @samplesize = 1000 -- for example, a sample size of 1,000 is needed
while @ctr < @samplesize
insert into #lookup_table select abs(checksum(newid()))
set @ctr = @ctr + 1
create clustered index idxc on #lookup_table(row_num)
Do a join on this table and it should go much more quickly, so the entire original table would not be loaded. Not much chance there will be duplicate rows with this method as INT can be up to 2,147,483,647.
"In theory, theory and practice are the same. In practice, they are not."
- Albert Einstein