Randomizing Result Sets with NEWID

Question

seth delconte

SSCertifiable

Points: 6388
February 27, 2010 at 11:44 am

#217123

Comments posted to this topic are about the item Randomizing Result Sets with NEWID
_________________________________
seth delconte
http://sqlkeys.com


Norman Rasmussen Ten Centuries Points: 1055 More actions · Answer 1

For the purpose of the article, you probably want to select all customers with order_count >= min(order_count of top 10). That way you don't exclude customers that sort 'later' and might otherwise get excluded.
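
A minimal sketch of that suggestion, assuming a hypothetical temp table #customer_orders with columns customer_id and order_count (the article's real table and column names may differ), and a cutoff of 3 rather than 10 to keep the sample data small:

```sql
-- Hypothetical sample data; names are illustrative, not taken from the article.
CREATE TABLE #customer_orders (
    customer_id int NOT NULL PRIMARY KEY,
    order_count int NOT NULL
);

INSERT INTO #customer_orders (customer_id, order_count) VALUES (1, 9);
INSERT INTO #customer_orders (customer_id, order_count) VALUES (2, 7);
INSERT INTO #customer_orders (customer_id, order_count) VALUES (3, 5);
INSERT INTO #customer_orders (customer_id, order_count) VALUES (4, 5);
INSERT INTO #customer_orders (customer_id, order_count) VALUES (5, 2);

-- Keep every customer whose order_count is at least the lowest count
-- found in the top 3, so customers 3 and 4 (tied on 5) both qualify.
SELECT customer_id, order_count
FROM #customer_orders
WHERE order_count >= (
    SELECT MIN(order_count)
    FROM (
        SELECT TOP (3) order_count
        FROM #customer_orders
        ORDER BY order_count DESC
    ) AS top_n
)
ORDER BY NEWID();   -- shuffle the qualifying customers

DROP TABLE #customer_orders;
```

With this data the query returns customers 1 to 4 in a random order. SELECT TOP (3) WITH TIES ... ORDER BY order_count DESC should select the same set of rows with less typing.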

Neil Franken SSC-Addicted Points: 431 More actions · Answer 2

Hi There

Cool article but here is what I have found with newid().

Doing an ORDER BY NEWID() is a performance killer. Let me explain. We have a table of 36 million prospective customers, and we send leads to sales agent centres daily. They want random data, so we have been using ORDER BY NEWID() for ages. The problem is that it is extremely slow. Painfully slow. Here is why. A GUID returned by NEWID() is essentially a very random number. Keep this fact in mind. In my daily tasks we send various numbers of leads out of our system to different centres. Basically we have a query like this (I simplified the SELECT for readability).

SELECT TOP 1000 leadname, contactdetails -- The TOP is variable per call centre
FROM Prospects
WHERE Salary >= 2500 -- we match our prospect profile here
ORDER BY NEWID() -- random order; this is the expensive part

Right, let's say the profile (salary) matches 1.5 million rows in the database. SQL Server will retrieve all 1.5 million rows and then apply the ORDER BY. Now, I mentioned that a GUID is very, very random. This causes a high CPU load, as the poor server first has to sort the 1.5 million rows before it can return the top 1000. Think about it: it has to sort before it can return anything. I have tested this, and it does not matter whether I return 1 row or 750,000 of the 1.5 million rows that match the query; it consistently runs at the same speed. The TOP can only be applied once the sorting is done. Granted, for small tables and non-mission-critical queries this technique can work well, but I would not use it on large tables, as you will create a bottleneck.

For larger tables it might be worth randomizing the data on insert, so that you do not have to worry about randomization during extraction. By the way, using NEWID() as the clustered key of a table is not a good idea, as the fragmentation of your tables will remain consistently high.
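
One possible reading of "randomizing on insert", as a rough sketch only: store a random integer with each row when it is inserted, index it, and read forward from a random starting point. The RandomKey column, the index, the temp table and the sample rows are all assumptions made for illustration; they are not part of the schema described above.

```sql
-- Hypothetical table; leadname, contactdetails and Salary follow the simplified query above.
CREATE TABLE #Prospects (
    ProspectId     int IDENTITY(1, 1) NOT NULL PRIMARY KEY,
    leadname       varchar(100) NOT NULL,
    contactdetails varchar(200) NOT NULL,
    Salary         int NOT NULL,
    -- Assumption: a random sort key assigned once, when the row is inserted.
    RandomKey      int NOT NULL DEFAULT (CHECKSUM(NEWID()))
);

CREATE NONCLUSTERED INDEX IX_Prospects_RandomKey
    ON #Prospects (RandomKey)
    INCLUDE (leadname, contactdetails, Salary);

INSERT INTO #Prospects (leadname, contactdetails, Salary) VALUES ('Lead A', 'a@example.com', 3000);
INSERT INTO #Prospects (leadname, contactdetails, Salary) VALUES ('Lead B', 'b@example.com', 2600);
INSERT INTO #Prospects (leadname, contactdetails, Salary) VALUES ('Lead C', 'c@example.com', 1800);
INSERT INTO #Prospects (leadname, contactdetails, Salary) VALUES ('Lead D', 'd@example.com', 4100);

-- Pick a random starting point and read forward along the index.
-- The index is already ordered by RandomKey, so no large sort is needed.
DECLARE @start int;
SET @start = CHECKSUM(NEWID());

SELECT TOP (2) leadname, contactdetails
FROM #Prospects
WHERE Salary >= 2500
  AND RandomKey >= @start
ORDER BY RandomKey;

DROP TABLE #Prospects;
```

This can return fewer rows than requested when the starting point lands near the top of the key range, so a real version would need to wrap around to the lowest keys. Repeated calls also return runs of neighbouring keys, which may or may not be random enough for a given purpose.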

Hope that helps.

Regards

adish SSChasing Mays Points: 634 More actions · Answer 3

I've been using GUID as PKs, but this is a novel way of using it. Great.

gary-915906 Valued Member Points: 74 More actions · Answer 4

I'd agree - NEWID() is a performance killer.

It's much faster to do something like this...

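DECLARE @RandomNumber float
DECLARE @RandomInteger int
DECLARE @MaxValue int
DECLARE @MinValue int

-- Find the range of existing Id values
SELECT @MinValue = MIN(Id),
       @MaxValue = MAX(Id)
FROM dbo.SomeTable

-- RAND() returns a float between 0 and 1; scale it to an integer in the Id range
SELECT @RandomNumber = RAND()
SELECT @RandomInteger = ((@MaxValue + 1) - @MinValue) * @RandomNumber + @MinValue

-- Take the first row at or after the random Id (this also copes with gaps in Id).
-- The ORDER BY makes sure TOP (1) picks the nearest row rather than an arbitrary one.
SELECT TOP (1) *
FROM dbo.SomeTable
WHERE Id >= @RandomInteger
ORDER BY Id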

DannyS SSC Veteran Points: 231 More actions · Answer 5

GUIDs, and the SQL Server function NEWID() that creates them, are designed to be globally unique identifiers. That is not the same as random, and they likely do not have very good random properties.

One hex digit (4 bits) is used to identify the generating algorithm. I'm sure a good many bits represent the time of generation.

If you're using SQL Server 2008, then CRYPT_GEN_RANDOM(n) (n = number of bytes) creates a cryptographically secure pseudo-random number. These are usually the best available without a true hardware random number generator, and they do execute for each row, unlike RAND().

If you're on an earlier version but have 2008 available, consider a linked query to obtain the numbers, or write a custom CLR function (see System.Security.Cryptography.RNGCryptoServiceProvider).
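
A small sketch of the per-row difference, assuming SQL Server 2008 or later and using sys.objects purely as a convenient source of rows. The argument to CRYPT_GEN_RANDOM is a length in bytes, and the column aliases are illustrative:

```sql
-- RAND() without a seed is evaluated once per query, so every row shows the same value.
-- CRYPT_GEN_RANDOM() and NEWID() should produce a fresh value for each row.
SELECT TOP (5)
       name,
       RAND()                           AS rand_value,   -- identical on every row
       CAST(CRYPT_GEN_RANDOM(4) AS int) AS crypt_int,    -- 4 random bytes read as an int
       CHECKSUM(NEWID())                AS newid_int     -- int derived from a new GUID
FROM sys.objects;
```

Either of the per-row expressions can be used in an ORDER BY in place of NEWID(); the sort cost discussed earlier in the thread stays the same.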

David.Poole SSC Guru Points: 75899 More actions · Answer 6

I've done quite a bit of testing of NEWID() and randomness and it is a very effective random number generator.

Because a GUID consists of blocks of hexadecimal digits, those blocks can be converted into integer values.

I took a customer file where the PK was a GUID and based a partitioning scheme on a converted integer value modulo 16, and it ended up with a near-ideal partition distribution.

Try the following

DECLARE
    @Test char(4),
    @MyInt INT,
    @SQL NVARCHAR(200),
    @ParmDefinition NVARCHAR(200)

-- Take the first four hex digits of a new GUID
SET @Test = LEFT(NEWID(), 4)

-- Build a statement that converts the hex literal 0x<digits> to an integer
SET @SQL = N'SET @MyInt=CONVERT(INT,0x' + @Test + N')'
SET @ParmDefinition = N'@MyInt int OUTPUT'

EXEC sp_executesql
    @SQL,
    @ParmDefinition,
    @MyInt = @MyInt OUTPUT

SELECT @Test, @MyInt
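
A rough way to look at the modulo-16 distribution without dynamic SQL, offered as a sketch rather than as the test described above. It assumes SQL Server 2008 or later, where CONVERT with style 2 turns a hex string into varbinary. The row source (sys.all_columns cross-joined with itself) and the sample size of 16,000 are arbitrary choices for illustration:

```sql
-- Generate a batch of GUIDs, convert the first four hex digits of each
-- to an integer, and count how many land in each of 16 buckets.
WITH guids AS (
    SELECT TOP (16000) NEWID() AS g
    FROM sys.all_columns AS a
    CROSS JOIN sys.all_columns AS b
)
SELECT bucket, COUNT(*) AS rows_in_bucket
FROM (
    SELECT CONVERT(int,
               CONVERT(varbinary(2), LEFT(CONVERT(char(36), g), 4), 2)
           ) % 16 AS bucket
    FROM guids
) AS x
GROUP BY bucket
ORDER BY bucket;
```

Bucket counts close to 1,000 each would be consistent with an even spread. This only looks at how the values fall into buckets; it says nothing about whether consecutive values are independent of each other.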


DannyS SSC Veteran Points: 231 More actions · Answer 7

The values generated have some nice properties, but they aren't random.

Converting the first 8 digits to an integer, hex2dec(left([guid],8)), the numbers show a definite correlation with the previously generated value and with the 16th previous value, even when the IDs are generated at decidedly unpredictable times (visitors to a website taking a particular action).

In the prize example, every 16th person might have a chance of winning that is 10% or more greater than the others'.

Martin Vrieze SSCrazy Points: 2760 More actions · Answer 8

I like the concept of using a guid as a pseudo random number. Very novel approach!!!

I've always used auxiliary tables in tempdb combined with forward-only cursors to assign unique, random numbers to each record when setting up direct marketing test panels. I generally do not use cursors, but this was quite fast for my needs. Direct marketing / database marketing queries are mostly ad hoc in nature anyhow, and my systems have not had to worry about performance hits the way a production OLTP system would.

This alternate approach you outline IS going to get tested.

Well Done!!!

Kendal Van Dyke SSCertifiable Points: 5972 More actions · Answer 9

Using NEWID to do a random sort or grab a random number of rows from a result set is a HUGE performance killer and does not scale well. I've had developers slip this kind of stuff into production and in less than a minute the CPUs were pegged at 100%.

While this method works, I do not recommend it for anything beyond one-time ad hoc DBA queries or infrequently used applications. You can sort randomly much more efficiently using RAND. Rather than type a lengthy explanation of how here, I will submit an article.

Kendal Van Dyke
http://kendalvandyke.blogspot.com/

GabyYYZ SSCertifiable Points: 7913 More actions · Answer 10

Neil Franken (3/1/2010)
Hi There
Cool article but here is what I have found with newid().
...
Right, let's say the profile (salary) matches 1.5 million rows in the database. SQL Server will retrieve all 1.5 million rows and then apply the ORDER BY. Now, I mentioned that a GUID is very, very random. This causes a high CPU load, as the poor server first has to sort the 1.5 million rows before it can return the top 1000. Think about it: it has to sort before it can return anything. I have tested this, and it does not matter whether I return 1 row or 750,000 of the 1.5 million rows that match the query; it consistently runs at the same speed. The TOP can only be applied once the sorting is done. Granted, for small tables and non-mission-critical queries this technique can work well, but I would not use it on large tables, as you will create a bottleneck.
...

One option, especially if you have an indexed identity column on your source table, is to generate a separate table of random row numbers, create a clustered index on it, and join with the original table.

create table #lookup_table(row_num int)

declare @ctr int, @samplesize int
set @ctr = 0
set @samplesize = 1000 -- for example, a sample size of 1,000 is needed

-- insert one random non-negative integer per iteration
while @ctr < @samplesize
BEGIN
    insert into #lookup_table select abs(checksum(newid()))
    set @ctr = @ctr + 1
END

create clustered index idxc on #lookup_table(row_num)

Join to this table and the query should run much more quickly, since the entire original table would not have to be loaded. There is not much chance of duplicate rows with this method, as an INT can be up to 2,147,483,647.
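
A set-based variant of the same idea, as a sketch under stated assumptions: the source table is the hypothetical dbo.SomeTable with an indexed integer Id column (names borrowed from the earlier post), the random values are mapped onto the existing Id range so that the join has a realistic chance of finding rows, and sys.all_columns is used only as a convenient row source. It assumes SQL Server 2005 or later for TOP with a variable.

```sql
DECLARE @samplesize int, @min_id int, @max_id int;
SET @samplesize = 1000;

-- Range of Id values in the (hypothetical) source table
SELECT @min_id = MIN(Id), @max_id = MAX(Id)
FROM dbo.SomeTable;

CREATE TABLE #lookup_table (row_num int NOT NULL);

-- One INSERT instead of a loop. The bitwise AND clears the sign bit,
-- giving a non-negative integer that is then mapped onto the Id range.
INSERT INTO #lookup_table (row_num)
SELECT TOP (@samplesize)
       @min_id + (CHECKSUM(NEWID()) & 2147483647) % (@max_id - @min_id + 1)
FROM sys.all_columns AS a
CROSS JOIN sys.all_columns AS b;

CREATE CLUSTERED INDEX idxc ON #lookup_table (row_num);

-- DISTINCT guards against the same Id being drawn twice.
SELECT s.*
FROM dbo.SomeTable AS s
JOIN (SELECT DISTINCT row_num FROM #lookup_table) AS l
    ON l.row_num = s.Id;

DROP TABLE #lookup_table;
```

Gaps in Id and duplicate draws mean the result can contain fewer than @samplesize rows, so drawing some extra values and trimming with TOP may be needed in practice.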

Gaby
________________________________________________________________
"In theory, theory and practice are the same. In practice, they are not." - Albert Einstein

Paul White SSC Guru Points: 150467 More actions · Answer 11

adish (3/1/2010)
I've been using GUID as PKs, but this is a novel way of using it. Great.

Not a clustered Primary Key, I trust?

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi

Paul White SSC Guru Points: 150467 More actions · Answer 12

GabyYYZ (3/1/2010)

One option, especially if you have an indexed identity column on your source table, is to generate a separate table of random row numbers, create a clustered index on it, and join with the original table.

Nice idea. Of course, the 'random' numbers are then a bit, er, 'fixed' aren't they?

Can't believe you used a RBAR method to populate your table. 😛

For smallish numbers of random rows, I prefer an approach very similar to the one posted by Gary earlier.

It does require a table with a sequential ID, but that's pretty common - excepting those that like GUIDs as a PK *shudder*

Paul White

Jonathan Kehayias One Orange Chip Points: 26778 More actions · Answer 13

Paul White (3/1/2010)
adish (3/1/2010)
I've been using GUID as PKs, but this is a novel way of using it. Great.
Not a clustered Primary Key, I trust?

It probably was clustered; it's common for app developers to do this kind of thing. It happened at Microsoft around the Windows 7 RC downloads...

http://www.sqlskills.com/BLOGS/PAUL/post/Why-did-the-Windows-7-RC-failure-happen.aspx

SQLBOT SSCrazy Eights Points: 8014 More actions · Answer 14

Paul White (3/1/2010)
adish (3/1/2010)
I've been using GUID as PKs, but this is a novel way of using it. Great.
Not a clustered Primary Key, I trust?

Obviously the super-best data type for a PK is one that increments by 1 each time, so that each insert is appended to the index and there is no fragmentation.

Any other data type is just as likely to fragment as a GUID. I left-handedly proved this in one of my shamefully RBAR-riddled articles: http://www.sqlservercentral.com/articles/Indexing/64424/

It's the random insertion, not the data type, that causes the problem.

What's the difference if the data inserted is Johnson, Jonsonn, Johnsen or three GUIDs?

Under the hood, there's no difference.

The only way you're not going to fragment your index at the same rate as with a GUID is if the data is inserted in the same order as the clustered index key. In OLTP, that's pretty unlikely except with an identity (which has its own set of issues).

~Craig O.

Craig Outcalt

Tips for new DBAs: http://www.sqlservercentral.com/articles/Career/64632
My other articles: http://www.sqlservercentral.com/Authors/Articles/Craig_Outcalt/560258