RE: Randomizing Result Sets with NEWID

SSC-Addicted

Points: 431

March 1, 2010 at 12:13 am

Hi There

Cool article but here is what I have found with newid().

Doing a order by newid() is a performance killer. Let me explain. We have a table of 36 million prospective customers. We send leads to sales agent centres daily. They want random data. Great so we have been using order by newid() for ages. The problem is that is it extremely slow. Painfully slow. Here is why. A GUID returned by a newid() operation is essentially a very random number. Keep this fact in mind. In my daily tasks we send various amounts of leads out of our system for different centres. Basically we have a query like this(simplified the select for readability).

SELECT TOP 1000 leadname,contactdetails -- The TOP is variable per call centre

FROM Prospects

WHERE Salary=>2500 -- we match our prospect profile here

Right lets say the profile(salary) matches 1,5 million rows in the database. SQL server will return all 1,5 million rows then ORDER BY. Now I mentioned that a GUID is very very random. This causes a high cpu load on the server as the poor Server now has to first sort the 1,5 million rows then it can return the top 1000. Think about it. It first has to sort before it can return. I have tested this and it does not matter if I return 1 row or 750 000 out of the 1,5 million rows that matches the query it constantly runs at the same speed. The top can only be applied once the sorting is done. Granted for small tables and non mission critical queries this technique can work well I would not use it on large tables as you will create a bottle neck.

For larger tables it might worth randomizing the data on insert and not having to worry about the randomization during extraction. By the way newid() on a table as the clustered key is not a good idea as the fragmentation of your tables will remain consistently high.

Hope that helps.

Regards