Randomizing Result Sets with NEWID

  • seth delconte

    SSCertifiable

    Points: 6388

    Comments posted to this topic are about the item Randomizing Result Sets with NEWID

    _________________________________
    seth delconte
    http://sqlkeys.com

  • Norman Rasmussen

    Ten Centuries

    Points: 1055

    For the purposes of the article, you probably want to select all customers with order_count >= MIN(order_count) of the top 10. That way you don't drop customers that tie on order count but happen to sort later.
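    A hedged sketch of that tweak (table and column names are assumed, not from the article); SQL Server's TOP ... WITH TIES does the same thing more directly:

    ```sql
    -- Include every customer whose order_count ties with the smallest value
    -- in the top 10, instead of cutting off at an arbitrary 10 rows.
    -- Customers / order_count are placeholder names.
    SELECT c.customer_id, c.order_count
    FROM Customers AS c
    WHERE c.order_count >=
          (SELECT MIN(t.order_count)
           FROM (SELECT TOP (10) order_count
                 FROM Customers
                 ORDER BY order_count DESC) AS t);

    -- Equivalent, more direct form:
    SELECT TOP (10) WITH TIES customer_id, order_count
    FROM Customers
    ORDER BY order_count DESC;
    ```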

  • Neil Franken

    SSC-Addicted

    Points: 431

    Hi There

    Cool article, but here is what I have found with NEWID().

    ORDER BY NEWID() is a performance killer. Let me explain. We have a table of 36 million prospective customers, and we send leads to sales agent centres daily. They want random data, so we have been using ORDER BY NEWID() for ages. The problem is that it is extremely slow. Painfully slow. Here is why: a GUID returned by NEWID() is essentially a very random number. Keep that fact in mind. Each day we send varying numbers of leads out of our system to different centres, basically with a query like this (the SELECT is simplified for readability).

    SELECT TOP 1000 leadname, contactdetails -- the TOP value varies per call centre
    FROM Prospects
    WHERE Salary >= 2500 -- we match our prospect profile here

    Right, let's say the profile (salary) matches 1.5 million rows in the database. SQL Server will fetch all 1.5 million rows and then ORDER BY. As I mentioned, a GUID is very, very random, and sorting those values puts a heavy CPU load on the server: the poor server first has to sort all 1.5 million rows before it can return the top 1000. Think about it - it has to finish sorting before it can return anything. I have tested this, and whether I return 1 row or 750,000 of the 1.5 million matching rows, the query runs at the same speed; the TOP can only be applied once the sorting is done. Granted, for small tables and non-mission-critical queries this technique can work well, but I would not use it on large tables, as you will create a bottleneck.

    For larger tables it might be worth randomizing the data on insert, so you don't have to worry about randomization during extraction. By the way, using NEWID() as the clustered key of a table is not a good idea, as the fragmentation of your table will remain consistently high.
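    Neil's randomize-on-insert idea could look something like the following sketch (the column and index names are my assumptions; leadname, contactdetails, Salary and Prospects come from his example above):

    ```sql
    -- Hypothetical sketch: persist a random sort key at insert time so that
    -- extraction can read from an index instead of sorting millions of GUIDs.
    ALTER TABLE Prospects
        ADD RandomSort int NOT NULL DEFAULT (ABS(CHECKSUM(NEWID())));

    CREATE INDEX IX_Prospects_RandomSort ON Prospects (RandomSort);

    -- Extraction no longer needs ORDER BY NEWID():
    SELECT TOP 1000 leadname, contactdetails
    FROM Prospects
    WHERE Salary >= 2500
    ORDER BY RandomSort;
    ```

    The trade-off is that the order is fixed until the RandomSort values are refreshed, so the same leads keep sorting first unless the column is periodically regenerated.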

    Hope that helps.

    Regards

  • adish

    SSChasing Mays

    Points: 602

    I've been using GUIDs as PKs, but this is a novel way of using them. Great.

  • gary-915906

    Valued Member

    Points: 74

    I'd agree - NEWID() is a performance killer.

    It's much faster to do something like this...

    DECLARE @RandomNumber float
    DECLARE @RandomInteger int
    DECLARE @MaxValue int
    DECLARE @MinValue int

    SELECT @MinValue = MIN(Id),
           @MaxValue = MAX(Id)
    FROM dbo.SomeTable

    SELECT @RandomNumber = RAND()
    SELECT @RandomInteger = ((@MaxValue + 1) - @MinValue) * @RandomNumber + @MinValue

    SELECT TOP (1) *
    FROM dbo.SomeTable
    WHERE Id >= @RandomInteger

  • DannyS

    SSC Veteran

    Points: 231

    GUIDs and the SQL Server function NEWID() create globally unique identifiers. Globally unique is not the same as random, and GUIDs likely do not have very good randomness properties.

    One hex digit (4 bits) is used to identify the algorithm version, and I'm sure a good many bits represent the time of generation.

    If you're using SQL Server 2008, then CRYPT_GEN_RANDOM(n) (n = number of bytes) creates a cryptographically secure pseudo-random number. These are usually the best available without a true hardware random number generator, and unlike RAND() they execute for each row.

    If you're on an earlier version but have 2008 available, consider a query via a linked server to obtain the numbers, or write a custom CLR function (see System.Security.Cryptography.RNGCryptoServiceProvider).
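    If CRYPT_GEN_RANDOM is indeed evaluated per row, the NEWID() sort from the article could be swapped out like this (a sketch; dbo.SomeTable is a placeholder name, and this still pays the full-sort cost discussed earlier in the thread):

    ```sql
    -- Requires SQL Server 2008+. CRYPT_GEN_RANDOM(4) returns 4 random bytes,
    -- converted here to an INT to serve as the sort key.
    SELECT TOP (10) *
    FROM dbo.SomeTable
    ORDER BY CONVERT(int, CRYPT_GEN_RANDOM(4));
    ```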

  • David.Poole

    SSC Guru

    Points: 75191

    I've done quite a bit of testing of NEWID() for randomness, and it is a very effective random number generator.

    As a GUID is composed of blocks of hexadecimal digits, those blocks can be converted into integer values.

    I took a customer file where the PK was a GUID, based a partitioning scheme on a converted integer value modulo 16, and ended up with a near-ideal partition distribution.

    Try the following

    DECLARE
        @test char(4),
        @MyInt INT,
        @SQL NVARCHAR(200),
        @ParmDefinition NVARCHAR(200)

    SET @test = LEFT(NEWID(), 4)
    SET @SQL = 'SET @MyInt=CONVERT(INT,0x' + @test + ')'
    SET @ParmDefinition = N'@MyInt int OUTPUT'

    EXEC sp_executesql
        @SQL,
        @ParmDefinition,
        @MyInt = @MyInt OUTPUT

    SELECT @test, @MyInt

  • DannyS

    SSC Veteran

    Points: 231

    The values generated have some nice properties, but they aren't random.

    Converting the first 8 digits to an integer, hex2dec(left([guid],8)), the numbers show a definite correlation with the previously generated value, and with the 16th-previous value, even when the IDs are generated at decidedly unpredictable times (visitors to a website taking a particular action).

    In the prize example, every 16th person might have a 10% or greater chance than the others.

  • Martin Vrieze

    SSCrazy

    Points: 2760

    I like the concept of using a GUID as a pseudo-random number. Very novel approach!

    I've always used auxiliary tables in tempdb combined with forward-only cursors to assign unique random numbers to each record when setting up direct marketing test panels. I generally do not use cursors, but this was quite fast for my needs; direct marketing / database marketing queries are mostly ad hoc in nature anyhow, and my systems have not had to worry about performance hits the way a production OLTP system would.

    This alternate approach you outline IS going to get tested.

    Well done!

  • Kendal Van Dyke

    SSCertifiable

    Points: 5962

    Using NEWID() to do a random sort or grab a random number of rows from a result set is a HUGE performance killer and does not scale well. I've had developers slip this kind of thing into production, and in less than a minute the CPUs were pegged at 100%.

    While this method works, I do not recommend it for anything beyond one-time ad hoc DBA queries or infrequently used applications. You can sort randomly much more efficiently using RAND. Rather than type a lengthy explanation of how here, I will submit an article.

    Kendal Van Dyke
    http://kendalvandyke.blogspot.com/

  • GabyYYZ

    SSCertifiable

    Points: 7913

    Neil Franken (3/1/2010)


    Hi There

    Cool article but here is what I have found with newid().

    ...

    Right, let's say the profile (salary) matches 1.5 million rows in the database. SQL Server will fetch all 1.5 million rows and then ORDER BY. As I mentioned, a GUID is very, very random, and sorting those values puts a heavy CPU load on the server: the poor server first has to sort all 1.5 million rows before it can return the top 1000. Think about it - it has to finish sorting before it can return anything. I have tested this, and whether I return 1 row or 750,000 of the 1.5 million matching rows, the query runs at the same speed; the TOP can only be applied once the sorting is done. Granted, for small tables and non-mission-critical queries this technique can work well, but I would not use it on large tables, as you will create a bottleneck.

    ...

    One option, especially if you have an indexed identity column on your source table, is to generate a separate table of random row numbers, create a clustered index on it, and join with the original table.

    create table #lookup_table (row_num int)

    declare @ctr int, @samplesize int
    set @ctr = 0
    set @samplesize = 1000 -- for example, a sample size of 1,000 is needed

    while @ctr < @samplesize
    begin
        insert into #lookup_table select abs(checksum(newid()))
        set @ctr = @ctr + 1
    end

    create clustered index idxc on #lookup_table (row_num)

    Do a join on this table and it should go much more quickly, since the entire original table would not be loaded. There is not much chance of duplicate rows with this method, as INT values can be up to 2,147,483,647.
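    The final join isn't shown in the post; here is a hedged sketch, assuming the source table has an indexed identity column id (the modulo step, which folds the random values into the valid id range, is my addition for illustration, not part of the original post):

    ```sql
    -- Fold the random numbers into 1..@rowcount, then join on the identity.
    -- dbo.Prospects and id are placeholder names.
    DECLARE @rowcount int;
    SELECT @rowcount = COUNT(*) FROM dbo.Prospects;

    UPDATE #lookup_table SET row_num = (row_num % @rowcount) + 1;

    SELECT p.*
    FROM dbo.Prospects AS p
    JOIN #lookup_table AS l ON l.row_num = p.id;
    ```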

    Gaby
    ________________________________________________________________
    "In theory, theory and practice are the same. In practice, they are not."
    - Albert Einstein

  • Paul White

    SSC Guru

    Points: 150442

    adish (3/1/2010)


    I've been using GUID as PKs, but this is a novel way of using it. Great.

    Not a clustered Primary Key, I trust?

  • Paul White

    SSC Guru

    Points: 150442

    GabyYYZ (3/1/2010)


    One option, especially if you have an indexed identity column on your source table, is to generate a separate table of random row numbers, create a clustered index on it, and join with the original table.

    Nice idea. Of course, the 'random' numbers are then a bit, er, 'fixed' aren't they?

    Can't believe you used a RBAR method to populate your table. 😛

    For smallish numbers of random rows, I prefer an approach very similar to the one posted by Gary earlier.

    It does require a table with a sequential ID, but that's pretty common - excepting those that like GUIDs as a PK *shudder*

  • Jonathan Kehayias

    One Orange Chip

    Points: 26672

    Paul White (3/1/2010)


    adish (3/1/2010)


    I've been using GUID as PKs, but this is a novel way of using it. Great.

    Not a clustered Primary Key, I trust?

    It probably was clustered; it's common for app developers to do this kind of thing. It happened at Microsoft around the Windows 7 RC downloads...

    http://www.sqlskills.com/BLOGS/PAUL/post/Why-did-the-Windows-7-RC-failure-happen.aspx

    Jonathan Kehayias | Principal Consultant | MCM: SQL Server 2008
    My Blog | Twitter | MVP Profile
    Training | Consulting | Become a SQLskills Insider
    Troubleshooting SQL Server: A Guide for Accidental DBAs

  • SQLBOT

    SSCrazy Eights

    Points: 8014

    Paul White (3/1/2010)


    adish (3/1/2010)


    I've been using GUID as PKs, but this is a novel way of using it. Great.

    Not a clustered Primary Key, I trust?

    Obviously the super-best data type for a PK is one that increments by 1 each time, so the index is appended to on each insert and there is no fragmentation.

    Any other data type is just as likely to fragment as a GUID. I left-handedly proved this in one of my shamefully RBAR-riddled articles: http://www.sqlservercentral.com/articles/Indexing/64424/

    It's the random insertion, not the data type, that causes the problem.

    What's the difference if the data inserted is Johnson, Jonsonn, Johnsen or three GUIDs?

    Under the hood, there's no difference.

    The only way you're not going to fragment your index at the same rate as a GUID is if the data is inserted in the same order as the clustered index key. In OLTP that's pretty unlikely, except with an identity (which has its own set of issues).

    ~Craig O.

Viewing 15 posts - 1 through 15 (of 38 total)
