Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Why UniqueIdentifier is a Bad Choice for a Clustered Index in SQL Server

I had a question today about why it was “bad” to use a UniqueIdentifier as the data type for a clustered index in SQL Server. After all, on the surface that sounds like a good choice, since a UniqueIdentifier is guaranteed to be unique, right?  I have run into many developers over the years that are seduced by this idea…

Unfortunately, using a UniqueIdentifier is not a good choice from SQL Server’s perspective. UniqueIdentifier is a relatively large data type, so your table will be larger than if you can use a numeric data type such as Integer. Far worse is how quickly your clustered index will become fragmented as you INSERT even a relatively small number of rows into the table. This can hurt query performance for certain types of queries, and it makes your clustered index much larger.

In the example below, simply INSERTING 2000 rows of data into an empty table gives you over 99% fragmentation and nearly doubles the size of the table (compared to using an Integer for the clustered index).  Using a default value of NEWSEQUENTIALID() for the myID column alleviates the fragmentation issue, but you still have a clustered index that is about 30% larger than if you use an integer data type for the clustered index.

Note: The script below only works for SQL Server 2008 and above. There are only a couple of minor changes needed to make it work on SQL Server 2005.

-- Why UNIQUEIDENTIFIER is a bad choice for a clustered index data type
-- Glenn Berry 
-- March 2010
-- http://glennberrysqlperformance.spaces.live.com/
-- Twitter: GlennAlanBerry


-- Switch to Master
USE [master];
GO

-- Drop database if it exists
IF  EXISTS (SELECT name FROM sys.databases WHERE name = N'TestDB')
    BEGIN
        ALTER DATABASE [TestDB] SET  SINGLE_USER WITH ROLLBACK IMMEDIATE;
        DROP DATABASE [TestDB];
    END
GO


-- Create new database (change file paths if needed
CREATE DATABASE [TestDB] ON  PRIMARY 
( NAME = N'TestDB', FILENAME = N'C:\SQLData\TestDB.mdf' , 
  SIZE = 5MB , FILEGROWTH = 50MB )
 LOG ON 
( NAME = N'TestDB_log', FILENAME = N'C:\SQLLogs\TestDB.ldf' , 
  SIZE = 2MB , FILEGROWTH = 50MB)
GO


-- Switch to new database
USE TestDB;
GO


-- Do some things that will grow the database and fragment indexes

-- Create a "bad" table 
-- (never use a UNIQUEIDENTIFIER for your PK, clustered index!)
IF  EXISTS (SELECT * FROM sys.objects 
            WHERE object_id = OBJECT_ID(N'[dbo].[TestBadTable]') 
            AND type in (N'U'))
DROP TABLE [dbo].[TestBadTable];
GO
CREATE TABLE [dbo].[TestBadTable](
    [myID] [uniqueidentifier] NOT NULL,
    [myColumn] [varchar](2000) NULL,
 CONSTRAINT [PK_TestBadTable] PRIMARY KEY CLUSTERED 
(
    [myID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, 
 ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY];
GO



-- Insert some data to fragment the clustered table
SET NOCOUNT ON;
DECLARE @iteration INT = 0;

WHILE @iteration < 2000
BEGIN
    INSERT INTO dbo.TestBadTable(myID, myColumn)
    SELECT NEWID(), REPLICATE('A', 2000);
    
    SET @iteration = @iteration + 1;
END
GO
-- End of insert loop

-- Check Table size (7.9MB)
EXEC sp_spaceused TestBadTable, True;

-- Check how badly the clustered index is fragmented (99.3%)
SELECT avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats 
(DB_ID(), OBJECT_ID(N'TestBadTable'), NULL , NULL, N'Limited') 
OPTION (MAXDOP 1);


-- Reorganize the clustered index (this is always an online operation)
ALTER INDEX [PK_TestBadTable] 
ON [dbo].[TestBadTable] REORGANIZE WITH ( LOB_COMPACTION = ON )
GO

-- Check how badly the clustered index is fragmented (2.4%)
SELECT avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats 
(DB_ID(), OBJECT_ID(N'TestBadTable'), NULL , NULL, N'Limited') 
OPTION (MAXDOP 1);

-- Rebuild the clustered index (this can be done online with Enterprise Edition)
ALTER INDEX [PK_TestBadTable] ON [dbo].[TestBadTable] REBUILD PARTITION = ALL 
WITH ( PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, ALLOW_ROW_LOCKS  = ON, 
ALLOW_PAGE_LOCKS  = ON, ONLINE = ON, SORT_IN_TEMPDB = OFF )
GO



-- Create another table using NEWSEQUENTIALID as a default
IF  EXISTS (SELECT * FROM sys.objects 
            WHERE object_id = OBJECT_ID(N'[dbo].[TestNotAsBadTable]') 
            AND type in (N'U'))
DROP TABLE [dbo].[TestNotAsBadTable];
GO
CREATE TABLE [dbo].[TestNotAsBadTable](
    [myID] [uniqueidentifier]  NOT NULL DEFAULT NEWSEQUENTIALID(),
    [myColumn] [varchar](2000) NULL,
 CONSTRAINT [PK_TestNotAsBadTable] PRIMARY KEY CLUSTERED 
(
    [myID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, 
 ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY];
GO


-- Insert some data to fragment the clustered table
SET NOCOUNT ON;
DECLARE @iteration INT = 0;

WHILE @iteration < 2000
BEGIN

    INSERT INTO dbo.TestNotAsBadTable(myColumn)
    SELECT REPLICATE('A', 2000);
    
    SET @iteration = @iteration + 1;
END
GO
-- End of insert loop

-- Check Table size (5.3MB)
EXEC sp_spaceused TestNotAsBadTable, True;

-- Check how badly the clustered index is fragmented (1%)
SELECT avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats 
(DB_ID(), OBJECT_ID(N'TestNotAsBadTable'), NULL , NULL, N'Limited') 
OPTION (MAXDOP 1);


-- Reorganize the clustered index (this is always an online operation)
ALTER INDEX [PK_TestNotAsBadTable] 
ON [dbo].[TestNotAsBadTable] REORGANIZE WITH ( LOB_COMPACTION = ON )
GO

-- Check how badly the clustered index is fragmented (2.4%)
SELECT avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats 
(DB_ID(), OBJECT_ID(N'TestNotAsBadTable'), NULL , NULL, N'Limited') 
OPTION (MAXDOP 1);

-- Rebuild the clustered index (this can be done online with Enterprise Edition)
ALTER INDEX [PK_TestNotAsBadTable] ON [dbo].[TestNotAsBadTable] REBUILD PARTITION = ALL 
WITH ( PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, ALLOW_ROW_LOCKS  = ON, 
ALLOW_PAGE_LOCKS  = ON, ONLINE = ON, SORT_IN_TEMPDB = OFF )
GO


-- Create a much better table
IF  EXISTS (SELECT * FROM sys.objects 
            WHERE object_id = OBJECT_ID(N'[dbo].[TestGoodTable]') 
            AND type in (N'U'))
DROP TABLE [dbo].[TestGoodTable];
GO
CREATE TABLE dbo.TestGoodTable
(MyID INT IDENTITY(1,1) NOT NULL,
 MyColumn VARCHAR(2000) NULL,
 CONSTRAINT PK_TestGoodTable PRIMARY KEY CLUSTERED 
(MyID ASC)
WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, 
ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY];
GO


-- Insert some data to fragment the clustered table
SET NOCOUNT ON;
DECLARE @iteration INT = 0;

WHILE @iteration < 2000
BEGIN
    INSERT INTO dbo.TestGoodTable(MyColumn)
    SELECT REPLICATE('A', 2000);
    
    SET @iteration = @iteration + 1;
END
GO
-- End of insert loop

-- Check Table size (4MB)
EXEC sp_spaceused TestGoodTable, True;

-- Check how badly the clustered index is fragmented (less than 1%)
SELECT avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID(N'TestGoodTable'), NULL , NULL, N'Limited') 
OPTION (MAXDOP 1);

-- Reorganize the clustered index (this is always an online operation)
ALTER INDEX [PK_TestGoodTable] 
ON [dbo].[TestGoodTable] REORGANIZE WITH ( LOB_COMPACTION = ON )
GO

-- Rebuild the clustered index (this can be done online with Enterprise Edition)
ALTER INDEX [PK_TestGoodTable] ON [dbo].[TestGoodTable] REBUILD PARTITION = ALL 
WITH ( PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, 
ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, ONLINE = ON, SORT_IN_TEMPDB = OFF )
GO

Comments

Posted by Anonymous on 23 March 2010

Pingback from  Dew Drop &#8211; March 23, 2010 | Alvin Ashcraft&#039;s Morning Dew

Posted by Anonymous on 24 March 2010

Pingback from  Twitter Trackbacks for                 SQL Server Central, Why UniqueIdentifier is a Bad Choice for a Clustered Index in SQL Server - Glenn Berry's SQL Server Performance         [sqlservercentral.com]        on Topsy.com

Posted by Steve Jones on 24 March 2010

Great points, and glad to see this posted again to teach people the issue.

Posted by mills.brent on 30 March 2010

Another thing I don't hear mentioned very often is that newsequentialid() is not necessarily sequential after the server is rebooted.  We have run across this multiple times where the current guid range evaluates as in between 2 other guid ranges previously generated.

Posted by Jason Brimhall on 30 March 2010

Thanks for pointing this stuff out.  Having an article such as this helps to reinforce points I have made to my dev team.

Posted by ziangij on 30 March 2010

Thanks... i always use int datatype for clustered indexes... however got to know something new today...

Posted by john.david.birch on 30 March 2010

I am glad that the newsequentialid() is not entirely sequential. I have based a lottery system on its randomness.

If it was predicatable then there would be a lot of complaints.

Posted by ALZDBA on 30 March 2010

Good points toward the pro and cons.

btw newsequentialid ( on sql2005 ) will generate sequential unique identifiers for the current insert statement. So Splits will still occur because of the "randomness" of the generated guid.

Don't shoot me if this behaviour changed over the servicepacks or versions of sqlserver.

Posted by peter.row on 30 March 2010

It's a shame that this is the case because from object oriented development point of view it would be nice if you could generate a guid in code and pass that with the other data when saving to the database instead of having to use an output parameter to get the auto-generated ID from SQL Server.

Posted by richard.jereb on 30 March 2010

Thanx for your contribution...

The sexy idea behind a GUID is IMHO that you can create and refer the ID anywhere and anytime, like e.g. create a new object in memory, give the ID locally on the client and still you can be sure to have a unique ID when you persist later on.

The drawbacks outweigh though, when you consider anything bigger than a prototype with a few dozens of rows.

NEWSEQUENTIALID() looks like a solution at the first glance - but this is database and server specific solution again... I see absolutely no advantage against an INT or BIGINT with IDENTITY.

There is a possible workaround described in

www.informit.com/.../article.aspx

by combining a GUID and the current timestamp which works quite well - you can also generate your IDs on the client side with this technique.

But take care - chances for duplicates are higher as a part of the GUID is just turncated there. Also entries within the same millisecond are not sequential. And lastly - the method utilizes a proprietary behaviour of SQL Server as described in

msdn.microsoft.com/.../system.data.sqltypes.sqlguid(VS.71).aspx

sqlblog.com/.../how-are-guids-sorted-by-sql-server.aspx

I would not bet my two cents that it will always stay the same way sorting by the last 6 bytes of a binary :-)

So at the and there is still somehow a kind of bad gut feeling when it comes to GUID discussions with developer...

Posted by richard.jereb on 30 March 2010

Just in case you do not want to click through the links I added above: the combined GUID is basically just 10 bytes of a GUID with the last 6 bytes replaced by a timestamp...

cast(cast(NewID() as binary(10)) + cast(GetDate() as binary(6)) as uniqueidentifier)

Posted by andrea torre on 30 March 2010

Thank you for the article,

does it make sense to use a uniqueidentifier column as primary key, non clustered, and a clustered index on another column, for example a Datetime column, assuming that the select queryies are ordered by the Datetime column?

Thank you

Posted by dbowlin on 30 March 2010

At one point in time I too used a UniqueIdentifier for a clustered index.  I quickly saw the error of my ways for most of the reasons above.  

I am curious however on everyone's opinion on when to use a UniqueIdentifier.

Thanks

Posted by Daniel van Wyk on 30 March 2010

I've heard that using a uniqueidentifier as PK helps a lot with replication.  Is that true?  I guess if you only do one-way replication it might not be such an issue.

Posted by taiwoa on 30 March 2010

uniqueidentifier columns filled with NEWID() values are expedient under certain situations (if one doesn't want to deal with the management complexity of "rolling your own").

As noted in SQL Server 2008 Books Online - ms-help://MS.SQLCC.v10/MS.SQLSVR.v10.en/s10de_1devconc/html/345e371c-6d93-421c-909c-4e0f5d490764.htm - "...for example, in a customer billing system with data located in various company subsidiaries around the world. When the data is merged into the central site for consolidation and reporting, using globally unique values prevents customers in different countries/regions from having the same billing number or customer ID". I agree that one wouldn't want to use GUIDs as billing numbers or customer IDs but similar situations do exist...

For instance, I have phone and address tables that store phones and addresses, respectively, for  by different entities (employee, customer, employer, subscriber, applicant, etc.) The ID of the phone or address owner needs to be globally unique; hence, for expediency, each entity that needs to store data in the phone and address tables must have a primary key (or unique constraint column) of type uniqueidentifier, filled with NEWID(). This is the situation even if the database is used in a single installation. Now consider that the database must be used in several single installations that must replicate data to a central host database.

The alternative, of course, is to roll your own unique ID generator composed of:

1. Location ID and/or user ID

2. Table name/ID

3. An auto-incrementing number generator for the table (perhaps a column with an IDENTITY property)

The custom solution would still result in clustered index fragmentation at the host database but perhaps not as severe as using uniqueidentifier columns filled with NEWID(). However, the custom unique identifier solution comes with its own management complexity.

Clustered index fragmentation is a trade-off that one must live with to accommodate these perhaps uncommon situations.

- Taiwo

Posted by ilya.margolin on 30 March 2010

It is an everlasting battle between key datatypes. Two comments: First, in my performance tests back in 2000 days I saw no significant difference when self joining 700,000 record table on a uniqueidentifier vs. integer. Second, is logistical issue of migrating data between prod/dev/qa/ua  environments. With integers there will be a key clash at some point which leads to a costly reconciliation. That is the latter that forced me to switch to uniqueidentifiers.

Posted by David.Poole on 30 March 2010

Sometimes a GUID is a necessary evil, particularly if you have data streaming in from multiple sites into a single location.

The other alternative is to have a compound primary key where the site specific PK is a BIGINT but the record also supplies a SiteID value to indicate where it came from.

The problem with any form of integer is that you will eventually run out of values.  You have to build up monitoring systems to check, in advance, if you are going to hit the limit.

Don't cluster the GUIDs.

Where we use GUIDs tends to be in high volume persistence tables so we put the clustered key on a DATETIME field holding the CURRENT_TIMESTAMP for the parent record.  You still get some fragmentation but nothing that can't be handled easily.

Posted by jparker2 on 30 March 2010

IMHO it never makes sense to use a GUID to index or identify any set of objects with less than 9,223,372,036,854,775,807 items. This huge number is the Maximum Positive value of a BIGINT.  Think about it for a minute.  Does .Net or Java create arrays using a GUID or Integer as the Index Id of objects inside of them? Why would you want to do something that even the architects of your development language think is a bad idea? Does writing code that is contrary to the memory and architecture of the platform you are developing against make sense?

Posted by roger.plowman on 30 March 2010

I ran into a situation that required a GUID as a record ID because the application would be distributed and authors needed to create records with unique ID's across a random collection of systems potentially anywhere in the world.

My solution, use a GUID only as an external reference, and create a table of GUIDs in the local database with a *LOCAL* integer identity. Use the identity as the primary key for everything and the GUID as a unique non-clustered key.

Then the only time the GUID had to be dealt with was new record creation, importing records and exporting records.

Worked really well. I got the convenience, performance, and fragmentation resistance of the identity and the globally guaranteed uniqueness of the GUID.

Win win. :)

Posted by Jason Shadonix on 30 March 2010

Question - If you have a linking table that has a GUID primary key, and only two other columns that are GUIDs serving as foreign keys:  Is it better to make all three of them nonclustered indexes, or go ahead and cluster on the one most joined/searched on?

Posted by Brice Cave on 30 March 2010

When I started thinking about this problem, the first solution that came to mind was similar to what Roger mentioned.  

For systems that have multiple ptentially detached clients that will 'sync' at a later time, there needs to be something other than an integer to ensure uniqueness across installations.

With Roger's solution your sync logic may be a little extra work, but definitely worth the effort.  Syncronization could be handled via messaging (or a bulk sync process) that carried the uniqueid from the separate table.

You might also use a compound key of installation key/site key (or user key) and an identity IF you tracked the additional information.  You could get around the uniqueid above, but you may also have some additional constraints to work with Ex:  What if a user logged in at multiple stations?  Could there possibly be a user/identity combination that would clash?

There is also the 'COMB' unique id method described in the Nilsson article referenced above.  I like this idea, too.

For those of you who are uniqueid detractors...

Are there other methods than what I have noted above that you have used to ensure integrity between disconnected systems (using replication or another sync process)?

-Brice

Posted by Steven Willis on 30 March 2010

In a case where I really needed to use a guid as a primary key, I have used Jimmy Nilsson's method mentioned above.

S E L E C T (CONVERT([uniqueidentifier],CONVERT([binary](10),newid(),0)+CONVERT([binary](6),getdate(),0),0)) AS [guid]

Here's links to Nilsson's original article and another blog post that discusses it.

www.informit.com/.../article.aspx

chillicode.wordpress.com/.../using-int-or-guid-for-primary-keys-or-comb-guid

Posted by SQLBOT on 30 March 2010

I've pointed out several times that unless you know your clustered index is inserted sequentially (such as the case of an identity, which has its own set of problems) you end up with this same problem and at the same scale.  

The only thing that might make a guid CU split more or less than another random insert-based CU is that a guid is guaranteed to be 16 bytes possibly taking up more (or possibly less) than another key taking up more (or less) space in the data page.

Nobody jumps on natural keys like they do the poor, homely GUID...  

My short point is that nothing is perfect.

Neither the GUID Clustered Index, or the complete avoidance thereof based on cherry-picked comparisons.

Posted by kevincox on 30 March 2010

I've seen projects where a GUID PK is required/preferred.  Very few of those need to have the GUID as a clustered index.  But there is one situation where the GUID clustered index actually helped.  It is where there were so many inserts coming from so many connections that creating a clustered index on the GUID took away the hot spot.  Of course, it meant an occasional index rebuild using a 50% or 60% fill factor to keep it from doing too many page splits.

Posted by DLathrop on 30 March 2010

The GUID/UUID mechanism was developed a long time ago for distributed systems, generally has OS and programming languguage support across platforms. Also, for what it does, the GUID is about as small as it can be (16 bits) and still work universally. So rather than come up with some new mouse trap, MS just brought GUIDs into SQL Server as T-SQL "uniqueidentifiers".

You could use a "site-id + IDENTITY()" combination, but then you have to uniquely assign site-ids to each machine that might possibly generate a new record. The GUID essentialy takes care of that for you. Many examples I've seen for using GUIDs involve arbitrary workstations, PDAs or even cell phones generating and storing records until a batch of records are submitted to the central database.

I would be careful about Jimmy Nilsson's method of "combined GUIDs". The repeating "4" digit is because Windows 2000 uses format 4 UUIDs (which are not based on MAC address and date); the rest of the number is supposed to be psuedo-random but you never know which parts may be "more random" than others. I prefer to keep the GUID as a secondary index (with an inflated index padding factor to lower the number of splits), and introduce a surrogate key with a clusterd index as the table's primary key.

------

For john.david.birch and others: GUIDs are unique, but not random. They should not be used for cryptographic purposes, or for generating lottery numbers. In particular, Windows generated GUIDs actually have predictable sequences.

Posted by philip_soeborg on 31 March 2010

Hi, Great article. Do you have any examples of "This can hurt query performance for certain types of queries"? Thanks

Posted by Ron Kunce on 2 April 2010

I'd have to say that the clustered index size is important and using a GUID is not the way to go but also, I do not like using an integer (identity) field either.  For me, the clustered index default sort order should be something meaningful and I prefer using a composite key value such as LastName + FirstName + identityValue (the last used in case of duplicate names) for my clustered index.  Though this may produce a larger index size and consequently slower performance for most tables of a relatively small size it provides a meaningful sort order.  With larger tables however speed becomes tantamount and the identity integer is  the better choice.

To answer Daniel van Wyk's post, it is my understanding that a uniqueidentifier (GUID) field is necessary for replication however it does not have to the table's primary key.

Posted by Chris Plowman on 16 June 2010

I am not sure why you would want to use a GUID as your clustered index.  There are certainly issues where it is helpful to use a GUID for a primary key, but there is no reason it needs to also be a clustered index.  For that matter, I don't think it makes much sense to use an identity column either since in most cases I've seen queries will be performed off real data rather than an arbitrary identifier.  Performance-wise I think you are almost always better off clustering off a natural key.

Posted by Rob Schripsema on 4 October 2010

So, we don't create a clustered index on the GUID, which is the primary key for most of our tables. (This is because we regularly merge data from multiple SQL instances into a single, master instance, and there are a lot of relations between tables that need to be maintained during that merge -- and we don't want to have to write all sorts of FK conversion scripts during the import.)

Is there any disadvantage to leaving the table with NO clustered index and letting SQL assign the RID, which I believe will dictate the physical order of the data and become the actual lookup value used by other indexes? Is that any more or less efficient than adding an additional identity field and creating a clustered index on that?

Posted by Glenn Berry on 5 October 2010

Rob,

Having a lot of "heap" tables (with no clustered index) is nor a good idea in my opinion.

I understand why you are using GUIDs as a Primary Key, and it makes sense in your situation. If it were me, I would add an int or bigint artificial key identity column and put the clustered index on that.

Leave a Comment

Please register or log in to leave a comment.