Auto generated SQL Server keys with the uniqueidentifier or IDENTITY

By:   |   Comments (38)   |   Related: 1 | 2 | 3 | 4 | > Identities


Problem

I'm designing a table and I've decided to create an auto-generated primary key value as opposed to creating my own scheme or using natural keys. I see that SQL Server offers globally unique identifiers (GUIDs) as well as identities to create these values. What are the pros and cons of these approaches?

Solution

Yes, there are a number of ways you can auto-generate key values for your tables. The most common ways are via the use of the IDENTITY column property or by specifying a uniqueidentifier (GUID) data type along with defaulting with either the NEWID() or NEWSEQUENTIALID() function. Furthermore, GUIDs are heavily used in SQL Server Replication to uniquely identify rows in Merge Replication or Transactional Replication with updating subscriptions.

The most common, well known way to auto-generate a key value is via the use of the IDENTITY column property on a column that's typically declared as an integer. Once defined, the engine will automatically generate a sequential number based on the way the property has been declared on the column. The IDENTITY property takes an initial seed value as its first parameter and an increment value as its second parameter.

Consider the following example which creates and inserts into identity based tables that define the primary key as a clustered index:

SET NOCOUNT ON
GO
USE MASTER
GO
CREATE DATABASE MSSQLTIPS
GO

USE MSSQLTIPS
GO
-- Start at 1 and increment by 1
CREATE TABLE IDENTITY_TEST1 
(
ID INT IDENTITY(1,1) PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO

-- Start at 10 and increment by 10
CREATE TABLE IDENTITY_TEST2
(
ID INT IDENTITY(10,10) PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO

-- Start at 1000 and increment by 5
CREATE TABLE IDENTITY_TEST3
(
ID INT IDENTITY(1000,5) PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO

-- INSERT 1000 ROWS INTO EACH TEST TABLE 
DECLARE @COUNTER INT
SET @COUNTER = 1

WHILE (@COUNTER <= 1000)
BEGIN
   INSERT INTO IDENTITY_TEST1 DEFAULT VALUES 
   INSERT INTO IDENTITY_TEST2 DEFAULT VALUES 
   INSERT INTO IDENTITY_TEST3 DEFAULT VALUES 
   SET @COUNTER = @COUNTER + 1
END
GO

SELECT TOP 3 ID FROM IDENTITY_TEST1
SELECT TOP 3 ID FROM IDENTITY_TEST2
SELECT TOP 3 ID FROM IDENTITY_TEST3
GO

result sets

Another way to auto-generate key values is to specify your column as a type of uniqueidentifier and DEFAULT using NEWID() or NEWSEQUENTIALID(). Unlike IDENTITY, a DEFAULT constraint must be used to assign a GUID value to the column.

How do NEWID() and NEWSEQUENTIALID() differ? NEWID() randomly generates a guaranteed unique value based on the identification number of the server's network card plus a unique number from the CPU clock. In contrast, NEWSEQUENTIALID() generates these values in sequential order as opposed to randomly.

Let's create new tables that use a uniqueidentifier along with both NEWID() and NEWSEQUENTIALID()

USE MSSQLTIPS
GO

CREATE TABLE NEWID_TEST
(
ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO
CREATE TABLE NEWSEQUENTIALID_TEST
(
ID UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO

-- INSERT 1000 ROWS INTO EACH TEST TABLE 
DECLARE @COUNTER INT
SET @COUNTER = 1

WHILE (@COUNTER <= 1000)
BEGIN
   INSERT INTO NEWID_TEST DEFAULT VALUES 
   INSERT INTO NEWSEQUENTIALID_TEST DEFAULT VALUES 
   SET @COUNTER = @COUNTER + 1
END
GO

SELECT TOP 3 ID FROM NEWID_TEST
SELECT TOP 3 ID FROM NEWSEQUENTIALID_TEST
GO

UNIQUEIDENTIFIER values

As you can see, the first table which uses NEWID() generates random values while the second table that uses NEWSEQUENTIALID() generates sequential values. As opposed to the integers generated by the IDENTITY approach, the GUID values generated are not as friendly to look at or work with. There is one other item to note. SQL Server keeps the last generated identity value in memory which can be retrieved right after an INSERT using SCOPE_IDENTITY(), @@IDENTITY, or CHECK_IDENT (depending on the scope you require). There is nothing similar to capture the last generated GUID value. If you use a GUID, you'll have to create your own mechanism to capture the last inserted value (i.e. retrieve the GUID prior to insertion or use the SQL Server 2005 OUTPUT clause).

Now that we understand how to auto generate key values and what they look like, let's examine the storage impacts of each approach. As part of the previously created table definitions, I added a column of CHAR(2000) to mimic the storage of additional column data. Let's examine the physical storage of the data:

USE MSSQLTIPS
GO
SELECT OBJECT_NAME([OBJECT_ID]) as tablename, avg_fragmentation_in_percent, fragment_count, page_count
FROM sys.dm_db_index_physical_stats (DB_ID(), null, null, null, null)
ORDER BY tablename
GO

fragmentation data

Looking at this output, you can see that the NEWID() test table is very fragmented as evidenced by its fragmentation percentage of 98%. Furthermore, you can see that the rows were dispersed among 490 pages. This is due to the page splitting that occurred due to the random nature of the key generation. In contrast, the IDENTITY and NEWSEQUENTIALID() test tables show minimal fragmentation since their auto generated keys occur in sequential order. As a result, they don't suffer from the page splitting condition that plagues the NEWID() approach. Though you can defragment the NEWID() table, the random nature of the key generation will still cause page splitting and fragmentation with all future table INSERTs. However, page splitting can be minimized by specifying an appropriate FILL FACTOR.

Looking at the NEWSEQUENTIALID() test table, we see it generated fewer pages than the NEWID() approach but it still generated more pages than the IDENTITY approach. Why is this? It's because the uniqueidentifier data type consumes 16 bytes of disk space as opposed to the 4 bytes used by the integer data type that was used for the IDENTITY. Considering that SQL Server pages are generally capped at 8K or roughly 8060 bytes (as of SQL Server 2005, there is a row-overflow mechanism that can kick in but that's for another discussion), this leads to more pages generated for the NEWSEQUENTIALID() approach as opposed to the IDENTITY approach.

Examining the database table space used, we see that the tables using the IDENTITY approach used the least amount disk space.

exec sp_spaceused IDENTITY_TEST1
GO
exec sp_spaceused IDENTITY_TEST2
GO
exec sp_spaceused IDENTITY_TEST3
GO
exec sp_spaceused NEWID_TEST
GO
exec sp_spaceused NEWSEQUENTIALID_TEST
GO

space used per table

Now also consider this, since a uniqueidentifier data type consumes 16 bytes of data, the size of any defined non-clustered indexes on a table using a GUID as a clustered index are also affected because the leaf level of these non-clustered indexes contains the clustered index key as a pointer. As a result, the size of any non-clustered indexes would end up being larger than if an IDENTITY were defined as integer or bigint.

It's evident that using IDENTITY to auto-generate key values offers a few advantages over the GUID approaches:

  • IDENTITY generated values are configurable, easy to read, and easier to work with
  • Fewer database pages are required to satisfy query requests
  • In comparison to NEWID(), page splitting (and its associated overhead) is eliminated
  • Database size is minimized
  • System functions exist to capture the last generated IDENTITY value (i.e. SCOPE_IDENTITY(), etc)
  • Some system functions - such as MIN() and MAX(), for instance - cannot be used on uniqueidentifier columns
Next Steps
  • Read more about NEWSEQUENTIALID() in the SQL Server 2005 Books Online
  • Read Using uniqueidentifier Data in the SQL Server 2005 Books Online
  • If you're not in a situation where you require a globally unique value, consider if an IDENTITY makes sense for auto-generating your key values.
  • Regardless if you decide on a GUID or IDENTITY, consider adding a meaningful UNIQUE key based on the real data in your table.


sql server categories

sql server webinars

subscribe to mssqltips

sql server tutorials

sql server white papers

next tip



About the author
MSSQLTips author Armando Prato Armando Prato has close to 30 years of industry experience and has been working with SQL Server since version 6.5.

This author pledges the content of this article is based on professional experience and not AI generated.

View all my tips



Comments For This Article




Tuesday, October 1, 2019 - 10:49:56 PM - Jim Evans Back To Top (82644)

Great article. Nice details comparing identity to guids. Often I have dealt with very large tables heavily fragmented because of guid pkeys, when an int column with identity would have worked just fine.  Also, for those worried about running out of numbers they can use bigint with identity and hold 9,223,372,036,854,775,807 rows.  Another trick with int or bigint is to start the seed at the largest negitive number. Thanks for the info.


Thursday, February 23, 2017 - 10:51:35 PM - Asdirs Back To Top (46632)

Custom Generated ID with SQL Server

how to create a function id number in sql server? id primary example I want my customers table reading area code like this: Area Codes 12345 and Consumer Code 00001 , if combined into 12345-00001, 12345-00002, 12345-00003 and so on, then the area code 12344-00001, 12344-00002, 12344-00003 and so on, so the consumer code will increase following the area code. I have a table like this:

table customer (id, name, kode_area, idcustomer) and table area (id, name)
table customer (00001, A, 12345, 12345-00001) and table area (12345, BandarLampung-Indonesia)
table customer (00002, B, 12345, 12345-00002) and table area (12345, BandarLampung-Indonesia)
table customer (00002, B, 12345, 12345-00003) and table area (12345, BandarLampung-Indonesia)
table customer (00001, AA, 12344, 12344-00001) and table area (12344, Tanggamus-Indonesia)
table customer (00002, BB, 12344, 12344-00002) and table area (12344, Tanggamus-Indonesia)
table customer (00002, BC, 12344, 12344-00003) and table area (12344, Tanggamus-Indonesia)

Monday, February 6, 2017 - 12:36:12 PM - Michael Ryan Back To Top (46055)

Wow, what a fantastic article and analysis of UUID generation, thanks!

We have been pondering going to a UUID based primary key scheme for one of systems (we are currently on an INT based PK) and this really helps illustrate the difference.

I ran this test on a local MSSQLExpress 2014 box and the results are very similar but with the Sequential UUID showing less fragmentation than the traditional Identity model.

Interesting stuff...thanks!

 

IDENTITY_TEST1 2.8 38 250

IDENTITY_TEST2 2.8 38 250

IDENTITY_TEST3 2.8 38 250

NEWID_TEST 97.9959919839679 498 499

NEWSEQUENTIALID_TEST 1.49700598802395 47 334

 

 

 


Thursday, August 21, 2014 - 9:38:44 AM - Mike Back To Top (34220)

I find this a very simplified look at generating keys.

 

You don't know about replication?

 

I do like SQL Server 2012's ability to create your own unique pattern also.

 


Friday, April 12, 2013 - 3:12:32 PM - george hardy Back To Top (23336)

so it looks like i can still maintain my readable index numbers that are auto-incrementing integers, just remove it as the primary key, and begin to use these guids as the real row key.  this way, my code doing something like "DELETE FROM EQUIPMENT WHERE EQUIPMENT_ID = x" will still work.

i also want a much cleaner way to utilize the sync framework, and had i known this before, that task would have been MUCH easier!

 

thanks for the write-up.


Monday, January 14, 2013 - 2:26:26 AM - URVISH SUTHAR Back To Top (21434)

Great job, many thanks for sharing :)

 


Wednesday, May 23, 2012 - 11:26:06 PM - Ravi Chauhan Back To Top (17628)

how to delete record from table where primary key(ID) is auto generated, and we need to delete that record according to primary example

DELETE TOP(1) FROM dbo.Test WHERE ID = 1

but we dont know the ID no in above scenario....


Wednesday, May 23, 2012 - 2:26:09 AM - Ravi Chauhan Back To Top (17600)

We create example of two tables,

but suppose we have three table IDENTITY_TEST1, IDENTITY_TEST2 and IDENTITY_TEST3

Means A >> B >> C (link sign ">>")

so how to create the relationship in these three tables?

 


Wednesday, May 23, 2012 - 2:17:38 AM - Ravi Chauhan Back To Top (17599)

create database db

use db

CREATE TABLE IDENTITY_TEST1
(
ID INT IDENTITY(1,1) PRIMARY KEY,
TESTCOLUMN varchar(20)
)

CREATE TABLE IDENTITY_TEST2
(
EMP_ID INT IDENTITY(1,1) REFERENCES IDENTITY_TEST1(ID),
ADDR VARCHAR(20)
)

insert into IDENTITY_TEST1 values('cde')

select * from IDENTITY_TEST1

insert into IDENTITY_TEST2 values('acd')

select * from IDENTITY_TEST2


Thursday, April 19, 2012 - 6:00:38 PM - Enigmatic Back To Top (17010)

@Javed,

Yes that is correct. IDENTITY is a counter and every time you insert a record the next number in that counter is used. This is done to ensure you ALWAYS get a unique number that could not possibly be in the table. If you want to go back and "fill in" missing numbers then you are going to have to implement your own locking mechanism to first lock the table, then select the first missing number and then unlock it once you have found it. If you do not do this, there is a chance multiple calls made at the same time could potentially return the same number and attempt to both insert the same key which would give you a primary key violation.

If you want to delete all records and restart your numbering at 1 you need to use:

DBCC CHECKIDENT ("<Schema>.<TableName>")

This will reset the IDENTITY back to its original value


Thursday, April 19, 2012 - 9:11:55 AM - javed Back To Top (17000)

Hi. sir,

My questions is I have a column id and set as primary key identity in a table.whenever I delete any record from table and insert new record then column id start count as where last id number was exists.And  whenvere delete all record from table. then insrt new record then does not start with 1.whereas start with last id was there.

plz help me..do reply me.. 


Monday, April 2, 2012 - 1:37:32 AM - Enigmatic Back To Top (16725)

One thing that wasn't listed were the disadvantages of using IDENTITY columns.

The biggest issue I have is the use of index values in code. Where a function accepts an integer as the index, a person can place any number they like in there, regardless of what that number respresents. It opens you up to the possibility of more errors as every table with an identity will have a row with ID = 1, 2, 3, etc.

So for instance, there would be nothing stopping me from accidentally transposing an ID from one table into an ID of another table, and in fact I could create foreign key indexes between unrelated data simply because the ID values of one table exist in the other.

To me this is a loss of referencial integrity, and the signature of C# methods that require IDs all end up something like (int, int, int). To me this isn't IDentifying records at all, its imply numbers.

For this reason I prefer to use GUIDs as you are guaranteed that the GUID value in one table will not appear in any other table other than the one it belongs in, thus ensuring no mistakes in foreign keys, and no accidental transposing when filling out method signatures. When you consider how many functions, stored procedures, methods, and queries either pass identifiers or compare identifiers, the odds of transposing increase with every single new addition.

This level of security and reliability easily "trumps" a bit of disk space in a world that is now being measured in Terabytes, or the use of MIN/MAX functions on an IDENTITY field (that makes no sense!), or a few less pages being created.

Anyone who has spent days trying to find out why something isn't working correctly only to realise that they "accidentally" inserted an IndustryTypeID into an IndustryID, but it never complained because there just happens to be an IndustryID = 3 the same as there is an IndustryTypeID = 3, or why records occasionally said it violated a primary key but only at what appears to be random intervals, etc.


Thursday, March 22, 2012 - 12:51:58 AM - sandeep kumar Back To Top (16573)

How to create cust_id autogenrated conditions are-

1- not match to previous id

2- look like cust_id AB001

Please reply me.

                                 Thank You.

 


Tuesday, March 20, 2012 - 9:53:03 AM - Zeni Back To Top (16534)

 

Thanks. This article was helpful for me in deciding primery key datatype for my database.

Thanks


Tuesday, February 22, 2011 - 5:41:51 AM - Fayssal El Moufatich Back To Top (12997)

Thanks for the nicely written article! So, unless one is in a situation where he needs a glabally unique identifier, one should avoid using GUIDs. And even, in this situation, it is recommended to generate the keys using NEWSEQUENTIALID(), as long as it is possible. I agree! :)


Monday, January 19, 2009 - 6:38:33 PM - aprato Back To Top (2585)

You answered your own question.  You work in a distributed environment and they make sense in this scenario.  Replication makes use of GUIDs for this reason.   I personally wouldn't use a GUID for an application that didn't have a valid business reason for it.  They have their place but they do have some downside.   


Monday, January 19, 2009 - 5:50:12 PM - laughingskeptic Back To Top (2584)

But itsn't this a bit of a red herring?  The natural fill factor of an index on random data (such as a GUID) that experiences primarily inserts is 75%.  If you compact it, then you just cause a bunch of page splits as soon as the inserts begin again, potentially swinging quickly from a fill of 100% to 50% before settling back in to 75%.  A SQL Server page can hold around 254 GUIDs, so if your table has significantly more rows than this you should expect to see close to 100% fragmentation.  For random indexes, average fill factor is a better metric than percent fragmentation.  If the number is not about 75% then an investigation is warranted.

GUIDs are 4 times bigger than an integer, but if your server has reached memory constraints (most real ones do) a GUID index will be 5 times slower than an integer at lookups.  This is a design consideration.

I work with some highly distributed databases, involving multitiered replication and thousands of clients (some being third parties).  We would never get this system to work without using a lot of GUIDs.  GUIDs are especially handy when multiple clients can work standalone then expect to re-sync when they connect.


Monday, October 27, 2008 - 7:40:26 AM - aprato Back To Top (2092)

Hi

I was just stating that you'll still get rapid index fragmentation (heavy, actually) if you decide to create a non-clustered index on a uniqueidentifier that uses NEWID().  It doesn't take many rows.  I had done a test with just 10,000 insertions and I still had an index with 95% fragmentation.  Due to the random generation, you'll get rapid fragmentation; it won't take many rows.   It's just something to be aware of.


Monday, October 27, 2008 - 6:29:39 AM - mharr Back To Top (2091)

[quote user="aprato"]

You should note that even using NEWID() for a non-clustered index will lead to fragmentation for that index. Recall that index structures are sorted logically.

[/quote]

Yes, but:

  • a fragmented index is less likely (since an index "row" will be smaller than a data "row", and will have more entries and more room for new entries in an index), and will be less
  • rebuilding an index to remove fragmentation is easier and less disruptive than reordering a table.
  • while doing "one of " lookups of a primary key where the primary key is a guid (most used query using a primary key), using a fragmented index and unfragmented data pages will likely still be more efficient than using a fragmented data pages where the guid is clustered primary key.

Thursday, October 23, 2008 - 12:38:00 PM - aprato Back To Top (2070)

You should note that even using NEWID() for a non-clustered index will lead to fragmentation for that index. Recall that index structures are sorted logically.


Thursday, October 23, 2008 - 12:13:46 PM - Wiseman82 Back To Top (2069)

[quote user="DiverKas"]

[quote user="mharr"]

Nice article on the advantages/disadvantages of identity and uniqueidentifier keys.  However, there is a relatively simple solution to the problem of fragmentation when using uniqueidentifier keys: don't use the primary key as a clustered index.

By default, when you create a primary key, it is created as a clustered index.  But it does not have to be so.  You can choose any other column as the clustered index, then the data rows will be physically ordered by that column.

How are the queries on the table going to be used?  If it is by and large going to be random access, then perhaps a DateInserted date column would be appropriate.  If there will frequent access of a customer table by territory number, it may be best to set up a clustered index by that column.  Or, choose not to have any clustered index, and rows will always be added to the end of the last data page, and time-consuming page splits will never occur.

There is no rule, in practice or theory, that the primary key must be a clustered index.  A little more thought on how the table will be used and accessed can resolve the fragmentation issue, and provide better performance.

Hope this helps

[/quote]

That is a great point.  Its silly for Microsoft to have left the clustered index on an identity or guid column for that matter.  It makes no sense and certainly lends nothing to the performance.  It is usually the first thing I have to change on a newly created table.

 

[/quote]

Choosing another column for the clustered index might get around the fragmentation issue, but the primary key is often the best candidate for the clustered index.  Sequential GUIDs might be a better option to get around the fragmentation issue.  Also remember that the fragmentation issue would apply (to a lesser extent) to a non-clustered index created on the GUID column.

It's normally a good idea for the clustered index to be unique - if it's not SQL Server will add a uniqueifier to make the clustered index unique (additional overhead).  Due to the fragmentation issues already mentioned, it's also a good idea to to have an auto-incermenting clustered index - Identities and Sequential GUIDs both fit the bill here.  Another thing you need to consider is if the chosen column(s) for the clustered index are likely to be updated - Ideally you want to pick a column that is never/rarely updated - the primary key also fits the bill here.

I'd advise people to keep the clustered index as the primary key, unless they have a good reason for choosing a different column (MS made this a default for good reason).  I'm not saying that you always should use the primary key as a clustered index, but it's not a bad choice in the absence of another candidate for the clustered index.

In terms of query performance (rather than write performance), you might want to use a column that is frequently used in ORDER BY, GROUP BY, JOINS (foreign keys) or where the column is often used to select a range of values etc. 

As mentioned, a date column might be a good candidate if it's populated with a GetDate()/GetUTCDate() - you would expect a very low number of duplicate values and the column will be auto-incrementing.  It might also unlikely that the column will ever be updated.  If queries frequently sort on this column and filter it by a range of values then it would be a good candidate for the clustered index.

Also note that best practice will vary depending on the type of database.  If you are creating a data warehouse/reporting database rather than an OLTP database, you will have different priorities. 

 

 


Thursday, October 23, 2008 - 9:31:52 AM - Perre Back To Top (2068)

Nice article.  Also good comments telling there are also advantages with guids ... not that you said there aren't.

I personally prefer identity columns ... indead for passing back the key from Stored procedures to the client software, for easiness in SQL statements, joins, etc ... and as you prove nicely to limit page splits, index size, ...

Although I once had a problem when a branch office with it's own SQL server, with the same application and database started using VPN.  Business managers decided the databases should be merged together ... so I hit the problem we had the same clients id's in both databases but regarding different clients ... and of course those id's were spread accross a lot of tables as foreign keys.  By consequence I had a serious job merging the data.  Finally I decided to add an extra field to the autoincremental primary key, telling me it's a client from the branch office or one from the main office.  So nowadays when I have to use a key like that I consider adding an extra column.  Alternativly I could have added 10.000.000 to the id's from the branch office before importing the data ... but I didn't thought about that at that moment and it wouldn't have clearified the difference between clients from both offices inserted after the merge.

An old collegue of me always used GUIDs for his websitedatabases since they seem more complex in the URL ... avoiding the fact if you hit .com?id=127 people would try to enter .com?id=128 which should not be visible to them ... but afterwards I think this isn't a strong argument.  If you should not be allowed to see the page with .com?id=128 ... the website should block it :-)

 Cheers

 

 

 

 

 

 

 


Thursday, October 23, 2008 - 9:22:09 AM - DiverKas Back To Top (2067)

[quote user="mharr"]

Nice article on the advantages/disadvantages of identity and uniqueidentifier keys.  However, there is a relatively simple solution to the problem of fragmentation when using uniqueidentifier keys: don't use the primary key as a clustered index.

By default, when you create a primary key, it is created as a clustered index.  But it does not have to be so.  You can choose any other column as the clustered index, then the data rows will be physically ordered by that column.

How are the queries on the table going to be used?  If it is by and large going to be random access, then perhaps a DateInserted date column would be appropriate.  If there will frequent access of a customer table by territory number, it may be best to set up a clustered index by that column.  Or, choose not to have any clustered index, and rows will always be added to the end of the last data page, and time-consuming page splits will never occur.

There is no rule, in practice or theory, that the primary key must be a clustered index.  A little more thought on how the table will be used and accessed can resolve the fragmentation issue, and provide better performance.

Hope this helps

[/quote]

That is a great point.  Its silly for Microsoft to have left the clustered index on an identity or guid column for that matter.  It makes no sense and certainly lends nothing to the performance.  It is usually the first thing I have to change on a newly created table.

 


Thursday, October 23, 2008 - 9:19:29 AM - mharr Back To Top (2066)

Nice article on the advantages/disadvantages of identity and uniqueidentifier keys.  However, there is a relatively simple solution to the problem of fragmentation when using uniqueidentifier keys: don't use the primary key as a clustered index.

By default, when you create a primary key, it is created as a clustered index.  But it does not have to be so.  You can choose any other column as the clustered index, then the data rows will be physically ordered by that column.

How are the queries on the table going to be used?  If it is by and large going to be random access, then perhaps a DateInserted date column would be appropriate.  If there will frequent access of a customer table by territory number, it may be best to set up a clustered index by that column.  Or, choose not to have any clustered index, and rows will always be added to the end of the last data page, and time-consuming page splits will never occur.

There is no rule, in practice or theory, that the primary key must be a clustered index.  A little more thought on how the table will be used and accessed can resolve the fragmentation issue, and provide better performance.

Hope this helps


Thursday, October 23, 2008 - 9:15:52 AM - DiverKas Back To Top (2065)

[quote user="cbasoz"]

I found the conclusions at the end of the article a little bit biased to the identity usage. OK so far so good, IDENTITY has advantages. Doesn't it have any clear disadvantages? Does it really outperform GUIDs in performance?

I am not an SQL DBA but more a developer and I think GUIDs have undeniable advantages over IDENTITY. ie:

-I can create the key independantly from the data store. I know it is unique so I wouldn't bother to create data on my laptop totally disconnected and later insert to main db without any change in keys.

-Not needing db store to learn the key also means I wouldn't need roundtrips to server. Those roundtrips might sum to a big amount. Performance wise maybe it is better to use GUIDs then?

I wonder, if GUIDs are disadvantegous, why would ever SQL server itself use GUIDs in replication? SQL server team developers might do that with integer key fix ups, but they chose GUID. Probably they thought using GUIDs is cheaper than doing key fix ups?

(just thinking loud)

[/quote]

 There is a rather large performance difference between comparing 4bytes and 16 bytes of data, especially in join conditions.  This isnt a trival difference.

Also, your assumption that GUIDs are unique, by creating them on the client is false.  GUIDs have a reasonable chance of being unique, it is not guaranteed, and I see cases monthly where it is not true.  Since the GUID on a client is generated based on several pieces of "hardware" and time, it is quite possible through oem vendor mistakes and luck of the draw to have the same GUID assigned on different hardware.

GUIDs as a general rule should only be used for replication and distributed data.  In general it is also best to generate the key from a single source, thereby reducing the chance of a collision to about nil.

For performance sake, and if replication and distributed data is not a concern, Identity columns are a HUGE winner.

 


Thursday, October 23, 2008 - 8:59:47 AM - aprato Back To Top (2064)

I didn't take it negatively at all.  I understand what you're driving at.  


Thursday, October 23, 2008 - 8:36:40 AM - cbasoz Back To Top (2063)

I was referring to section starting with:

"It's evident that using IDENTITY to auto-generate key values offers a few advantages over the GUID approaches"

and you say:

"The reason they probably chose GUIDs is due to their uniqueness across space and time."

It sounds like a very good reason to choose a surrogate primary key to me so in the same manner I could say:

It's evident that using GUID to auto-generate key values offers a few advantages over the IDENTITY approaches

this statement is a fact too but I didn't see any mention and naturally thought it as biased. I didn't have a negative intention just meant I felt the article was not completed.


Thursday, October 23, 2008 - 8:21:42 AM - aprato Back To Top (2062)

 <<Yes - more or less.  You might also consider composite keys - using identity in conjunction with another column. You could also generate your own "unique" identifier based on some custom logic. You might even consider splitting identity ranges between tables. In this situation GUIDs would probably be the most attractive option IMO.>>

I would concur with Wiseman.  GUIDs may be the better option if you're working in a distributed environment. 


Thursday, October 23, 2008 - 7:59:01 AM - Wiseman82 Back To Top (2061)

M[quote user="pmh4514"]

Thanks for the followups to my questions!

 [quote user="aprato"]The reason they probably chose GUIDs is due to their uniqueness across space and time.[/quote]

If uniqueness across space in time is a fundamental requirement of a given distributed database archicture, am I correct in saying that this requirement alone makes the decision for us, and requires the use of GUID over IDENTITY, despite any inherent performance issues that come with that decision?

[/quote]

Yes - more or less.  You might also consider composite keys - using identity in conjunction with another column. You could also generate your own "unique" identifier based on some custom logic. You might even consider splitting identity ranges between tables. In this situation GUIDs would probably be the most attractive option IMO.


Thursday, October 23, 2008 - 7:49:52 AM - pmh4514 Back To Top (2060)

Thanks for the followups to my questions!

 [quote user="aprato"]The reason they probably chose GUIDs is due to their uniqueness across space and time.[/quote]

If uniqueness across space in time is a fundamental requirement of a given distributed database archicture, am I correct in saying that this requirement alone makes the decision for us, and requires the use of GUID over IDENTITY, despite any inherent performance issues that come with that decision?

 

 

 


Thursday, October 23, 2008 - 7:31:55 AM - aprato Back To Top (2059)

 <<I found the conclusions at the end of the article a little bit biased to the identity usage>>

In reality, it wasn't bias - they were statements in fact.  Using IDENTITY results in less database bloat.  Fewer data and index pages are generated (less churn for DML statements) as well as less bloat of non-clustered indexes if the GUID is the clustered index.    

The reason they probably chose GUIDs is due to their uniqueness across space and time.


Thursday, October 23, 2008 - 7:24:53 AM - Wiseman82 Back To Top (2058)

GUIDs do have benefits over identity columns but they come with quite a large overhead - 16 bytes compared to 4bytes for an integer.  Some of the problems with uniqueidentifiers can be overcome by using sequential guids, but identity is far better from a performance point of view.

Bottom line: Identities should be your first choice. If an identity column doesn’t fit the bill, consider a GUID. 

Also, don’t take peoples word for it.  Construct your own performance test to see how they perform in comparison to each other.  I’ve done so myself and found the difference to be significant.  The difference between sequential guids and non-sequential guids was also VERY significant.

 


Thursday, October 23, 2008 - 7:13:44 AM - aprato Back To Top (2057)

 

<<Question 1 would be "how are tables defraged?">>

 Depends on your version of SQL Server

2000: DBCC DBREINDEX or DBCC INDEXDEFRAG
2005: ALTER INDEX specifying either REBUILD or REORGANIZE

You could also drop and re-create the clustered index

<<2. Describe "fill factor" - how would I implement that? What is it specifying?>>

Creating indexes with a FILL FACTOR % tells the engine how full you want the index leaf pages (i.e. a FILL FACTOR of 80 means keep 20% of the leaf free for future inserts - or 80% full).  It minimizes splits (an intensive operation where 1/2 of the data on a page is moved to a new page) because the free space can accommodate any random row that can fit on the page.  It's used in high volume enviorments where page splitting becomes a problem.

 

 


Thursday, October 23, 2008 - 6:47:48 AM - cbasoz Back To Top (2056)

I found the conclusions at the end of the article a little bit biased to the identity usage. OK so far so good, IDENTITY has advantages. Doesn't it have any clear disadvantages? Does it really outperform GUIDs in performance?

I am not an SQL DBA but more a developer and I think GUIDs have undeniable advantages over IDENTITY. ie:

-I can create the key independantly from the data store. I know it is unique so I wouldn't bother to create data on my laptop totally disconnected and later insert to main db without any change in keys.

-Not needing db store to learn the key also means I wouldn't need roundtrips to server. Those roundtrips might sum to a big amount. Performance wise maybe it is better to use GUIDs then?

I wonder, if GUIDs are disadvantegous, why would ever SQL server itself use GUIDs in replication? SQL server team developers might do that with integer key fix ups, but they chose GUID. Probably they thought using GUIDs is cheaper than doing key fix ups?

(just thinking loud)


Thursday, October 23, 2008 - 6:45:33 AM - Wiseman82 Back To Top (2055)

You defrag tables by doing an clustered index rebuild/reorganize.  BOL will give you more info on this.

A lower fill factor will leave room for more items to be inserted, reducing page splits and improving performance of inserts.  Lower fill factors will have a negative impact on select performance though.  With an auto-increasing key (identity/seq guid) you can use a high fill factor - new items will be inserted at the end of the index. Fill factors are specified when the index is created - BOL will provide you with more info.


Thursday, October 23, 2008 - 6:45:00 AM - aprato Back To Top (2054)

<< There is a problem with this image - SQL Server actually sorts GUIDs by byte group from right to left >>

Yes, you're correct.  The image is meant to shows the differences in how the engine generates the value (i.e. the randomness of NEWID() vs NEWSEQUENTIALID())


Thursday, October 23, 2008 - 6:24:53 AM - pmh4514 Back To Top (2053)

Very interesting..I'm putting together a new distributed DB design using GUIDs as unique identifiers to ensure uniquness across many computers. We are still in early development, so design and implementation changes happen daily as I work through everything. I found this article interesting because it may represent some things I can do early to prevent some of the inherent problems with using GUIDs as identities. But there were a few points I was unsure about.

The author described page splitting that occurs, when discussing the defragmentation differences between using an incremental integer for an identity vs. a GUID.  He wrote:

Though you can defragment the NEWID() table, the random nature of the key generation will still cause page splitting and fragmentation with all future table INSERTs. However, page splitting can be minimized by specifying an appropriate FILL FACTOR.

Question 1 would be "how are tables defraged?" (I am more a programmer than a DBA for what it's worth) and 2. Describe "fill factor" - how would I implement that? What is it specifying?

 

 

 


Thursday, October 23, 2008 - 2:59:58 AM - jnollet Back To Top (2050)

Great article ... I've found this out too and its important to think about up front in the development process.  Trying to change later can be difficult.















get free sql tips
agree to terms