Are There That Many GUIDs?

  • A GUID is actually 16 bytes, but perhaps you only meant that a GUID is 12 bytes more than a 4 byte Int.

    Yes. I was referring to the delta between a 4 byte integer and a guid.

    In addition to table storage, we must also consider that a unique clustered key is used by SQL Server as a table's row id, so a GUID used as a clustered index gets included in all indexes as well.

    Yep. But what is the actual size difference? If i told you that choice A resulted in a database 4x as big as choice B you might immediately think that choice B is better. Now suppose I told you that the actual difference in size is say the difference between 1 MB and 4 MB. Who cares? Just as I could care less that a smalldatetime is half the size of a datetime, the size difference does not matter nearly as much as the performance difference in the real world with dozens of gigs of memory and very large disk arrays and SANs.

    Regarding the database application situation where where you occasionally need to merge databases, for example client A merges with client B, sequential GUIDs created using NEWSEQUENTIALID() can potentially be duplicated across multiple database instances. Now, in the event of a merger, the client will typically want to segment the data by profit center or division for accounting purposes, so that client_id or center_id column can be combined with a sequential integer identifier to make a key that is reliably unique across all database instances.

    First, if you are building a database driven product system, mergers happen more often than you think. Second, I've come to use COMB guids instead of the NewSequentialId which was added after some of the systems I built were put in place. Technically, yes hacking up the guid reduces its uniqueness. However, just because you use guids does not mean you abandon unique and primary key constraints. Yes, the potential for getting a collision between two systems using guids or hacked up guids is there but it is nearly guaranteed when you use integers.

    Adding another column that identifies the system is a solution fraught with peril. It's possible to accidentally reuse numbers that identify different systems (e.g. customer A and customer B both get ID = 2). Further, it means that every table would need both keys.

  • Thomas-282729 (10/25/2010)


    It amazes me that top notch developers still try to prematurely optimize.

    You and I may be talking about two different things. Doing it right to begin with isn't premature optimization. Take a look through just this forum where there are thousands of posts claiming something like "I have this performance problem and..." or "My log file just ran me out of disk space and..." or "I keep getting timeouts when I try to do a simple insert..." or "I keep getting memory warnings on my brand new server and...". Think about what a GUID will do if someone assigns it as the PK and accepts the default of having the PK also be the clustered index. Such consideration is far from being premature optimization.

    So far as dates go and as you well know, there are fairly large computational advantages to burning the extra 4 bytes to store a date as DATETIME in pre-2k8 databases. No such advantage exists for GUIDs... not even the "guarantee" of uniqueness when merging companies. Oh... I agree that it's likely to be unique but it's not guaranteed to be unique and if a dupe should happen especially when merging companies as you say, you'll still have the same problem as if you used some other form of identifier.

    Once a bit practiced, it doesn't take any longer to do it right all the time.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Eric Russell 13013 (10/25/2010)


    ...sequential GUIDs created using NEWSEQUENTIALID() can potentially be duplicated across multiple database instances.

    NEWSEQUENTIALID calls the Windows API function UuidCreateSequential, which uses version 1 of the GUID algorithm. This means that GUIDs created this way are only guaranteed to be globally unique if the machine used contains a network card. If a network card is not present, the generated GUID is only locally unique. BTW it is somewhat faster to generate a GUID through NEWSEQUENTIALID than by using NEWID.

    NEWSEQUENTIALID does not actually create truly sequential GUIDs, but SQL Server does guarantee that the next GUID produced will sort higher than the previous one. Don't be fooled into thinking that 'sequential' GUIDs created on separate machines might somehow 'overlap'.

  • On the subject of storage size:

    SQL Server stores data on pages with a fixed size of 8192 bytes. After 96 bytes are reserved for the page header, 8096 bytes remain to be used to store data rows. The number of rows that fit on a page is determined by the size of each row stored, plus overheads. Most often, it is not possible to fill a page completely with data rows.

    To take a simple example, consider a table with an integer key, where each row has a fixed total length of 370 bytes (including all internal overheads). With 8096 bytes available, we can fit 21 such rows on a page, using a total of 7770 bytes. The remaining 326 bytes are insufficient to hold a complete new row, and so remain unused.

    If we were to use a GUID in the key column of our example table instead, we would use an extra 12 bytes per row – 16 bytes for the GUID less the 4 bytes previously used by the index key. The fixed length of each row has now gone up to 382 bytes, but we can still fit 21 such rows on a page, using a total of 8022 bytes. There are now just 74 bytes of β€˜wasted’ space on the page.

    This will not always be the case, of course; the point is to emphasise that increasing the key size slightly does not necessarily directly increase the storage space requirement.

    When used in a non-clustered index, key size considerations become more important. For example, consider a unique non-clustered index, created on a table containing a GUID.

    If the indexed table is a heap, each index record requires a total of 27 bytes. (This breaks down into: 16 bytes for the GUID; 8 bytes for the row identifier that points back to the data row in the table, 2 bytes for the row offset table entry, and 1 byte used internally by SQL Server).

    Each data page used by the index has 8096 bytes available to store index data. This means that a page can store 299 such index records (8096 divided by 27), with 23 bytes of wasted space.

    For the same index created on 4-byte integer keys, the index record size is just 15 bytes, with 539 index records per page (and 11 bytes of wasted space).

    For a table containing a million rows, the GUID index requires 3345 pages. The same index using integer keys would require just 1856 pages.

    Most databases are not that well tuned, however. It is common to see relatively wide indexes (especially at the leaf level where many people seem to think that INCLUDEd columns are for free!) so I am not convinced that the width of a GUID should be a decisive consideration. It depends.

  • Jeff Moden (10/25/2010)


    Thomas-282729 (10/25/2010)


    It amazes me that top notch developers still try to prematurely optimize.

    You and I may be talking about two different things. Doing it right to begin with isn't premature optimization.

    The debate of whether identity values are "doing it right" is never ending. If you do guids right, the performance difference is negligible until you hit large database sizes which the vast majority of solutions will never hit. If a query takes 10ms longer because of a guid, the user is not going to notice. If you use guids and do not use a sequential or comb guid, then you are doing it wrong.

    Take a look through just this forum where there are thousands of posts claiming something like "I have this performance problem and..." or "My log file just ran me out of disk space and..." or "I keep getting timeouts when I try to do a simple insert..." or "I keep getting memory warnings on my brand new server and...".

    How many of those are because the person used a guid and did not choose a sequential or COMB guid? If they did not choose the proper default value, then I would chalk that up to "doing it wrong". Yes, if you are going to use guids you have to know the pitfalls just as you would with any other approach.

    Think about what a GUID will do if someone assigns it as the PK and accepts the default of having the PK also be the clustered index.

    Suppose they use a bigint. "Oh gosh" that's *twice* as big! The horror! The nightmare! That's four more extra bytes! Imagine if they use a cryptographically random value! Even worse! Using a guid as the clustered index is not the end of the world ***IF*** you use something like a COMB guid. Granted, you sacrifice some global uniqueness and when the database gets big there might even be some noticeable performance difference. However, for the vast majority of databases, they won't notice a difference.

    So far as dates go and as you well know, there are fairly large computational advantages to burning the extra 4 bytes to store a date as DATETIME in pre-2k8 databases. No such advantage exists for GUIDs... not even the "guarantee" of uniqueness when merging companies.

    I agree that there is no computational advantage to a guid. When coding an object model, there is an enormous advantage. When having to consolidate systems, there is an enormous advantage. The global uniqueness part of a guid is really a bonus. The fact that you can generate the value in the middle-tier and that there is a nearly 100% chance that it is unique across any systems with which you will deal are the real advantages.

  • Thomas-282729 (10/25/2010)


    Jeff Moden (10/25/2010)


    Thomas-282729 (10/25/2010)


    It amazes me that top notch developers still try to prematurely optimize.

    You and I may be talking about two different things. Doing it right to begin with isn't premature optimization.

    The debate of whether identity values are "doing it right" is never ending. If you do guids right, the performance difference is negligible until you hit large database sizes which the vast majority of solutions will never hit. If a query takes 10ms longer because of a guid, the user is not going to notice. If you use guids and do not use a sequential or comb guid, then you are doing it wrong.

    Take a look through just this forum where there are thousands of posts claiming something like "I have this performance problem and..." or "My log file just ran me out of disk space and..." or "I keep getting timeouts when I try to do a simple insert..." or "I keep getting memory warnings on my brand new server and...".

    How many of those are because the person used a guid and did not choose a sequential or COMB guid? If they did not choose the proper default value, then I would chalk that up to "doing it wrong". Yes, if you are going to use guids you have to know the pitfalls just as you would with any other approach.

    Think about what a GUID will do if someone assigns it as the PK and accepts the default of having the PK also be the clustered index.

    Suppose they use a bigint. "Oh gosh" that's *twice* as big! The horror! The nightmare! That's four more extra bytes! Imagine if they use a cryptographically random value! Even worse! Using a guid as the clustered index is not the end of the world ***IF*** you use something like a COMB guid. Granted, you sacrifice some global uniqueness and when the database gets big there might even be some noticeable performance difference. However, for the vast majority of databases, they won't notice a difference.

    So far as dates go and as you well know, there are fairly large computational advantages to burning the extra 4 bytes to store a date as DATETIME in pre-2k8 databases. No such advantage exists for GUIDs... not even the "guarantee" of uniqueness when merging companies.

    I agree that there is no computational advantage to a guid. When coding an object model, there is an enormous advantage. When having to consolidate systems, there is an enormous advantage. The global uniqueness part of a guid is really a bonus. The fact that you can generate the value in the middle-tier and that there is a nearly 100% chance that it is unique across any systems with which you will deal are the real advantages.

    And all of that is what I'm talking about. Doing it right isn't premature optimization. And considering disk storage for the sake of backup time and recovery time isn't something to be horrified by. πŸ˜‰

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/25/2010)


    And considering disk storage for the sake of backup time and recovery time isn't something to be horrified by. πŸ˜‰

    We should probably also consider that large databases would likely use data compression.

  • Paul White NZ (10/25/2010)


    Jeff Moden (10/25/2010)


    And considering disk storage for the sake of backup time and recovery time isn't something to be horrified by. πŸ˜‰

    We should probably also consider that large databases would likely use data compression.

    Maybe in 2k8+. Not everyone has that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (10/25/2010)


    Maybe in 2k8+. Not everyone has that.

    I was thinking more of backup compression than compression inside the database. For that, we have tools like LiteSpeed, HyperBac, SQLsafe, and SQLbackup...

  • Paul White NZ (10/25/2010)


    Eric Russell 13013 (10/25/2010)


    ...sequential GUIDs created using NEWSEQUENTIALID() can potentially be duplicated across multiple database instances.

    NEWSEQUENTIALID calls the Windows API function UuidCreateSequential, which uses version 1 of the GUID algorithm. This means that GUIDs created this way are only guaranteed to be globally unique if the machine used contains a network card. If a network card is not present, the generated GUID is only locally unique. BTW it is somewhat faster to generate a GUID through NEWSEQUENTIALID than by using NEWID.

    NEWSEQUENTIALID does not actually create truly sequential GUIDs, but SQL Server does guarantee that the next GUID produced will sort higher than the previous one. Don't be fooled into thinking that 'sequential' GUIDs created on separate machines might somehow 'overlap'.

    Two instances of SQL Server on the same machine would share the same network card, and if they are running the same database application, then it seems to me that it's possible a GUID based on network id + timestamp or network id + sequential id could overlap between them.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • This why I keep saying that some type of identifier unique to each instance of the application or user, like client_id or cashregister_id, when combined with a sequential integer or date/time stamp is the way to go (assuming the application database was initially setup correctly). If the operating environment wasn't setup correctly, then you need fix that and re-code the data. For example, if you have a POS system running locally at multiple retail outlets, and each sales transactions contains (store_id or cashregister_id) and sale_datetime, then you can merge those transactions into a central datawarehouse with no collisions. Since these columns already exist on the sales transaction record anyhow, there should be no need for any algorithm based surrogate key. If you tie the primary key with a physical entity in the real world, like an input device or a person, and also include date/time for context, then it's inherently unique.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric Russell 13013 (10/26/2010)


    Paul White NZ (10/25/2010)


    Eric Russell 13013 (10/25/2010)


    ...sequential GUIDs created using NEWSEQUENTIALID() can potentially be duplicated across multiple database instances.

    NEWSEQUENTIALID calls the Windows API function UuidCreateSequential, which uses version 1 of the GUID algorithm. This means that GUIDs created this way are only guaranteed to be globally unique if the machine used contains a network card. If a network card is not present, the generated GUID is only locally unique. BTW it is somewhat faster to generate a GUID through NEWSEQUENTIALID than by using NEWID.

    NEWSEQUENTIALID does not actually create truly sequential GUIDs, but SQL Server does guarantee that the next GUID produced will sort higher than the previous one. Don't be fooled into thinking that 'sequential' GUIDs created on separate machines might somehow 'overlap'.

    Two instances of SQL Server on the same machine would share the same network card, and if they are running the same database application, then it seems to me that it's possible a GUID based on network id + timestamp or network id + sequential id could overlap between them.

    It's theoretically possible, but because of the way threads work in a CPU, it almost never happens.

    Try simultaneously inserting a million GUIDs into temp tables on two different instances of SQL Server on the same computer, and then run an inner join on the two tables. Odds of any hits at all are very, very low. Try it.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Eric Russell 13013 (10/26/2010)


    This why I keep saying that some type of identifier unique to each instance of the application or user, like client_id or cashregister_id, when combined with a sequential integer or date/time stamp is the way to go (assuming the application database was initially setup correctly). If the operating environment wasn't setup correctly, then you need fix that and re-code the data. For example, if you have a POS system running locally at multiple retail outlets, and each sales transactions contains (store_id or cashregister_id) and sale_datetime, then you can merge those transactions into a central datawarehouse with no collisions. Since these columns already exist on the sales transaction record anyhow, there should be no need for any algorithm based surrogate key. If you tie the primary key with a physical entity in the real world, like an input device or a person, and also include date/time for context, then it's inherently unique.

    This works, but does have the disadvantage of things like accidentally re-assigning an existing register ID.

    Of course, either can suffer from someone doing something like copying a database from one register to another in order to get a crashed register up and running again, or something like that. (Less likely with cash registers, highly possible with other business applications.) In a case like that, collisions are inevitable, but can usually be handled easily.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Eric Russell 13013 (10/26/2010)


    Two instances of SQL Server on the same machine would share the same network card, and if they are running the same database application, then it seems to me that it's possible a GUID based on network id + timestamp or network id + sequential id could overlap between them.

    No. In theory the OS should prevent 'simultaneous' calls to UuidCreateSequential returning the same value - in practice, Microsoft found that under highly concurrent activity on AMD64 systems the OS could return the same value to multiple callers. So, SQL Server protects calls to UuidCreateSequential with a mutex (which is truly global). See http://blogs.msdn.com/b/sqlprogrammability/archive/2006/03/23/559061.aspx

  • GSquared (10/26/2010)


    This works, but does have the disadvantage of things like accidentally re-assigning an existing register ID. Of course, either can suffer from someone doing something like copying a database from one register to another in order to get a crashed register up and running again, or something like that. (Less likely with cash registers, highly possible with other business applications.)

    Many years ago, I was the lead DBA for a large international retail chain, using just such a distributed system - with a SQL Server instance (MSDE!) on each of 3000 registers. I can assure you that such occurrences were anything but rare πŸ™‚

Viewing 15 posts - 91 through 105 (of 169 total)

You must be logged in to reply to this topic. Login to reply