UNIQUEIDENTIFIER vs BIGINT

  • Jeff Moden (10/3/2012)


    ScottPletcher (10/3/2012)


    Determining the appropriate column actually takes analysis; again, there is no one-rule-fits-all-tables for the best clustered key column.

    Customer IDs are usually unique and ever-increasing; narrowness sometimes suffers because a char type is used even though the values are purely numeric. Even so, a somewhat larger clustered key is often best overall for performance.

    But you didn't say Customer "ID"... you said Customer "Code". Perhaps it's a difference in terminology but I've found that the two are usually drastically different.

    So far as no one-rule-fits-all-tables goes, I agree. That notwithstanding, the rule I stated is very good for larger tables because of the implications for non-clustered indexes. About the only real analysis needed on such tables is determining whether the first column of the clustered index should be some sort of date column or not. A narrow and unique secondary column, such as an IDENTITY column, would be necessary, or an 8-byte internal row ID would be added to the clustered index. Narrowness is always a good thing in an index simply because you can fit more rows per index page, and the clustered index key is automatically appended to every non-clustered index that doesn't explicitly use the clustered index column(s).

    I don't think such extremely simplistic rules always work well in tables, large or small.

    Choosing the best indexes takes a lot of careful consideration of your data distribution and your query usage patterns, and there is no one-size-fits-all answer.

    If a table is declared with a pk (which is advisable), by default a clustered index will be built on the primary key column(s). However, this is not always the best choice. The primary key is, by definition, unique, and nonclustered indexes are very efficient at finding a single qualifying row and can still enforce the primary key's uniqueness. So save your clustered index for something that will benefit more from it.
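
    For example, something like this (a minimal sketch, with hypothetical table and column names) enforces the PK with a nonclustered index and spends the clustered index on the natural key instead:

    -- Sketch only (hypothetical names): the PK is enforced by a nonclustered index;
    -- the clustered index goes to the natural key that queries actually search on.
    CREATE TABLE dbo.Customer
    (
        CustomerId   INT IDENTITY  NOT NULL,
        CustomerCode VARCHAR(10)   NOT NULL,
        CustomerName VARCHAR(100)  NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED (CustomerId),
        CONSTRAINT UQ_Customer_Code UNIQUE CLUSTERED (CustomerCode)
    );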

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (10/4/2012)


    I don't think such extremely simplistic rules always work well in tables, large or small.

    Choosing the best indexes takes a lot of careful consideration of your data distribution and your query usage patterns, and there is no one-size-fits-all answer.

    If a table is declared with a pk (which is advisable), by default a clustered index will be built on the primary key column(s). However, this is not always the best choice. The primary key is, by definition, unique, and nonclustered indexes are very efficient at finding a single qualifying row and can still enforce the primary key's uniqueness. So save your clustered index for something that will benefit more from it.

    Yes, you are right! There is no one-size-fits-all answer to that.

    However...

    I guess there were some long discussions among the SQL Server (and, back in the old days, Sybase) design engineers/architects about making the PK clustered by default. They decided to do so, and not only to avoid creating heaps. The PK is not always the best choice for the clustered index, but quite often it is. Use of surrogate keys for the PK is quite popular in modern RDBMSs (at least in SQL Server and Oracle), so maybe that is another statistical manifestation of the 80-20 rule...

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • Eugene Elutin (10/4/2012)


    ScottPletcher (10/4/2012)


    I don't think such extremely simplistic rules always work well in tables, large or small.

    Choosing the best indexes takes a lot of careful consideration of your data distribution and your query usage patterns, and there is no one-size-fits-all answer.

    If a table is declared with a pk (which is advisable), by default a clustered index will be built on the primary key column(s). However, this is not always the best choice. The primary key is, by definition, unique, and nonclustered indexes are very efficient at finding a single qualifying row and can still enforce the primary key's uniqueness. So save your clustered index for something that will benefit more from it.

    Yes, you are right! There is no one-size-fits-all answer to that.

    However...

    I guess there were some long discussions among the SQL Server (and, back in the old days, Sybase) design engineers/architects about making the PK clustered by default. They decided to do so, and not only to avoid creating heaps. The PK is not always the best choice for the clustered index, but quite often it is. Use of surrogate keys for the PK is quite popular in modern RDBMSs (at least in SQL Server and Oracle), so maybe that is another statistical manifestation of the 80-20 rule...

    The problem is, that's nothing close to the real ratio in production. ~60+% of legacy tuning for me is just changing the dopey identity clustered index to the proper clustered index, with an easy 50+% gain in performance.

    A clustering key does NOT have to be unique to aid performance. That is the worst secondary myth that gets propagated by the main myth about always making the clus key an identity. It's just a thoughtless approach.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Eugene Elutin (10/4/2012)


    Yes, you are right! There is no one-size-fits-all answer to that.

    However...

    I guess there were some long discussions among the SQL Server (and, back in the old days, Sybase) design engineers/architects about making the PK clustered by default. They decided to do so, and not only to avoid creating heaps. The PK is not always the best choice for the clustered index, but quite often it is. Use of surrogate keys for the PK is quite popular in modern RDBMSs (at least in SQL Server and Oracle), so maybe that is another statistical manifestation of the 80-20 rule...

    I think that in more than 20% of cases a large enough proportion of the work required of the DBMS involves searches for a range of values in some initial columns of the natural key. A surrogate key may of course still be useful for reducing the storage required in other tables for references to this table (and consequently reducing the number of pages in those other tables handled by joins with this table), but while it does no harm to make the surrogate the primary key, it is not at all sensible in these cases to make it the cluster key. The natural key should have a unique constraint on it and non-null constraints on each of its columns (i.e. the table should be normalised at least to 1NF, to keep the code manipulating the table simple - actually I generally can't see any point in normalisation to lower levels than EKNF, but only 1NF is relevant to this discussion), and either that index (if it's narrow enough) or an index on the relevant initial part (perhaps just one column) of the natural key should be used for clustering.

    Tom

  • Eugene Elutin (10/4/2012)


    Yes, you are right! There is no one-size-fits-all answer to that.

    However...

    I guess there were some long discussions among the SQL Server (and, back in the old days, Sybase) design engineers/architects about making the PK clustered by default. They decided to do so, and not only to avoid creating heaps. The PK is not always the best choice for the clustered index, but quite often it is. Use of surrogate keys for the PK is quite popular in modern RDBMSs (at least in SQL Server and Oracle), so maybe that is another statistical manifestation of the 80-20 rule...

    I think that in more than 20% of cases a large enough proportion of the work required of the DBMS involves searches for a range of values in some initial columns of the natural key. A surrogate key may of course still be useful for reducing the storage required in other tables for references to this table (and consequently reducing the number of pages in those other tables handled by joins with this table), but while it does no harm to make the surrogate the primary key, it is not always sensible in these cases to make it the cluster key. The natural key should have a unique constraint on it and non-null constraints on each of its columns (i.e. the table should be normalised at least to 1NF, to keep the code manipulating the table simple - actually I generally can't see any point in normalisation to lower levels than EKNF, but only 1NF is relevant to this discussion), and often either that index (if it's narrow enough) or an index on the relevant initial part (perhaps just one column) of the natural key should be used for clustering.

    Every case needs careful analysis, and I don't believe there's an 80 to 20 rule in favour of clustering on a surrogate.

    Tom

  • Hi,

    I was reading about PKs and Clustered Indexes over at sqlskills.com and after reading too much I guess I burned a fuse...

    So pardon if this is a stupid question...

    The PK has to be UNIQUE and NOT NULL; the CL index doesn't necessarily have to be on the PK (example: PK on an IDENTITY column for FKs and joins, CL over a natural key like customerCode).

    If the CL columns can be updated or aren't sequential, are they still a good choice for the CL, or should we use the PK IDENTITY column for the CL?

    The non CL indexes "point" to what value? The PK or the CL index?

    Thanks,

    Pedro



    If you need to work better, try working less...

  • PiMané (10/10/2012)


    Hi,

    I was reading about PKs and Clustered Indexes over at sqlskills.com and after reading too much I guess I burned a fuse...

    So pardon if this is a stupid question...

    The PK has to be UNIQUE and NOT NULL; the CL index doesn't necessarily have to be on the PK (example: PK on an IDENTITY column for FKs and joins, CL over a natural key like customerCode).

    If the CL columns can be updated or aren't sequential, are they still a good choice for the CL, or should we use the PK IDENTITY column for the CL?

    There's no hard and fast rule for this. If you're strictly talking about the physical storage and retrieval of the clustered index (CI, not CL, would be a better abbreviation; NCI then follows for non-clustered index), then yes, a monotonically increasing value makes for a more efficient clustered index. But no, I wouldn't recommend just using an identity value as the primary key or as the clustered index. It really depends on how data is going to be stored and retrieved from within the table. I'm not a purist with regard to artificial or natural keys. I tend to bend my designs around what works best in a situation (and I know that cranks up both parties, but what can I say). You can get magnificent performance out of compound key values (as an example) or datetime or even strings, if those values are the best way to store and retrieve the data and your queries are written to use them.
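
    For instance (a sketch only, with hypothetical names), a date-led clustered key can serve a range-heavy workload very well even though it is neither an identity nor the PK:

    -- Sketch only (hypothetical names): the PK stays nonclustered for FKs and joins,
    -- while the clustered index is built for the common date-range access path.
    CREATE TABLE dbo.SalesOrder
    (
        SalesOrderId INT IDENTITY NOT NULL,
        OrderDate    DATETIME     NOT NULL,
        CustomerCode CHAR(10)     NOT NULL,
        CONSTRAINT PK_SalesOrder PRIMARY KEY NONCLUSTERED (SalesOrderId)
    );

    CREATE CLUSTERED INDEX CIX_SalesOrder_OrderDate
        ON dbo.SalesOrder (OrderDate, SalesOrderId);

    -- A query like this can read one contiguous range of the clustered index:
    SELECT SalesOrderId, CustomerCode
    FROM dbo.SalesOrder
    WHERE OrderDate >= '20121001' AND OrderDate < '20121101';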

    The non CL indexes "point" to what value? The PK or the CL index?

    Thanks,

    Pedro

    On a table that is stored as a clustered index, the non-clustered indexes use the key values of the clustered index as lookup values for the columns that are not some part of the non-clustered index. If a table is stored as a heap, an artificial identifier, called a row identifier, or RID, is used to track down those additional columns.
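
    A quick sketch (hypothetical names) of the two storage cases:

    -- Heap: no clustered index, so the nonclustered index carries an 8-byte RID
    -- and lookups show up as RID Lookup operators in the plan.
    CREATE TABLE dbo.InvoiceHeap (InvoiceId INT NOT NULL, CustomerId INT NOT NULL, Amount MONEY NOT NULL);
    CREATE INDEX IX_InvoiceHeap_CustomerId ON dbo.InvoiceHeap (CustomerId);

    -- Clustered table: the nonclustered index carries the clustered key (InvoiceId)
    -- instead, and lookups show up as Key Lookup operators.
    CREATE TABLE dbo.InvoiceClustered (InvoiceId INT NOT NULL PRIMARY KEY CLUSTERED, CustomerId INT NOT NULL, Amount MONEY NOT NULL);
    CREATE INDEX IX_InvoiceClustered_CustomerId ON dbo.InvoiceClustered (CustomerId);

    SELECT Amount FROM dbo.InvoiceHeap      WHERE CustomerId = 42;  -- Index Seek + RID Lookup
    SELECT Amount FROM dbo.InvoiceClustered WHERE CustomerId = 42;  -- Index Seek + Key Lookup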

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Grant Fritchey (10/10/2012)


    On a table that is stored as a clustered index, the non-clustered indexes use the key values of the clustered index as lookup values for the columns that are not some part of the non-clustered index. If a table is stored as a heap, an artificial identifier, called a row identifier, or RID, is used to track down those additional columns.

    So if we have a CI on a natural key, say for example 3 columns (invoiceType CHAR(2), invoiceNumber INT, productCode NVARCHAR(20)), all the NCIs would carry the CI key... those same values... a big "waste" of space considering that even a GUID is smaller and faster to look up than a 3-column key, right?!

    Thanks,

    Pedro



    If you need to work better, try working less...

  • The nvarchar(20) does make that a problematic clustering key, but it still depends.

    A three-column key is not slower to do lookups on if that's what you (almost) always specify in the WHERE clause.

    But, to avoid forty extra bytes, you should consider making the clus key just ( invoiceType, invoiceNumber ), assuming that is still reasonably selective. Typically narrowing it down to a few hundred rows (typical worst case) is good enough. You can create a nonclus index with all three columns if needed.
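
    A minimal sketch of that layout (hypothetical table name; since the clustered key is not unique, SQL Server uniquifies duplicate key values internally):

    -- Sketch only (hypothetical table name): a narrow, reasonably selective
    -- clustered key, with the full three-column key in a nonclustered index.
    CREATE CLUSTERED INDEX CIX_InvoiceLine
        ON dbo.InvoiceLine (invoiceType, invoiceNumber);

    CREATE NONCLUSTERED INDEX IX_InvoiceLine_Type_Number_Product
        ON dbo.InvoiceLine (invoiceType, invoiceNumber, productCode);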

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (10/10/2012)


    The nvarchar(20) does make that a problematic clustering key, but it still depends.

    A three-column key is not slower to do lookups on if that's what you (almost) always specify in the WHERE clause.

    But, to avoid forty extra bytes, you should consider making the clus key just ( invoiceType, invoiceNumber ), assuming that is still reasonably selective. Typically narrowing it down to a few hundred rows (typical worst case) is good enough. You can create a nonclus index with all three columns if needed.

    We can have the PK on an IDENTITY or a GUID (in case replication or synchronization is needed across other databases) to use in FKs, and use the most selective columns for the CI (as long as they aren't very wide; if they are, toss the biggest ones out...).

    Basically that's it? (In this case there's no narrow natural key to be the PK; if there was, it would be both the PK and the CI.)

    Just one more thing I just remembered... the CI doesn't have to be unique, since it doesn't have to be the PK, so if other NCI indexes carry the CI data to get to the page, won't that make a RID Lookup instead of a Key Lookup, since the CI isn't unique?

    Thanks,

    Pedro



    If you need to work better, try working less...

  • PiMané (10/10/2012)


    Grant Fritchey (10/10/2012)


    On a table that is stored as a clustered index, the non-clustered indexes use the key values of the clustered index as lookup values for the columns that are not some part of the non-clustered index. If a table is stored as a heap, an artificial identifier, called a row identifier, or RID, is used to track down those additional columns.

    So if we have a CI on a natural key, say for example 3 columns (invoiceType CHAR(2), invoiceNumber INT, productCode NVARCHAR(20)), all the NCIs would carry the CI key... those same values... a big "waste" of space considering that even a GUID is smaller and faster to look up than a 3-column key, right?!

    Thanks,

    Pedro

    It can be, but, again, I can't throw out the concept because I've got a slightly fat key. It really depends on the situation. And, primarily, the results of testing.

    As an aside, the CHAR(2) would be a really poor choice for the first column in a compound key. You'd want to use either the INT or the NVARCHAR(20) because they will generate more efficient statistics.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • PiMané (10/11/2012)


    ScottPletcher (10/10/2012)


    The nvarchar(20) does make that a problematic clustering key, but it still depends.

    A three-column key is not slower to do lookups on if that's what you (almost) always specify in the WHERE clause.

    But, to avoid forty extra bytes, you should consider making the clus key just ( invoiceType, invoiceNumber ), assuming that is still reasonably selective. Typically narrowing it down to a few hundred rows (typical worst case) is good enough. You can create a nonclus index with all three columns if needed.

    We can have the PK on an IDENTITY or a GUID (in case replication or synchronization is needed across other databases) to use in FKs, and use the most selective columns for the CI (as long as they aren't very wide; if they are, toss the biggest ones out...).

    Basically that's it? (In this case there's no narrow natural key to be the PK; if there was, it would be both the PK and the CI.)

    Just one more thing I just remembered... the CI doesn't have to be unique, since it doesn't have to be the PK, so if other NCI indexes carry the CI data to get to the page, won't that make a RID Lookup instead of a Key Lookup, since the CI isn't unique?

    Thanks,

    Pedro

    If the clustered index does not have a unique key value (and more importantly, you don't use UNIQUE when making that index), then SQL Server will add a value internally. This is called a uniquifier and is a 4 byte int value. This makes non-unique clustered indexes a little more problematic to juggle. Hence, a natural bias is built in to make the clustered index a primary key. But it's just a bias, not a requirement or even a best practice.
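
    A small sketch (hypothetical names) that makes the uniquifier visible through record sizes:

    -- Sketch only (hypothetical names): a non-unique clustered index, with two rows
    -- sharing the same key value so the second one gets a stored uniquifier.
    CREATE TABLE dbo.OrderLine (OrderId INT NOT NULL, LineNo INT NOT NULL, Qty INT NOT NULL);
    CREATE CLUSTERED INDEX CIX_OrderLine ON dbo.OrderLine (OrderId);  -- deliberately not UNIQUE

    INSERT INTO dbo.OrderLine (OrderId, LineNo, Qty)
    VALUES (1, 1, 10), (1, 2, 20);  -- duplicate clustered key value

    -- The min and max record sizes at the leaf level differ because only the
    -- duplicate row carries the stored uniquifier:
    SELECT index_level, min_record_size_in_bytes, max_record_size_in_bytes
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.OrderLine'), 1, NULL, 'DETAILED');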

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • One more if I may Grant...

    I've already discussed this point but if you can enlighten me a little more I'd be grateful 🙂

    Suppose I have an index I1 on C1, C2, C3 and C4. I want to search on C1 and C4.

    The execution plan says an index seek is made over I1... it seeks the C1 data, but then it has to read ALL the C2 and C3 values for that C1 and search C4 again... Is this equivalent to making a Seek, Scan, Seek inside the same index, or is just a simple Seek made, which would mean that I1 is a substitute for all the index combinations starting with the C1 column for those 4 columns?

    Thanks,

    Pedro



    If you need to work better, try working less...

  • PiMané (10/11/2012)


    One more if I may Grant...

    I've already discussed this point but if you can enlighten me a little more I'd be grateful 🙂

    Suppose I have an index I1 on C1, C2, C3 and C4. I want to search on C1 and C4.

    The execution plan says an index seek is made over I1... it seeks the C1 data, but then it has to read ALL the C2 and C3 values for that C1 and search C4 again... Is this equivalent to making a Seek, Scan, Seek inside the same index, or is just a simple Seek made, which would mean that I1 is a substitute for all the index combinations starting with the C1 column for those 4 columns?

    Thanks,

    Pedro

    A seek is a targeted scan anyway, so it's finding the starting point using the first column, then scanning through the values to find the rest. If you have a plan that you want to compare this on, take a look at the properties to see what it did.
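
    For example, a tiny repro (hypothetical names) where the plan shows a single Index Seek that carries both a Seek Predicate (on C1) and a residual Predicate (on C4):

    -- Sketch only (hypothetical names): index on (C1, C2, C3, C4), query on C1 and C4.
    CREATE TABLE dbo.SeekDemo (C1 INT NOT NULL, C2 INT NOT NULL, C3 INT NOT NULL, C4 INT NOT NULL);
    CREATE INDEX IX_SeekDemo ON dbo.SeekDemo (C1, C2, C3, C4);

    SELECT C1, C4
    FROM dbo.SeekDemo WITH (INDEX(IX_SeekDemo))
    WHERE C1 = 10 AND C4 = 99;
    -- In the Index Seek properties:
    --   Seek Predicates: C1 = 10   (positions the seek in the b-tree)
    --   Predicate:       C4 = 99   (residual filter applied to the scanned rows)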

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I created the following script for testing...

    CREATE TABLE idxTest (Id INT NOT NULL IDENTITY, Col1 INT, Col2 INT, Col3 INT, Col4 INT, CONSTRAINT pk_idxTest PRIMARY KEY CLUSTERED (Id));

    SET NOCOUNT ON;

    DECLARE @i INT = 1;
    WHILE @i < 100000
    BEGIN
        INSERT INTO idxTest (Col1, Col2, Col3, Col4) VALUES (@i/2, @i/3, @i/4, @i/5);
        SET @i = @i + 1;
    END

    CREATE INDEX idxTest_01 ON idxTest (Col1, Col2, Col3, Col4);
    CREATE INDEX idxTest_02 ON idxTest (Col1, Col4);

    DBCC FREEPROCCACHE;
    DBCC DROPCLEANBUFFERS;

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    PRINT '-------------------------------------';
    SELECT id FROM idxTest WITH (INDEX(idxTest_01)) WHERE Col1 BETWEEN 10 AND 200 AND Col4 > 60;
    PRINT '-------------------------------------';
    SELECT id FROM idxTest WITH (INDEX(idxTest_02)) WHERE Col1 BETWEEN 10 AND 200 AND Col4 > 60;
    PRINT '-------------------------------------';

    Execution plan says the 1st one takes 55% and 2nd 45%.

    Statistics IO on the 1st has:

    Table 'idxTest'. Scan count 1, logical reads 5, physical reads 3, read-ahead reads 2, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    The 2nd:

    Table 'idxTest'. Scan count 1, logical reads 2, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Perfectly normal since the idxTest_01 is wider and more reads are needed.

    After executing the 2nd time without DBCC commands the physical and read-ahead reads were gone.

    Statistics time was almost the same: the 1st took 1 ms and the 2nd 0 ms.

    Considering all the possible combinations of the 4 columns (16 indexes), is the performance gain worth it compared to only having 4 indexes with all 4 columns?

    Also, the number of records is never too big, since we have data paging with 20 or 40 records...

    Thanks,

    Pedro



    If you need to work better, try working less...
