How to overcome the identity column "jump 1000" issue

  • alejsek - Thursday, March 9, 2017 7:50 AM

    Eric M Russell - Thursday, March 9, 2017 7:10 AM

    I can see how an accountant, a data analyst, or a CIO with Aspbergers might be unnerved by the sudden presence of "missing" IDs in a financial report. For that reason, it may be a good idea to design the system from start so that identity columns are routinely padded with a random number of gaps , just so no one cultivates an expectation for perfect sequentiality.

    Intentionally wasting numbers just to cover bad design, really?

    There are still people arguying shat sequence where 1% numbers are gaps and where 98% are gaps are basically the same. They are basically the same, but effectively different.
    - Estimates of heuristics based on ranges go crazy when data density change by factor of 50x for part of the sequence. SQL server statistics are probably advanced enough to suck it up in most cases, but there is a lot more software out there in the world. And a lot of software simply can not be rewritten.
    - If You find out that You could change data type from int to small int, but You actually can not, because the sequence is two orders of magnitude inflated due to voids, it is not nice, You would need twice as much storage forever (500MB more in my case). Efficiency of Yout data storage is affected.
    - Now You can not use identity at tinyint at all, because server would eat up all numbers itselt. Using identity on smallint is badly affected in therms of available address space.

    The worst thing is, that this change is breaking unnecessarily, there are at least two easy ways to make it without significant negative impact. Firstly identity caching can be postponend until table has enought rows that it does not matter that much, for 1M rows table the skipping of 1k numbers is not big deal as it would not mean significant change in data distribution. Secondly identity can be automatically restored to the right value after server restart.

    What's "bad design" is an application that depends on a surrogate identifier to be sequential with no gaps. A surrogate key is just a static bookmark, not a ranking or timestamp.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Sean Lange - Thursday, March 9, 2017 8:12 AM

    If the difference in your case between storing an int and a smallint is 500MB you have too many rows for a smallint to be unique anyway so that argument doesn't hold water. If you are really worried about gaps and micromanaging values like this you should use a sequence instead of an identity.

    I thought SEQUENCEs where also cached, leading to possible gaps?
    https://msdn.microsoft.com/en-us/library/ff878058.aspx#Anchor_1

  • Chris Harshman - Thursday, March 9, 2017 9:43 AM

    Sean Lange - Thursday, March 9, 2017 8:12 AM

    If the difference in your case between storing an int and a smallint is 500MB you have too many rows for a smallint to be unique anyway so that argument doesn't hold water. If you are really worried about gaps and micromanaging values like this you should use a sequence instead of an identity.

    I thought SEQUENCEs where also cached, leading to possible gaps?
    https://msdn.microsoft.com/en-us/library/ff878058.aspx#Anchor_1

    Yes they are. I was commenting on the thought that they would be able to save 500MB by switching from int to smallint and making the argument that there shouldn't be gaps. Honestly if there is a system that has an issue with gaps in an auto generated number the flaw is not in the database having gaps, the flaw is in the design of the system. Gaps are perfectly normal and should be expected. I don't know how many times this topic comes up where people think they need to keep their identity values "clean and tidy". It makes no difference to the database and shouldn't to any application.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange - Thursday, March 9, 2017 9:50 AM

    Yes they are. I was commenting on the thought that they would be able to save 500MB by switching from int to smallint and making the argument that there shouldn't be gaps. Honestly if there is a system that has an issue with gaps in an auto generated number the flaw is not in the database having gaps, the flaw is in the design of the system. Gaps are perfectly normal and should be expected. I don't know how many times this topic comes up where people think they need to keep their identity values "clean and tidy". It makes no difference to the database and shouldn't to any application.

    agreed, sorry for my confusion

  • Chris Harshman - Thursday, March 9, 2017 10:00 AM

    Sean Lange - Thursday, March 9, 2017 9:50 AM

    Yes they are. I was commenting on the thought that they would be able to save 500MB by switching from int to smallint and making the argument that there shouldn't be gaps. Honestly if there is a system that has an issue with gaps in an auto generated number the flaw is not in the database having gaps, the flaw is in the design of the system. Gaps are perfectly normal and should be expected. I don't know how many times this topic comes up where people think they need to keep their identity values "clean and tidy". It makes no difference to the database and shouldn't to any application.

    agreed, sorry for my confusion

    No need to apologize good sir. 🙂 It is all an open and lighthearted discussion.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange - Thursday, March 9, 2017 10:09 AM

    Chris Harshman - Thursday, March 9, 2017 10:00 AM

    Sean Lange - Thursday, March 9, 2017 9:50 AM

    Yes they are. I was commenting on the thought that they would be able to save 500MB by switching from int to smallint and making the argument that there shouldn't be gaps. Honestly if there is a system that has an issue with gaps in an auto generated number the flaw is not in the database having gaps, the flaw is in the design of the system. Gaps are perfectly normal and should be expected. I don't know how many times this topic comes up where people think they need to keep their identity values "clean and tidy". It makes no difference to the database and shouldn't to any application.

    agreed, sorry for my confusion

    No need to apologize good sir. 🙂 It is all an open and lighthearted discussion.

    So long as we all agree that the identity column should be the clustering key.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell - Thursday, March 9, 2017 11:27 AM

    Sean Lange - Thursday, March 9, 2017 10:09 AM

    Chris Harshman - Thursday, March 9, 2017 10:00 AM

    Sean Lange - Thursday, March 9, 2017 9:50 AM

    Yes they are. I was commenting on the thought that they would be able to save 500MB by switching from int to smallint and making the argument that there shouldn't be gaps. Honestly if there is a system that has an issue with gaps in an auto generated number the flaw is not in the database having gaps, the flaw is in the design of the system. Gaps are perfectly normal and should be expected. I don't know how many times this topic comes up where people think they need to keep their identity values "clean and tidy". It makes no difference to the database and shouldn't to any application.

    agreed, sorry for my confusion

    No need to apologize good sir. 🙂 It is all an open and lighthearted discussion.

    So long as we all agree that the identity column should be the clustering key.

    It's a good choice, because it's (generally)* always increasing.

    But you could have a unique "natural" key that's always (generally) increasing and that could be the clustered index, in addition to being unique.

    Such as if you have an invoicing system, you are controlling the creation of the invoices.  So you can create an invoice number key that's not an identity, and in your creation process you can insure that it's (generally) increasing.  Maybe for some reason you want the first 4 digits of the invoice number to be the year.  Since you're creating invoice numbers in chronological order, your scheme is still generally increasing.

    But if you have a unique key that's based on name or geography or whatever number of other things, that won't be generally increasing, then the identity should be the clustered index.

    * - I say (generally) several times because there could be instances when an out-of-order key is inserted, and as long as it's relatively rare, it won't affect the clustered index too much.

  • More than 50% of the time, identity is the wrong clustering key, because there's another column(s) that is(are) more valuable overall as the one and only clustering key. The single most critical performance factor on the table shouldn't just be "defaulted" to by slapping an identity on every table.  Instead, the clus key should be very carefully chosen based on actual table usage.  Moreover, identity often corrupts the table design and key choices simply because other choices are not thoroughly thought through as they should be because "identity is going to be the key anyway".

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher - Thursday, March 9, 2017 11:53 AM

    More than 50% of the time, identity is the wrong clustering key, because there's another column(s) that is(are) more valuable overall as the one and only clustering key. The single most critical performance factor on the table shouldn't just be "defaulted" to by slapping an identity on every table.  Instead, the clus key should be very carefully chosen based on actual table usage.  Moreover, identity often corrupts the table design and key choices simply because other choices are not thoroughly thought through as they should be because "identity is going to be the key anyway".

    😉

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • ScottPletcher - Thursday, March 9, 2017 11:53 AM

    More than 50% of the time, identity is the wrong clustering key, because there's another column(s) that is(are) more valuable overall as the one and only clustering key. The single most critical performance factor on the table shouldn't just be "defaulted" to by slapping an identity on every table.  Instead, the clus key should be very carefully chosen based on actual table usage.  Moreover, identity often corrupts the table design and key choices simply because other choices are not thoroughly thought through as they should be because "identity is going to be the key anyway".

    but 60% of the time... it works every time.

  • ScottPletcher - Thursday, March 9, 2017 11:53 AM

    More than 50% of the time, identity is the wrong clustering key, because there's another column(s) that is(are) more valuable overall as the one and only clustering key. The single most critical performance factor on the table shouldn't just be "defaulted" to by slapping an identity on every table.  Instead, the clus key should be very carefully chosen based on actual table usage.  Moreover, identity often corrupts the table design and key choices simply because other choices are not thoroughly thought through as they should be because "identity is going to be the key anyway".

    For the type of work I'm doing, I'll have to say that's not true except for some very small lookup tables.  With that, I'll also say "It Depends".  Contrary to popular belief, the clustered index isn't a panacea of performance.  In fact, it can actually kill performance if you come to rely on it for wide, long tables.  There are even times where I simply use an identity column with a clustered index on it to seriously reduce page splits on insert for high volume large tables.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Thursday, March 9, 2017 4:17 PM

    ScottPletcher - Thursday, March 9, 2017 11:53 AM

    More than 50% of the time, identity is the wrong clustering key, because there's another column(s) that is(are) more valuable overall as the one and only clustering key. The single most critical performance factor on the table shouldn't just be "defaulted" to by slapping an identity on every table.  Instead, the clus key should be very carefully chosen based on actual table usage.  Moreover, identity often corrupts the table design and key choices simply because other choices are not thoroughly thought through as they should be because "identity is going to be the key anyway".

    For the type of work I'm doing, I'll have to say that's not true except for some very small lookup tables.  With that, I'll also say "It Depends".  Contrary to popular belief, the clustered index isn't a panacea of performance.  In fact, it can actually kill performance if you come to rely on it for wide, long tables.  There are even times where I simply use an identity column with a clustered index on it to seriously reduce page splits on insert for high volume large tables.

    I'm not basing it on "popular belief", I'm basing it on decades of tuning database tables.  Too wide of a table is definitely a factor, although that's usually a separate design issue.  I can't imagine why anyone would want to cluster an order_items table, say, on anything other than order_id first, and that's hardly a trivial table.  Likewise for a customer emails table, customer address table, etc..  Too much emphasis is placed on a few potential splits and not overall performance.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Yep... me too, Scott.  Decades of experience.  Just, apparently, different experience and that's the key here.  "It Depends".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Grant Fritchey - Thursday, March 9, 2017 6:25 AM

    I've built systems where I only  had the ID columns on every table. I'm that guy. Hopefully, I've learned better now. While I may not want to use a natural key as the PK on a table and it might be a wretched bad choice for the clustered index, I'm still going to have that constraint there to ensure I have clean data because I've done the dirty data thing and sincerely regret it.

    Several years ago I was involved in a project to create a 'single point of truth' at one of the banks. Hardest part - unifying multiple different sets of customer data, most without any unique constraint other than the pk.  The bank had no way of telling that a person with a cheque account was the same person who had a credit card, etc. Hence they regularly used to call people with one account to try and sell them other services, and annoy their customers because the person called already had that service.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster - Friday, March 10, 2017 12:01 AM

    Grant Fritchey - Thursday, March 9, 2017 6:25 AM

    I've built systems where I only  had the ID columns on every table. I'm that guy. Hopefully, I've learned better now. While I may not want to use a natural key as the PK on a table and it might be a wretched bad choice for the clustered index, I'm still going to have that constraint there to ensure I have clean data because I've done the dirty data thing and sincerely regret it.

    Several years ago I was involved in a project to create a 'single point of truth' at one of the banks. Hardest part - unifying multiple different sets of customer data, most without any unique constraint other than the pk.  The bank had no way of telling that a person with a cheque account was the same person who had a credit card, etc. Hence they regularly used to call people with one account to try and sell them other services, and annoy their customers because the person called already had that service.

    Practically all of my tables have both a surrogate key (a clustered identity column) and a natural key (unique non-clustered index), and the surrogate key is the primary key used for foreign key relationships. There may be occasions, something like an event log, where a timestamp combined with another column like DeviceID or PointOfSaleID can serve as a natural primary key without the need for a surrogate key. One issue with using a natural key for the primary key is that the columns composing the natural key may be mutable over time, especially if we're talking about demographic attributes. For example, people change their phone number or email address, or the person's date of birth was entered wrong and then needs to be re-entered correctly at a subsequent visit.

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

Viewing 15 posts - 31 through 45 (of 50 total)

You must be logged in to reply to this topic. Login to reply