An alternative to GUIDs

  • mark hutchinson (9/15/2011)


    There are 86400 seconds in a day, not 84400 that you stated in the article. You should probably change some numbers.

    Well spotted!

    If you aren't happy single, you won't be happy in a relationship.

    Remember, happiness comes from guitars, not relationships.

  • If you are using identity columns and are using multiple servers, you can set each server to use its own `bank of numbers`. E.g.

    Server1: identity seed = 00,000,000,000

    Server2: identity seed = 10,000,000,001

    Server3: identity seed = 20,000,000,001

    etc.

    or:

    Server1: identity seed = 1,000

    Server2: identity seed = 1,001

    Server3: identity seed = 1,002

    increment set to 1,000

    etc.

    Note the above is an example, you would of course plan on your range size, projected usage, etc. to make sure server number usage didn't crash into each other 🙂

    Then if a server goes down and the others keep going, there will not be a problem with clashes when you sync your databases back up again.

    Of course the above isn't a perfect solution, but it's an option, fast, simple, etc.

    M. 🙂

  • Phil

    Glad I could help. I also think I spotted a typo in the article, but upon a re-read, don't see it.

    Rather than recreating the UNIX epoch problem of a ~40 year clock cycle, use a combined key.

    2958465 is the long integer value for #12/31/9999# (9999-12-31 ISO date)

    This can easily fit into three bytes. If you make this the high order three bytes of the key, followed by the milliseconds for that day (since midnight), you get a sequential key good until the Microsoft date epoch. If you wanted more random data, you could make the date data the low order bytes.

    If security is a concern, you could encrypt the datetime stamp.

    Alternatively, you could go with a multi-column key with the different database server identification being one of the columns and a regular row identity value being the other (PK) column.

    btw...there's a good Y2K joke about Bill Gates being awakened from cryogenic suspension just in time to help society with the Y10K crisis.

  • Martin Bastable (9/15/2011)


    If you are using identity columns and are using multiple servers, you can set each server to use its own `bank of numbers`. E.g.

    Server1: identity seed = 00,000,000,000

    Server2: identity seed = 10,000,000,001

    Server3: identity seed = 20,000,000,001

    etc.

    I think the point of this article is to avoid doing something like this. What happens if you happen to run out of numbers within a particular bank, or use a partitioning scheme that eventually exhausts the available numbering space? With the scheme that the article proposes, you know that you have a large timeframe in which you don't have to worry about numbering conflicts.

    Jay Bienvenu | http://bienv.com | http://twitter.com/jbnv

  • mark hutchinson (9/15/2011)


    Phil

    Rather than recreating the UNIX epoch problem of a ~40 year clock cycle, use a combined key.

    Then again, aren't 128 bit bigints on their way already for some next version of SQL Server? This should take care of another 60x 2^64 years or am I too easy on the math here?

    Anyway, the first thing I thought when I read this article was sequential guids, as some fellow posters already posted but I couldn't find a reference to a usable implementation so here's what I used in my previous company. All honors for this go to Leonid Shirmanov who put it on the web: http://www.shirmanov.com/2010/05/generating-newsequentialid-compatible.html

    Leonid's example uses the UuidCreateSequential function from the Windows API: the same as SQL Server uses when you default a column's value to NEWSEQUENTIALID(). So it's rock solid.

    The example from Leonid's site helped me to reduce a busy server's CPU from averaging 40% to averaging 3%. Page splits went down and performance went up. You can put it in a CLR UDF, and/or have your apps use it in code, or put an assembly into the GAC so you can use it anywhere.

    The only downside is that this still uses 16 byte GUIDs as opposed to 8 byte integers, the upside is that a client app doesn't need to call an external service.

  • I have to agree with SSC Eights. Why not use a composite key?

    For many years, when storage was expensive, single fields were often used to depict mutiple entities and there continue to be numerous examples of this in accounting and manufacturing. Similar stratagy was applied when DOS limited file naming to 8 characters.

    Why go back there? What's wrong with 2 fields?

  • sqlservercentral-622386 (9/15/2011)


    If you want to avoid the perfromance problems of using GUIDs as keys, and you do your security properly, then use sequential GUIDs which are already built into SQL Server for preciseley this purpose. If you don't want to generate the ids in-database, the same funciton is available on the client, or app layer, in windows api.

    SQL: NEWSEQUENTIALID()

    Client: UuidCreateSequential()

    The problem with UuidCreateSequential in a distributed environment is you will still get fragmentation. Your data will all cluster together according the client that generated the id instead of always appending to the end of the index. That's why I wrote my own Comb algorithm, to allow distributed ids that reduce fragmentation. You'll always have some fragmentation if you're generating your ids on the client because you can't guarantee the order they will be inserted in because you're looking at a race condition. However, in my opinioin UuidCreateSequential is only useful when you have a single source that is generating the ids.

  • Thanks for putting this article together.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • The question of fragmentation is an interesting one. Assuming we don't use a GUID as the clustering key, what's the issue here? It is in the nature of non-clustered indexes to fragment - which is why FILLFACTOR exists.

    Queries against a GUID column are overwhelmingly likely to be singleton lookups - which aren't affected by fragmentation. I can't think of a scenario where one would want to range-scan GUIDs - and it's disk read-ahead performance for (partial) scans that can suffer if fragmentation is high.

    As far as size is concerned, sure a GUID is 16 bytes - the same size as a single NCHAR(8) column, or two BIGINT columns - but that's really not that big. Just using row-versioning adds 14 bytes per row and we don't hear too many complaints about that.

    Like everything, GUIDs have their advantages and disadvantages. There will certainly be edge cases where it makes sense to invent a new keying system, but the standard range of tools seem perfectly adequate for most scenarios.

  • To answer a few points raised in this discussion thread: -

    Good spot on the 84,400. It is a typo but the figure for 4 and 40 years respectively are correct.

    This illustrates why it is a good idea to write up the full experiment and show your working as the entire thing is up for peer review.

    Good spot also on the time synchronisation. This suggests we need three things

    1. An understanding of when the clocks get resynced and how.

    2. A mechanism for pausing the key generator in the event of it getting wound back

    3. An alerting mechanism should the server clock get too far our of sync with the master clock which indicates you have a hardware problem on your server.

    If obfuscation of the key is required then this will only apply when the key is exposed by the application and then it is the responsibility of the application to present the key in the obfuscated manner. It shouldn't affect the actual value of the key.

    I'd 2nd the comment that you shouldn't have to develop a coding solution to an infrastructure problem but I don't think standing back and saying "not my problem guv" is the answer either.

    I've been in the situation where a large legacy estate had to be shifted to separate data centres in a very short period of time. The only way this could be achieved in the timescales was by adding a ServerID field to every non-reference table. The stored procs had to do a lookup of the @@SERVERNAME against a master table of servers to determine the serverID. Not nice, and it broke when the server naming policy changed causing horrendous difficulties.

    The 14th byte does not have to be used to signify servers, it merely signifies the key generator being used and in this case it is 1 of 10. In my case syncing up the data centres in real time is a challenging proposition.

    Geoclustering is one option but it isn't a panacea for all ills.

    Mirroring has caused us problems with IO for busier DBs and I'm not sure how happy I would be putting high transaction loads across a mirror with a few hundred miles between data centres.

    It is a bit off topic but I'm firmly of the opinion that you have to design your resilience, DR and failover solution as a foundation stone. It isn't something you can bolt on later.

    Why don't I use a composite key?

    Because the data model gets very complicated very quickly and actually the component parts are not of any interest. They serve solely to provide a mechanism for generating a key.

    The purpose of the key generator is to provide as generic a generator as possible. It doesn't rely on a particular DB platform and as we have a hetrogenous environment it opens up a range of possibilities.

    We did consider having multiple DB servers all incrementing by 1000 but starting from different seed points. It works well for most circumstances but you have to make sure that you don't wreck the seed value accidentally.

    Realistically you can push a vast amount of traffic through SQL Server before you need to worry about an external key generator. either that or you move to a cloud solution where you have lots of small independent nodes all doing their own thing.

    If you cluster a GUID then you deserve everything that comes your way. Even if you don't cluster them you get huge fragmentation issues they just don't impact as great as when they are clustered.

    I found out after writing this article that a number of systems use a very similar routine for key generation which is extremely reassuring as external validation always is.

  • round trip of scope_identity()???? it's a return value of your method call. Please don't feed the bears.

  • gahayden (9/15/2011)


    I have to agree with SSC Eights. Why not use a composite key?

    For many years, when storage was expensive, single fields were often used to depict mutiple entities and there continue to be numerous examples of this in accounting and manufacturing. Similar stratagy was applied when DOS limited file naming to 8 characters.

    Why go back there? What's wrong with 2 fields?

    Everything's wrong with two columns: You see, it's not an issue of storage, but the performance impact of using suboptimal datatypes. Using more than one column compounds the problem as both columns, if used as a primary key, have to be present in foreign key columns referencing the primary key. Since there tend to be many more FK records than PK records in your average database structure, the bloat compounds further, impacting performance as it does so.

  • Since there tend to be many more FK records than PK records in your average database structure, the bloat compounds further, impacting performance as it does so.

    Wow, thank God someone finally brought that up! I've dealt with this situation before as well with many different "remote" databases synchronizing into 1 central database with... composite keys. The only thing that could be done was to roll our own synchronization app. The central store had to "re-key" EVERYTHING to it's own server ID and then send those keys back out to the remote servers. Fun!

  • What about this scenario:

    Site A is functioning well and persists orders 1 to 10

    Site B is passive but kept up-to-date so effectively it has copies of orders 1 to 10.

    Lightning strikes after order 10 is placed on Site A but before it's copied to Site B.

    Site B becomes active while repairs take place on Site A

    Because Order 10 is not available from Site B User has to resubmit it.

    It happens not straight away, so the which was Order 10 on Site A gets number 12 on Site B.

    At the same time another, totally irrelevannt order takes Number 10 on Site B.

    Now, what will happen when Site A comes back to life?

    GUID will prevent orders "No.10" from overlapping, but same order with No.10 on Site A and No.12 on Site B will still be duplicated.

    There will be 2 identical deliveries instead of one, 2 following invoices instead of one, missing payment, penalties, disputes, etc.

    Whole idea of using GUID or any other kind of auto-generated "across different instances" identifiers is faulty.

    Unique entries of orders or any other business objects must be identified by natural keys defined by business logic, nothing else.

    IDENTITY is perfectly good as an internal identifier within a database.

    Just don't include it into replication process. Let target database generate its own IDs'.

    _____________
    Code for TallyGenerator

  • Sergiy - It isn't the job of programmers/developers to reinvent the wheel of infrastructure: There are a number of infrastructure topologies that already exist to deal with datacentre destruction scenarios, including database mirroring, log shipping and clustering.

    In the assessment of any disaster recovery (DR) strategy, there are trade-offs that have to be made. These include cost, service levels of how long it should take to failover, how much data it is tolerable to lose and how much of a performance hit one is willing to take if one opts for a synchronous data replication topology (be that at SAN level, synchronous mirroring or otherwise) if there is zero tolerance for data loss. Remember also that a DR strategy is one that should be implemented only when there is a total destruction of the primary site environment, so a well run operational department will probably only every implement the plan as part of a DR training exercise.

    Again, as smart as programmers think they are, the people that design these products are much more familiar with the issues surrounding DR (including the need for standardisation of DR plans across systems) than programmers are and they see DR in the context of enterprise operational procedures, not as custom processes for individual applications.

    In short - if you went into any well-run DBA team and told them you had a different process to implement for DR than the other hundred or so systems they had to support, they'd tell you where to go.

Viewing 15 posts - 31 through 45 (of 64 total)

You must be logged in to reply to this topic. Login to reply