Artificial key comparison

  • A client’s database is highly normalised (master/details) and uses natural composite keys. The next move will be to artificial or surrogate keys to improve the speed of joins. Ignoring Identity (int or bigint) for the time being, I would like to find out if anyone has done any benchmark tests for keys implemented as a 16-byte varchar (variable length) vs a 16-byte char (fixed length) vs a 16-byte uniqueidentifier (ROWGUID). Thanks for any input.

     

    Cheers,

     

    Win

  • The composite key is always stored in the index files as fixed length, as far as I know (someone please correct me if I am wrong). Variable length fields used as part of the key are padded out with blanks when the key is stored in the index, i.e. the key stored in the index files for a composite key made up of a varchar(4) field and a varchar(12) field will always be 16 characters long, the same as for a composite key made up of char(4) and char(12) fields. Therefore any 16-character key is pretty much the same as any other 16-character key as far as performance goes.

    The important thing to remember is that for speed the key for the clustered index should be as small as possible, as it is used as the pointer to the data for all the non-clustered indexes. The non-clustered indexes are made up of key values and lots of pointers to the clustered index keys (assuming there is a clustered index on each table that has non-clustered indexes, which is generally recommended). Therefore, the smaller the clustered index key, the smaller the non-clustered indexes are, speeding up the index reading.
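
    Just to give a rough illustration (the table and column names here are made up): with a narrow identity as the clustered key, every non-clustered index entry only carries the 4-byte int as its pointer, whereas a wide natural key would be repeated in every non-clustered index.

        CREATE TABLE OrderDetail (
            OrderDetailID int         IDENTITY(1,1) NOT NULL,   -- narrow 4-byte clustered key
            OrderCode     varchar(20) NOT NULL,
            ItemCode      varchar(20) NOT NULL,
            Quantity      int         NOT NULL,
            CONSTRAINT PK_OrderDetail PRIMARY KEY CLUSTERED (OrderDetailID)
        )

        -- Each entry in this non-clustered index stores the 4-byte clustered key
        -- as its row pointer, rather than a wide composite natural key.
        CREATE NONCLUSTERED INDEX IX_OrderDetail_ItemCode ON OrderDetail (ItemCode)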

    Remember also that the data is stored in the clustered index (if you have a clustered index on a table) and the index pages get 'split' when new data records are added in the middle of the existing range of key values, causing your tables to get fragmented. If your system is always adding lots of new records, try to make sure that the clustered index key is organised so that new records have key values at the end of the current range of key values. This saves you having to constantly reindex the tables to keep the access fast.
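
    If you want to see how badly split a table has become (the table name here just carries on from the sketch above), something like this reports the scan density and logical fragmentation:

        -- Low scan density / high logical fragmentation on the clustered index
        -- usually means a lot of page splits have happened since the last reindex.
        DBCC SHOWCONTIG ('OrderDetail')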

    Hope this helps

    Peter

  • If what you are saying is correct, SQL Server is not optimised to handle a ROWGUID any better than any other 16-byte datatype. The integer Identity looks more attractive. It avoids index fragmentation because the numbers keep rising - not necessarily contiguous, but still rising.
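
    Just to convince myself (a throwaway sketch, the table and column names are invented): identity values always land at the end of the index, while NEWID() values land anywhere, which is what would chew up a clustered index.

        CREATE TABLE KeyTest (
            IdentKey int              IDENTITY(1,1) NOT NULL,     -- always ascending
            GuidKey  uniqueidentifier NOT NULL DEFAULT NEWID()    -- effectively random
        )

        INSERT INTO KeyTest DEFAULT VALUES
        INSERT INTO KeyTest DEFAULT VALUES
        INSERT INTO KeyTest DEFAULT VALUES

        -- IdentKey comes back 1, 2, 3...; the GuidKey values are in no useful order,
        -- so a clustered index on GuidKey would see inserts scattered all over the index.
        SELECT IdentKey, GuidKey FROM KeyTest ORDER BY IdentKey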

    Thanks,

    Win

  • Just some thoughts....

    There is no black or white / right or wrong / true or false when it comes to indices / strategy / performance etc... It's all a balancing act on a living, breathing entity - thus what is "good" today might turn "bad" tomorrow.

    There is no such thing as placing the good indices once and never looking again. Indices also need maintenance (the periods vary, though). Deciding where to place an index and what it should look like is always a trade-off, choosing the "best". (All this assuming fairly "normal" OLTP systems.)

    What I was pondering was the original question posed:

    The next move will be to artificial or surrogate keys to improve the speed of joins.

    I'm fairly sceptical that if you have "bad" performance on the old keys, just implementing another key would somehow magically solve all the "bad performance"... Indexing is, in my experience, a bit more involved than that...

    I was also wondering about the volumes involved (rowcounts in tables, number of rows "normally" returned, etc). Sizes and volumes do matter in the overall Grand Formula of The Black Art of Indexing.

    Assuming there is "bad" performance, then rarely is indexing the only source of "badness"..

    /Kenneth

     

     

  • Note to the coordinator of the thread. I just wrote a lengthy response to Kenneth's comments. I pressed the preview button and lost the whole opus. This is the second time this has happened to me at this website. I wonder what is going on?

    Win

  • Win,

    It happens on some days, so I copy my response to the clipboard now, every time, before I press Post Reply. It could be because the site is busy at certain times of the day and response times are timing out.

    Yes, the general consensus is that it is very inefficient to use a GUID as a clustered index: one, because it is so big, and two, because it causes the table to get very fragmented very quickly.

    An Identity ID is a good idea for a clustered index if there are a large number of inserts happening all the time (as in an OLTP system). But if the inserts are not continuous and numerous, and the data is generally accessed in the same order, as a range of values (i.e. you generally get the detail records for one master record and they are always wanted in the same order), then the clustered index should be in the order that the records are generally accessed. It means you have to reindex the table or defrag the index more often, but the access speed will be quicker (accessing data using the clustered index is faster than accessing it via a non-clustered index). It is a trade-off.
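
    For example (just a sketch, the table and column names are invented): if the detail rows are almost always fetched for one master record and in the same order, the clustered index can follow that access path instead of an identity.

        CREATE TABLE SampleDetail (
            MasterID  int      NOT NULL,
            SampleSeq int      NOT NULL,
            ReadingAt datetime NOT NULL,
            Value     float    NOT NULL
        )

        -- Cluster on the way the rows are normally read: all the details for one
        -- master record, in sequence. A range read becomes one ordered scan of
        -- adjacent pages, at the price of more page splits (and more defragging)
        -- when new rows arrive in the middle of the range.
        CREATE UNIQUE CLUSTERED INDEX IX_SampleDetail ON SampleDetail (MasterID, SampleSeq)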

    Hope this helps

    Peter

  • Thanks guys. This is a cut and paste from notepad. Twice burnt....

  • This database forms the back-end to a commercial data management system. The front-end is written in C++ and sits on top of either an Oracle or SQL Server implementation. The front-end application provides its own replication mechanism to merge data collected in the field into regional databases. This replication can also be used between Oracle- and SQL Server-based implementations. For this reason we have to be very careful about the implementation of keys. Natural keys have worked well in this respect. There is also the potential that regional databases may be replicated in the future into central databases, using either the application's mechanism or RDBMS-based replication.

    The whole system is more of a warehouse than a busy OLTP environment, apart from regular data insertion in relatively large blocks. The main tables accessed by users (as opposed to master tables) can hold multi-million rows. Users have been known to make the occasional large-scale change (of the order of 100,000 rows) at the back-end level, which can cascade into more updates. The more important part is the viewing of data. Users can define their own grid-based views by mixing virtual fields from the details tables with real fields from the master tables. These views are editable from the user perspective. Data are regularly inserted into these views. In the current data model, natural primary clustered keys can range from a single varchar (max 30) to 5-column compound keys consisting of varchar (max 20) and float. The performance is not too bad, considering the work involved under the hood.

    The data model has truly reached its use-by date. For the new model we are looking for the best implementation of keys, amongst other things, keeping in mind that the data model will be implemented in two different RDBMSs. In comparison to the existing keys, a single-field 16-byte uniqueidentifier (at least in SQL Server) looks decidedly better.
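
    Something along these lines is what we have in mind for the new keys (illustrative only, the table and column names are not the real ones): a single uniqueidentifier generated by the database, replacing the multi-column natural key.

        CREATE TABLE Measurement (
            MeasurementID uniqueidentifier NOT NULL DEFAULT NEWID(),  -- single surrogate key
            SiteCode      varchar(20)      NOT NULL,
            ParameterCode varchar(20)      NOT NULL,
            Depth         float            NOT NULL,
            CONSTRAINT PK_Measurement PRIMARY KEY (MeasurementID)
        )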

    Cheers,

    Win

  • It sounds like you can do a dbcc dbreindex on all the tables or a dbcc indexdefrag after loading data, in which case you should optimise the keys for reading. But if you are using uniqueidentifiers, you can make them the primary key provided you make it non-clustered. The clustered index should perhaps be an identity field in your case. This will speed up data loading (no splitting of data pages while loading, only index page splitting) and access to the data using any other non-clustered index you have (the non-clustered indexes will be as small as they can be).
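
    A sketch of what I mean, reusing your Measurement example from above (your real column names will differ): keep the uniqueidentifier as the primary key for replication, but make it non-clustered, and put the clustered index on a narrow identity instead.

        CREATE TABLE Measurement (
            RowSeq        int              IDENTITY(1,1) NOT NULL,     -- narrow, ever-increasing
            MeasurementID uniqueidentifier NOT NULL DEFAULT NEWID(),   -- stable key for replication
            SiteCode      varchar(20)      NOT NULL,
            CONSTRAINT PK_Measurement PRIMARY KEY NONCLUSTERED (MeasurementID)
        )

        -- New rows always go at the end of the clustered index, and every
        -- non-clustered index (including the primary key) only has to carry
        -- the 4-byte RowSeq as its row pointer.
        CREATE UNIQUE CLUSTERED INDEX IX_Measurement_RowSeq ON Measurement (RowSeq)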

    dbcc dbreindex - tables are locked and unavailable, but best option if you can find a time when users don't need the system.

    dbcc indexdefrag - may not always completely work if some table extents are in use, but can be done while users are working. However, it can cause performance problems.
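
    Typical usage looks something like this (the table and index names are just carried over from the sketches above):

        -- Offline rebuild of all indexes on the table (the table is locked while it runs).
        DBCC DBREINDEX ('Measurement')

        -- Online defrag of one index (current database, table, index); can run while
        -- users are working, but watch the load it generates.
        DBCC INDEXDEFRAG (0, 'Measurement', 'IX_Measurement_RowSeq')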

    I am developing/supporting a system that consolidates data from lots of people working at home (using an Access database) into a SQL Server database for head office processing. We use a composite key (location code, incrementing sequence no.), but one of our developers decided to try using the uniqueidentifier for some of the tables. Strangely, it seems that the uniqueidentifier isn't always unique when generated on an Access database by VB6. This isn't helped by the remote system not stopping the users from reusing old keys to enter new data. This is causing me no end of problems at the moment.

    I never used uniqueidentifiers when I worked on Oracle so I can't advise about that. Except Oracle don't use them for any of their own systems I've seen!

    Good luck.

    Peter

  • crap! Now my lengthy answer just vanished when pressing 'post reply'.. AGH!

    Oh, well... Anyway, it was some thoughts about what a move from a natural key to an artificial key may or may not solve from a performance perspective.

    Short example - tableA ( ssn, name )

    ssn is PK and a natural key.

    If you change tableA to use an artificial key, it may look like this;

    tableA ( guid, ssn, name )

    Here guid is declared as the PK, but... in the old table, the ssn PK prevented an ssn from being entered twice. How can we still enforce this when guid is the PK? You need to place a unique constraint on ssn, or else you may have dupes in there.

    The end result in this little example is that we now have an artificial key, but at the cost of two indices instead of one - i.e. worse performance if we only look at the size of the indices placed on the table.
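
    In T-SQL the two variants might look like this (the column types are just guesses for the sake of the example):

        -- Original: the natural key alone keeps ssn unique (one index).
        CREATE TABLE tableA (
            ssn  char(11)    NOT NULL PRIMARY KEY,
            name varchar(50) NOT NULL
        )

        -- With the artificial key: two indices are now needed to enforce the same rule.
        CREATE TABLE tableA_artificial (
            guid uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY,
            ssn  char(11)         NOT NULL UNIQUE,   -- still has to be enforced separately
            name varchar(50)      NOT NULL
        )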

    Anyway, I know it's not the whole truth, but it may be easy to forget in the overall equation...

    /Kenneth

     

  • Thanks for your comments - this is not the first time someone has made a comment about the non-uniqueness of ROWGUIDs.

    Cheers,

    Win
