To those who don’t know me, “Natural Born Killers (The Series)” sounds like a very strange choice for a SQL Server blog title, and to be fair, it is. Those who do know me will know that I have been presenting a session this year titled “Natural Born Killers, performance issues to avoid” around the UK. The “Series” element of the title is there because I have decided to serialise the session for the benefit of those who have been unable to see it thus far.
The plan is to release one topic from the presentation each day of SQL Relay 2013. Today is the fourth day of SQL Relay; this leg is being held in Norwich. If you have registered for the Norwich or Cardiff events you will be able to see this presentation live, so you might want to read this after the event as a reminder of everything that was covered.
This section of the talk is around the design of indices and what I like to call the golden rules for good design. I have previously done a session just about indices which was recorded for Quest Software. You can watch that here and here (they are slightly different as one was recorded for NAM, the other for EMEA). I also wrote a post about Unused Indices way back as part of T-SQL Tuesday #10: Unused Indexes (Indices) which may be of interest to you.
In the opening of this section I reference a term I have coined called “DWI” which stands for “Data, Workload and Internals”. To create a good indexing strategy you must have a firm grasp of these three areas. Obviously this is just a file on a server somewhere that you are reading right now so there is no way that I can comment on the data you have or the workload that is happening on your server. Hopefully if you read through this you might find out or reaffirm something about the internals aspect.
Golden Rules for Clustered Indices
Before we start this section I should point out that this was written before CTP1 was released, and as such the rules I am stating here (IMO, of course) may not reflect best practices for In-Memory OLTP tables, aka Hekaton.
Rule 1: Narrow
Your clustered index should be as narrow as you can make it. Yes, there are some circumstances where you might not want this to be the case, but on the whole narrower is better. What do I mean by narrow? Ideally the key should be as small as possible, only one or possibly two attributes. The reason for this is that the clustered key will reside inside your nonclustered index. Exactly where will depend on the type of nonclustered index you have. Needless to say, if these clustered keys are larger than they need to be then this “wasted” space is replicated in every nonclustered index on that table. This has knock-on effects for your buffer cache, reading in physical IO, larger backups and heaven forbid a longer recovery process in the event of a disaster.
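To make the cost concrete, here is a hypothetical sketch (the table and column names are my own invention, not from a real system) of how a wide clustered key silently inflates every nonclustered index, versus a narrow surrogate:

```sql
-- Hypothetical example: the clustered key rides along in every nonclustered index.
CREATE TABLE dbo.OrdersWide
(
    OrderRef    varchar(50) NOT NULL,   -- wide clustered key: up to 50 bytes...
    CustomerRef varchar(50) NOT NULL,   -- ...plus up to another 50 here
    OrderDate   date        NOT NULL,
    CONSTRAINT PK_OrdersWide PRIMARY KEY CLUSTERED (OrderRef, CustomerRef)
);

-- Every row of this nonclustered index silently carries OrderRef + CustomerRef
-- as the pointer back to the base row.
CREATE NONCLUSTERED INDEX IX_OrdersWide_OrderDate
    ON dbo.OrdersWide (OrderDate);

-- A 4-byte surrogate keeps that hidden per-row cost to a minimum;
-- the same nonclustered index now carries only a 4-byte pointer per row.
CREATE TABLE dbo.OrdersNarrow
(
    OrderID     int IDENTITY(1,1) NOT NULL,
    OrderRef    varchar(50) NOT NULL,
    CustomerRef varchar(50) NOT NULL,
    OrderDate   date        NOT NULL,
    CONSTRAINT PK_OrdersNarrow PRIMARY KEY CLUSTERED (OrderID)
);
```

Multiply that per-row saving by the row count and the number of nonclustered indexes and the difference adds up quickly.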
Rule 2: Static
Your clustered key should be as static as possible. I mentioned in rule 1 that the clustered key is stored inside the nonclustered index. This is a pointer back to your base record and is needed if the index you are querying does not “cover” the entire query. If this clustered key were to change value then that record in every nonclustered index on that table would also need to be updated. This is an unnecessary overhead.
The mantra for rules 1 and 2 is: problems in the design of the clustered key will be exacerbated in every nonclustered index.
Rule 3: Progressive
I normally get some blank looks here during presentations. What I mean by this is that the key should be one that ascends, such as an identity or a sequence. The reason here is one of fragmentation. If I cannot predict where the next record is to be inserted, because it uses a GUID or something like a surname, then I am more likely to suffer from fragmentation and very expensive page split operations. Another benefit is that the page is more likely to be in cache if the key is ascending in nature, which means I might not need to wait for an expensive I/O operation before inserting my data.
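You can see the difference for yourself with a quick (hypothetical) experiment: two tables, one clustered on a random GUID, one on an ascending identity, then compare fragmentation after loading some rows:

```sql
-- Hypothetical illustration: random vs ascending clustered keys.
-- NEWID() scatters inserts across the index, forcing page splits;
-- an IDENTITY always inserts at the logical end of the index.
CREATE TABLE dbo.RandomKey
(
    ID      uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY CLUSTERED,
    Payload char(200)        NOT NULL DEFAULT 'x'
);

CREATE TABLE dbo.AscendingKey
(
    ID      int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    Payload char(200)         NOT NULL DEFAULT 'x'
);

-- After loading a few thousand rows into each, compare fragmentation:
SELECT OBJECT_NAME(object_id) AS table_name,
       avg_fragmentation_in_percent,
       page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED')
WHERE object_id IN (OBJECT_ID('dbo.RandomKey'), OBJECT_ID('dbo.AscendingKey'));
```

On a typical run you would expect the GUID-keyed table to show far higher fragmentation and a larger page count for the same number of rows.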
Rule 4: Unique
Your clustered index should be unique. If your key is not unique then SQL Server will add a hidden uniquifier for you, and this is bad news. To be fair, it could be worse: the uniquifier is an integer, so it takes up an extra 4 bytes of space wherever the key value is duplicated. Now, here comes the kicker. SQL Server does not store the last uniquifier value used for a duplicate key, which means it needs to traverse the existing duplicates for that key to find the last one and then add one to it. Pretty nasty stuff.
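A small, hypothetical demonstration (table and index names are mine): create a non-unique clustered index, insert duplicates, and inspect the record sizes, where the duplicated keys show the extra hidden bytes:

```sql
-- Hypothetical example: a non-unique clustered index forces SQL Server
-- to add a hidden 4-byte uniquifier to every duplicated key value.
CREATE TABLE dbo.Citizens_NonUnique
(
    Lastname  varchar(50) NOT NULL,
    Firstname varchar(50) NULL
);

-- Non-unique clustered index: duplicates of Lastname get a uniquifier.
CREATE CLUSTERED INDEX CIX_Citizens_Lastname
    ON dbo.Citizens_NonUnique (Lastname);

INSERT dbo.Citizens_NonUnique VALUES ('Smith', 'John'), ('Smith', 'Jane');

-- DETAILED mode exposes record sizes; rows with duplicated keys
-- come out larger than the first occurrence of each key.
SELECT index_level, min_record_size_in_bytes, max_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.Citizens_NonUnique'), NULL, NULL, 'DETAILED');
```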
Rule 5: Fixed Width
Where possible I would recommend going with a fixed-width data type. The reason for this is the overhead of storing variable-length characters. I’m not going to go over this again as I covered it in Day 2 – Table Design. You may also find some strange goings-on with regards to memory grants for queries if the length of the data is very variable.
Rule 6: Not NULL
This again is all about space saving and keeping an eye on fragmentation. If the key contains a NULLable column, a NULL bitmap has to be created for each index record. If the attribute doesn’t have to be NULL, mark it as NOT NULL.
For rules 5 and 6, please use your common sense. I am not suggesting that if you have a variable-length column where 99% of the data is less than 5 characters and the remaining 1% is 50 characters, the index should be created as fixed width. What I am suggesting is that you evaluate your data, take a look at what is in the attribute, and make the judgement call as to which would save the most space and thereby improve performance.
Most people find that reading rules isn’t the easiest way to learn something, so let’s put it all into context with an example and a bad example at that.
Let’s assume that we are designing some kind of American government database where we have to have a record for each member of the population. According to the internet search I just did, there were 313.9 million people in 2012 (I thought it would be more). For ease of calculations on my part, let’s say 315 million. We’re going to create this theoretical table called Citizens and we’re going to have the following attributes as our clustered primary key:
- Lastname varchar(50) NOT NULL
- Firstname varchar(50) NULL
- MiddleInitials varchar(20) NULL
If we were to revisit our hidden overheads we could foresee that there would be some duplicates. For the sake of ease of calculation, let’s say 30 million duplicates, each requiring a 4-byte uniquifier. We then have 3 attributes which are of variable length; this equates to 2 bytes of overhead per attribute in the column offset array plus another 2 bytes for the variable-length column count. That’s 8 bytes per row. We then need to account for the overhead of the NULL attributes.
This should total up to (30,000,000 * 4) + (315,000,000 * 9) = 2,955,000,000 bytes of hidden overhead, roughly 2.75 GB.
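If you want to sanity-check that arithmetic, SQL Server will happily do it for you (the 9 bytes per row here bundles the variable-length overhead plus the NULL bitmap cost from the paragraph above):

```sql
-- Sanity-checking the overhead figures:
-- 30 million duplicates x 4-byte uniquifier, plus 9 bytes of
-- variable-length and NULL overhead on all 315 million rows.
SELECT (CAST(30000000 AS bigint) * 4)
     + (CAST(315000000 AS bigint) * 9) AS hidden_overhead_bytes;
-- 2,955,000,000 bytes, roughly 2.75 GB
```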
This is made worse by the fact that this overhead will also be in every nonclustered index. That’s why I personally prefer to use a surrogate key such as an identity or a sequence.
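For comparison, here is a hypothetical sketch of the same Citizens table designed with a surrogate key instead. The names become ordinary columns, and the clustered key is narrow, static, ascending, unique, fixed width and NOT NULL, ticking all six rules:

```sql
-- Hypothetical redesign: a 4-byte surrogate as the clustered primary key.
CREATE TABLE dbo.Citizens
(
    CitizenID      int IDENTITY(1,1) NOT NULL,  -- narrow, static, ascending, unique
    Lastname       varchar(50) NOT NULL,
    Firstname      varchar(50) NULL,
    MiddleInitials varchar(20) NULL,
    CONSTRAINT PK_Citizens PRIMARY KEY CLUSTERED (CitizenID)
);
```

Every nonclustered index on this table now carries a 4-byte pointer per row rather than up to 120 bytes of names plus uniquifier and variable-length overhead.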
Golden Rules for Nonclustered Indices
Rule 1: Have an optimal clustered key
Hopefully the golden rules for clustered indices have made the reason for this rule pretty clear.
Rule 2: Not Narrow
In complete contrast to the rules in the clustered index section, here you really need to try and make sure that your indices cover as many of your queries as possible. If your queries are not satisfied by your index then they are likely to require a key lookup, which means a call to the base table. As luck would have it, tomorrow’s topic is all about Key Lookups; it’s almost as if I planned it!
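As a quick illustration, assuming the theoretical Citizens table from the earlier example, the INCLUDE clause is one way to cover a query without bloating the index key itself:

```sql
-- Hypothetical example: covering a query so it avoids a key lookup.
-- The query filters on Lastname but outputs Firstname and MiddleInitials.
-- An index on Lastname alone would need a lookup back to the clustered
-- index for those two columns; INCLUDE makes the index cover the query.
CREATE NONCLUSTERED INDEX IX_Citizens_Lastname_Covering
    ON dbo.Citizens (Lastname)
    INCLUDE (Firstname, MiddleInitials);

SELECT Firstname, MiddleInitials
FROM dbo.Citizens
WHERE Lastname = 'Smith';
```

Included columns live only at the leaf level, so they cover the query without widening the key that has to be navigated on every seek.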
Rule 3: Reduce unnecessary overhead
This rule is exactly the same as for the clustered indices. Where possible you need to reduce overhead of your datatypes. I would urge you to watch one of the webcast recordings as I also mention the impact of versioning which has a hidden overhead.
Rule 4: Consolidate
All too often I see lots of indices that are not being used because there are lots of very similar indices on the same table. The main culprit here tends to be people blindly following the advice of the missing index DMVs. Whilst this is a great feature, it’s not one of the smartest and will suggest a lot of indices that are very similar. If you have some kind of automated process to implement these then I’m afraid you could be crippling your system.
This is where I am going to refer you back to the D and W parts of my DWI strategy: you need to understand how your data is being used, and then you can successfully consolidate similar indices, which will free up space and also other resources during your maintenance windows.
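A hypothetical example of the kind of near-duplicate pair the missing index DMVs tend to produce, again using the theoretical Citizens table:

```sql
-- Two near-duplicate indexes, often the product of blindly applied
-- missing-index suggestions.
CREATE NONCLUSTERED INDEX IX_Citizens_Lastname
    ON dbo.Citizens (Lastname);

CREATE NONCLUSTERED INDEX IX_Citizens_Lastname_Firstname
    ON dbo.Citizens (Lastname, Firstname);

-- The first index is redundant: the second has the same leading key column,
-- so it can satisfy every seek the first can. Keep the wider one, drop the other.
DROP INDEX IX_Citizens_Lastname ON dbo.Citizens;
```

One wider index in place of several overlapping ones means less space, fewer structures to maintain on every write, and shorter maintenance windows.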
Rule 5: Index Foreign Keys
The SQL Server optimizer is a wonderful thing, and creating indices on your foreign keys will help it understand the cardinality of your relationships. Adding these indices may cause a slight overhead for inserts, but in most cases this is outweighed by the advantage in reads.
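It is worth remembering that declaring a foreign key does not create an index on it; that is up to you. A minimal, hypothetical sketch (all names invented for illustration):

```sql
-- Hypothetical parent/child pair.
CREATE TABLE dbo.States
(
    StateID   int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    StateName varchar(50)       NOT NULL
);

CREATE TABLE dbo.Cities
(
    CityID   int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    StateID  int NOT NULL
        CONSTRAINT FK_Cities_States REFERENCES dbo.States (StateID),
    CityName varchar(50) NOT NULL
);

-- SQL Server does NOT create this automatically when the FK is declared.
-- It helps joins on the relationship and FK validation on deletes/updates
-- of the parent, which would otherwise scan the child table.
CREATE NONCLUSTERED INDEX IX_Cities_StateID
    ON dbo.Cities (StateID);
```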