Welcome, SQL University Students to another extension class here at Miskatonic University, home to the Fighting Cephalopods (GO PODS!). Never mind the stains on the floor, or the walls…or those really nasty ones on the ceiling. There was a… oh what did the dean call it… an incident last week when one of the students had a little accident after reading Die Vermiss Mysteriis one too many times. But we’re not here to talk about arcane tomes and unspeakable horrors today. No, today we’re here to talk about clustered indexes.
SQL Server storage is really predicated around the idea of clustered indexes. Don’t believe me? Let’s list a few places that require a clustered index:
What about the fact that the default primary key is clustered? Think that was by accident? How about the fact that when you create a clustered index, it becomes the data? Isn’t it interesting that you create a materialized view, an indexed view, by creating a clustered index? Do you think that the fact that all non-clustered index key values point back to the clustered index is significant?
In short, picking a clustered index is an extremely important undertaking. But, most of the time, people leave the cluster on the default primary key or, worst of all, they remove it entirely to “help” performance. I’m going to quickly address each of these choices,
I’m mostly a “defaults” kind of guy and don’t generally mess with the systems when I set them up. So, since the default for clustered indexes is the primary key, that should be where the overwhelming majority of indexes are left in your database design, right? Well, let’s consider this query:
SELECT sod.UnitPrice, sod.OrderQty FROM Sales.SalesOrderDetail AS sod WHERE sod.ProductID = 927;
And let’s assume, just for this discussion, that query is called constantly and that there are other queries, similar to that one, also called all day long. In short, the most common access path into your data is through the ProductID column. Let’s take a quick look at the execution plan for that query:
Ah, a key lookup. The most common query on the system and we’re paying the cost of a key lookup operation, each and every time. Ask yourself, could this column support an index? Since it has one, yes, it’s selective enough to support an index. Could we change the non-clustered index to make it covering? Maybe, but why do that when we could just modify the cluster and achieve good results since this is the most frequently access path to the data.
Just remember that if a clustered index is not unique, SQL Server will add a value, called a uniquifier, to make it so. This could be another consideration when determining where and what to cluster.
Storage is storage, right? Wrong! How things are stored matters. When a table doesn’t have a clustered index, it’s called a heap. Not a pretty name, is it? Darn right it’s not. That’s because what you’re doing is effectively piling your data into a, well, a heap. It’s not stored in any particular manner, so retrieval is certainly less than optimal. How much less? Ah, there’s the question. Let’s create copies of a table:
SELECT * INTO HeapTable FROM Sales.SalesOrderDetail AS sod; SELECT * INTO ClusterTable FROM Sales.SalesOrderDetail AS sod;
We could just compare the heap against the clustered index, but you know that won’t work well. Instead, let’s compare an index on the heap against the cluster. We’ll reuse the query from above as our test. On each table I’m going to create an index:
CREATE INDEX ixHeap ON HeapTable (ProductID) ; CREATE CLUSTERED INDEX ixCluster ON ClusterTable (ProductID) ;
You should already have some idea of how this will work out, but, just in case, here are the execution plans:
The queries both had a single scan, but the first query had 11 reads to get the data and the second had 3 reads to get the data. Again, you could make the other index covering:
CREATE INDEX ixHeap2 ON HeapTable (ProductID) INCLUDE (UnitPrice,OrderQty);
Then when the query is run, you get 2 reads instead of 3. But here’s a question. What happens when it comes time to defrag storage? Oh yeah, you can’t do that with a heap. So while you might be able to put a few indexes on to get good performance, what happens when the next query looks like this:
SELECT sod.UnitPrice, sod.OrderQty, sod.LineTotal FROM dbo.ClusterTable AS sod WHERE sod.ProductID = 927;
Right, the cluster still only has 3 reads, but now you’re back to 11, or more, on the heap.
This idea behind this post is to simply get you to think about your clustered index. There are endless debates about the exact, most perfect, method of using your cluster. My take is rather simple. You’re better off having one than not. Since you have to have one, make sure the system is using it, so place it on the most frequently access path that is also selective enough to support an index. I’ll leave it to the experts to debate the finer points.
See you for the next class in a couple of days, assuming no disruption in the space/time continuum or elder gods ripping apart the planet.