Clustered Index on Multiple Fields

  • Hello,

I have a big table: around 700 million rows, about 30 GB.

The table is structured like this:

    DateId
    OtherId1
    OtherId2
    OtherId3
    Value

    Right now my table is a heap.

The only uniqueness comes from the combination of all four ID columns (the first 4 columns).

No single column greatly filters the data on its own.

The users query the table in various, unmonitored ways...

Not always by date.

They often group the values based on one or more of the IDs.

    What to do ?

I'm convinced that I need a clustered index, but in which column order?

If I build the clustered index in a certain order and a user's query doesn't use the columns accordingly, will that query still benefit from the clustered index?

    Thanks,

the use of every ID column (first 4 columns)

No single column greatly filters the data on its own.

The users query the table in various, unmonitored ways...

Not always by date.

They often group the values based on one or more of the IDs.

...What to do?

I'm convinced that I need a clustered index, but in which column order?

There's an old BI saying: "If you can't measure it, you can't manage it." In the long run I'd encourage you to track the queries against this table. You can use DMVs or Extended Events to identify things like your most common and most expensive queries.
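As a rough starting point for that measurement (a sketch only; the sort column and TOP count are arbitrary choices), the plan-cache DMVs can surface the heaviest statements:

```sql
-- Top 10 cached statements by total logical reads (illustrative)
SELECT TOP (10)
       qs.total_logical_reads,
       qs.execution_count,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                        WHEN -1 THEN DATALENGTH(st.text)
                        ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;
```

Bear in mind this only reflects queries still in the plan cache; Extended Events give a fuller picture over time.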

    That said, I would consider adding a surrogate key (e.g. an identity column) and use the date and surrogate key as your clustered index keys. (Date first, surrogate key second).

    Then, based on your analysis of the queries against that table, I'd add some nonclustered indexes to support the most common and/or expensive queries.

If I build the clustered index in a certain order and a user's query doesn't use the columns accordingly, will that query still benefit from the clustered index?

No. But it will benefit from nonclustered indexes if they're designed correctly.

"I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

AlanB, can you please elaborate on why you suggest an identity column on this table? That is what almost all of my clients do on, well, every single table, and many tables (including this one, from what I know so far) just shouldn't have them.

    Based on other posts of yours I have read I don't think you are a ninny, so what am I missing here? 😉

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

My advice would be to see if you have some combination of fields that uniquely defines the entity. If that's DateId + OtherId1, use that. The typical advice is to look for something that satisfies range queries, though an identity column can be used if you're concerned about insert/update/delete impact. Some advice here:

    http://www.sqlskills.com/blogs/kimberly/the-clustered-index-debate-continues/

Note you haven't mentioned the PK, and I hope you have one. It doesn't need to be clustered, though, so your clustered index doesn't need to be unique.
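To illustrate that point (table and constraint names here are assumptions, since the real ones aren't given), the primary key can be declared nonclustered over the four ID columns while the clustered index stays separate and non-unique:

```sql
-- Hypothetical names; adapt to the real schema.
ALTER TABLE dbo.BigTable
    ADD CONSTRAINT PK_BigTable
    PRIMARY KEY NONCLUSTERED (DateId, OtherId1, OtherId2, OtherId3);

-- The clustered index need not be unique or match the PK.
CREATE CLUSTERED INDEX cl_BigTable_DateId
    ON dbo.BigTable (DateId);
```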

  • That said, I would consider adding a surrogate key (e.g. an identity column) and use the date and surrogate key as your clustered index keys. (Date first, surrogate key second).

Very interesting point there... can you elaborate a little more?

I will have a selling job to do with the DBAs.

They seem to think that a surrogate key on that table will just add overhead and increase table size without any added value.

What is the gain over using all the IDs?

    Then, based on your analysis of the queries against that table, I'd add some nonclustered indexes to support the most common and/or expensive queries.

Most IDs in that table contain very few distinct values.

There are 600 DateIds, 40 OtherId1 values, 350 OtherId2 values, and OtherId3, as far as I know, is rarely used in queries.

My guess is that nonclustered indexes would add little to no benefit over hashing.

But I also take good note of your "measure and manage" advice.

    Thanks a lot !!

    Vince Poirier

  • Sorry - late, started a few times and got pulled away. I'll try to reply to Kevin's statement and Vince's question.

First, I have to reiterate my suggestion that you monitor how this table is queried; it's impossible to make good indexing decisions otherwise. You can accomplish this using DMVs, DMFs, and Extended Events. I have included some links on this below.

    Vince Poirier (7/28/2016)


    That said, I would consider adding a surrogate key (e.g. an identity column) and use the date and surrogate key as your clustered index keys. (Date first, surrogate key second).

Very interesting point there... can you elaborate a little more?

I will have a selling job to do with the DBAs.

They seem to think that a surrogate key on that table will just add overhead and increase table size without any added value.

What is the gain over using all the IDs?

First, why a clustered index on the date key?

Based on what you've told us and what we know, the date column is the one most likely to give you the most bang for the buck. You mentioned that the date column is sometimes used. Using the date column as your cluster key will, for example, improve queries where you're filtering, sorting, and grouping by dates. I'll come back to this in a moment.

What is the gain over using all the IDs?

You want your cluster key to be as narrow as possible, especially on a table with 700M rows. The narrower it is, the easier it is to maintain, and it also reduces the size and overhead of your nonclustered indexes (note that nonclustered indexes use the cluster key as a row pointer).

More important, the order of your cluster key columns determines whether the optimizer can actually use the index. For example, if most of your queries filter on col2 and col2 is not the first key column in your clustered index, the optimizer will be forced to do a scan instead of a seek. In that case you have added the overhead of a clustered index with little or no benefit. If N is the number of columns in your cluster key, the number of possible arrangements is N! (the factorial of N). For four columns that's 24 ((col1, col2, col3, col4), (col1, col2, col4, col3), ..., (col4, col3, col2, col1)). In other words, you have a 1-in-24 chance of picking the best arrangement.

Why the date key/identity combination? Doesn't the identity column add overhead?

This is also in response to Kevin's comment. Yes, I agree 100% that adding identity columns just for the sake of having one is a bad practice. Clustered keys don't have to be unique, but making them unique is a best practice IMHO. See this article by Kimberly Tripp.

If you don't make the clustered index unique, SQL Server will add a hidden column for you behind the scenes, referred to as a uniquifier. A uniquifier is 4 bytes, the same as an int identity column, so no, there's no real added overhead compared to clustering on the date column alone.

Here are some queries you can use to better understand the benefit of using the date column as your cluster key. Note my comments.

1,000,000 rows of sample data:

    USE tempdb;
    GO

    -- some sample data
    CREATE TABLE dbo.SomeTable
    (
        SomeDate date NOT NULL,
        SomeNumber int NOT NULL
    );

    INSERT dbo.SomeTable
    SELECT TOP (1000000) -- 1M rows of sample data
        DATEADD(DAY, CHECKSUM(NEWID()) % 1000, GETDATE() - 2000),
        CHECKSUM(NEWID()) % 1000 + 1
    FROM sys.all_columns a, sys.all_columns b;

Now run this:

    -- Run these with "include actual execution plan" turned on.
    -- WITHOUT a clustered index, each of these is resolved
    -- with a table scan against the heap.
    SET STATISTICS IO ON;
    SELECT * FROM dbo.SomeTable WHERE SomeDate BETWEEN '20100101' AND '20101231';
    SELECT MAX(SomeNumber) FROM dbo.SomeTable WHERE SomeDate = '20100101';
    SET STATISTICS IO OFF;

Table 'SomeTable'. Scan count 1, logical reads 1977, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'SomeTable'. Scan count 1, logical reads 1977, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Now add the identity column and clustered index:

    ALTER TABLE dbo.SomeTable
        ADD SomeID int IDENTITY NOT NULL;

    CREATE CLUSTERED INDEX cl_SomeDate ON dbo.SomeTable (SomeDate, SomeID);

Now run the queries from earlier...

Table 'SomeTable'. Scan count 1, logical reads 457, physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'SomeTable'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
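The earlier point about key order can be seen with this same table: SomeID is the second clustered key column, so a filter on it alone cannot seek (a sketch; check the actual plan):

    -- SomeID is not the leading key of cl_SomeDate, so expect a
    -- clustered index scan rather than a seek for this predicate.
    SELECT * FROM dbo.SomeTable WHERE SomeID = 500;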

To better understand why the date column is a good choice, look at this. Let's say you needed to do some grouping. Run this query with "include actual execution plan" on:

    -- Say your people group by month or year or both.
    -- Note how this query gets an expensive hash match.
    SELECT mo = MONTH(SomeDate), yr = YEAR(SomeDate), AVG(SomeNumber)
    FROM dbo.SomeTable
    GROUP BY MONTH(SomeDate), YEAR(SomeDate);

Now let's add an indexed view:

    CREATE VIEW dbo.SomeTableByMonth
    WITH SCHEMABINDING AS
    SELECT SomeID, mo = MONTH(SomeDate), yr = YEAR(SomeDate), SomeNumber
    FROM dbo.SomeTable;
    GO

    CREATE UNIQUE CLUSTERED INDEX uci_SomeTableByMonth
        ON dbo.SomeTableByMonth (mo, yr, SomeID);

    Now run the same query from above and compare the execution plan.
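One caveat worth adding: automatic indexed-view matching is an Enterprise edition feature; on other editions the query would, as far as I know, need to reference the view directly with the NOEXPAND hint, e.g.:

    -- Query the indexed view directly; NOEXPAND makes the optimizer
    -- use the view's clustered index instead of expanding the view.
    SELECT mo, yr, AVG(SomeNumber) AS AvgNumber
    FROM dbo.SomeTableByMonth WITH (NOEXPAND)
    GROUP BY mo, yr;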

This sums up my thinking on why the date/identity combination might be the best choice for your cluster key.

    Links on how to monitor your queries...

    Performance Tuning With SQL Server Dynamic Management Views.

http://www.johnsansom.com/how-to-identify-the-most-costly-sql-server-queries-using-dmvs/

    http://sqlmag.com/blog/performance-tip-find-your-most-expensive-queries

"I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Thanks very much Alan,

    This is very useful and explains a lot.

I now have everything I need to convince the DBAs.

    Of course we will also make more tests in that direction.

    Thanks and Regards,

    Vince
