Clustered Index on Multiple Fields

  • Hello,

I have a big table: around 700 million rows, about 30 GB.

The table is structured like this:

    DateId
    OtherId1
    OtherId2
    OtherId3
    Value

    Right now my table is a heap.

The only uniqueness comes from the combination of all four ID columns (the first 4 columns).

No single column greatly filters the data on its own.

The users query the table in various, unmonitored ways...

Not always by date.

They often group the values based on one or more of the IDs.

    What to do ?

I'm convinced that I need a clustered index, but in which column order?

If I build the clustered index in a certain order and a user's query doesn't use the columns accordingly, will that query still benefit from the clustered index?

    Thanks,

the use of every ID column (first 4 columns)

No single column greatly filters the data on its own.

The users query the table in various, unmonitored ways...

Not always by date.

They often group the values based on one or more of the IDs.

...What to do?

I'm convinced that I need a clustered index, but in which column order?

There's an old BI saying: "If you can't measure it, you can't manage it." In the long run I'd encourage you to track the queries against this table. You can use DMVs or Extended Events to identify things like your most common and most expensive queries.
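As a rough starting point for that measurement (a sketch only; the sort column and TOP count are arbitrary choices), the plan-cache DMVs can surface the heaviest statements:

```sql
-- Top 10 cached statements by total logical reads (illustrative)
SELECT TOP (10)
       qs.total_logical_reads,
       qs.execution_count,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                        WHEN -1 THEN DATALENGTH(st.text)
                        ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_logical_reads DESC;
```

Bear in mind this only reflects queries still in the plan cache; Extended Events give a fuller picture over time.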

    That said, I would consider adding a surrogate key (e.g. an identity column) and use the date and surrogate key as your clustered index keys. (Date first, surrogate key second).

    Then, based on your analysis of the queries against that table, I'd add some nonclustered indexes to support the most common and/or expensive queries.

If I build the clustered index in a certain order and a user's query doesn't use the columns accordingly, will that query still benefit from the clustered index?

No. But it will benefit from nonclustered indexes if they're designed correctly.

"I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

AlanB, can you please elaborate on why you suggest an identity column on this table? That is what almost all of my clients do on, well, every single table, and many tables (including this one, from what I know so far) just shouldn't have them.

    Based on other posts of yours I have read I don't think you are a ninny, so what am I missing here? 😉

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

My advice would be to see if you have some combination of fields that uniquely defines the entity. If that's DateId + OtherId1, use that. The typical advice is to look for something that satisfies range queries, though an identity column can be used if you're concerned about insert/update/delete impact. Some advice here:

    http://www.sqlskills.com/blogs/kimberly/the-clustered-index-debate-continues/

Note you haven't mentioned the PK, and I hope you have one. It doesn't need to be clustered, though, so your clustered index doesn't need to be unique.
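To illustrate that point (table and constraint names here are assumptions, since the real ones aren't given), the primary key can be declared nonclustered over the four ID columns while the clustered index stays separate and non-unique:

```sql
-- Hypothetical names; adapt to the real schema.
ALTER TABLE dbo.BigTable
    ADD CONSTRAINT PK_BigTable
    PRIMARY KEY NONCLUSTERED (DateId, OtherId1, OtherId2, OtherId3);

-- The clustered index need not be unique or match the PK.
CREATE CLUSTERED INDEX cl_BigTable_DateId
    ON dbo.BigTable (DateId);
```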

  • That said, I would consider adding a surrogate key (e.g. an identity column) and use the date and surrogate key as your clustered index keys. (Date first, surrogate key second).

Very interesting point there... can you elaborate a little more?

I will have a selling job to do with the DBAs.

They seem to think that a surrogate key on that table will just add overhead and increase table size without any added value.

What is the gain over using all the IDs?

    Then, based on your analysis of the queries against that table, I'd add some nonclustered indexes to support the most common and/or expensive queries.

Most IDs in that table contain very few distinct values.

There are 600 DateIds, 40 OtherId1 values, 350 OtherId2 values, and OtherId3, as far as I know, is rarely used in queries.

My guess is that nonclustered indexes would add little to no benefit over hashing.

But I also take good note of your "measure and manage" advice.

    Thanks a lot !!

    Vince Poirier

  • Sorry - late, started a few times and got pulled away. I'll try to reply to Kevin's statement and Vince's question.

First, I have to reiterate my suggestion that you monitor how this table is queried; it's impossible to make good indexing decisions otherwise. You can accomplish this using DMVs, DMFs, and Extended Events. I have included some links on this below.

    Vince Poirier (7/28/2016)


    That said, I would consider adding a surrogate key (e.g. an identity column) and use the date and surrogate key as your clustered index keys. (Date first, surrogate key second).

Very interesting point there... can you elaborate a little more?

I will have a selling job to do with the DBAs.

They seem to think that a surrogate key on that table will just add overhead and increase table size without any added value.

What is the gain over using all the IDs?

First, why a clustered index on the date key?

Based on what you've told us and what we know, the date column is the one most likely to give you the most bang for the buck. You mentioned that the date column is sometimes used. Using the date column as your cluster key will, for example, improve queries where you're filtering, sorting, and grouping by dates. I'll come back to this in a moment.

What is the gain over using all the IDs?

You want your cluster key to be as narrow as possible, especially on a table with 700M rows. The narrower it is, the easier it is to maintain, and it also reduces the size and overhead of your nonclustered indexes (note that nonclustered indexes use the cluster key as a row pointer).

More important, the order of your cluster key columns determines whether the optimizer can actually use the index. For example, if most of your queries filter on col2 and col2 is not the first key column in your clustered index, the optimizer will be forced to do a scan instead of a seek. In that case you have added the overhead of a clustered index with little or no benefit. If N is the number of columns in your cluster key, the number of possible arrangements is N! (the factorial of N). For four columns that's 24 ((col1, col2, col3, col4), (col1, col2, col4, col3), ..., (col4, col3, col2, col1)). In other words, you have a 1-in-24 chance of picking the best arrangement.

Why the date key/identity combination? Doesn't the identity column add overhead?

This is also in response to Kevin's comment. Yes, I agree 100% that adding identity columns just for the sake of having one is a bad practice. Clustered keys don't have to be unique, but making them unique is a best practice IMHO. See this article by Kimberly Tripp.

If you don't make the clustered index unique, SQL Server will add a hidden column for you behind the scenes, referred to as a uniquifier. A uniquifier is 4 bytes, the same as an int identity column, so no, there's no real added overhead compared to clustering on the date column alone.

Here are some queries you can use to better understand the benefit of using the date column as your cluster key. Note my comments.

1,000,000 rows of sample data:

    USE tempdb;
    GO

    -- some sample data
    CREATE TABLE dbo.SomeTable
    (
        SomeDate date NOT NULL,
        SomeNumber int NOT NULL
    );

    INSERT dbo.SomeTable
    SELECT TOP (1000000) -- 1M rows of sample data
        DATEADD(DAY, CHECKSUM(NEWID()) % 1000, GETDATE() - 2000),
        CHECKSUM(NEWID()) % 1000 + 1
    FROM sys.all_columns a, sys.all_columns b;

Now run this:

    -- Run these with "include actual execution plan" turned on.
    -- WITHOUT a clustered index, each of these is resolved
    -- with a table scan against the heap.
    SET STATISTICS IO ON;
    SELECT * FROM dbo.SomeTable WHERE SomeDate BETWEEN '20100101' AND '20101231';
    SELECT MAX(SomeNumber) FROM dbo.SomeTable WHERE SomeDate = '20100101';
    SET STATISTICS IO OFF;

Table 'SomeTable'. Scan count 1, logical reads 1977, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'SomeTable'. Scan count 1, logical reads 1977, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Now add the identity column and clustered index:

    ALTER TABLE dbo.SomeTable
        ADD SomeID int IDENTITY NOT NULL;

    CREATE CLUSTERED INDEX cl_SomeDate ON dbo.SomeTable (SomeDate, SomeID);

Now run the queries from earlier...

Table 'SomeTable'. Scan count 1, logical reads 457, physical reads 0, read-ahead reads 5, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'SomeTable'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
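The earlier point about key order can be seen with this same table: SomeID is the second clustered key column, so a filter on it alone cannot seek (a sketch; check the actual plan):

    -- SomeID is not the leading key of cl_SomeDate, so expect a
    -- clustered index scan rather than a seek for this predicate.
    SELECT * FROM dbo.SomeTable WHERE SomeID = 500;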

To better understand why the date column is a good choice, look at this. Let's say you needed to do some grouping. Run this query with "include actual execution plan" on:

    -- Say your people group by month or year or both.
    -- Note how this query gets an expensive hash match.
    SELECT mo = MONTH(SomeDate), yr = YEAR(SomeDate), AVG(SomeNumber)
    FROM dbo.SomeTable
    GROUP BY MONTH(SomeDate), YEAR(SomeDate);

Now let's add an indexed view:

    CREATE VIEW dbo.SomeTableByMonth
    WITH SCHEMABINDING AS
    SELECT SomeID, mo = MONTH(SomeDate), yr = YEAR(SomeDate), SomeNumber
    FROM dbo.SomeTable;
    GO

    CREATE UNIQUE CLUSTERED INDEX uci_SomeTableByMonth
        ON dbo.SomeTableByMonth (mo, yr, SomeID);

    Now run the same query from above and compare the execution plan.
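One caveat worth adding: automatic indexed-view matching is an Enterprise edition feature; on other editions the query would, as far as I know, need to reference the view directly with the NOEXPAND hint, e.g.:

    -- Query the indexed view directly; NOEXPAND makes the optimizer
    -- use the view's clustered index instead of expanding the view.
    SELECT mo, yr, AVG(SomeNumber) AS AvgNumber
    FROM dbo.SomeTableByMonth WITH (NOEXPAND)
    GROUP BY mo, yr;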

This sums up my thinking on why the date/identity combination might be the best choice for your cluster key.

    Links on how to monitor your queries...

    Performance Tuning With SQL Server Dynamic Management Views.

http://www.johnsansom.com/how-to-identify-the-most-costly-sql-server-queries-using-dmvs/

    http://sqlmag.com/blog/performance-tip-find-your-most-expensive-queries

"I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Thanks very much Alan,

    This is very useful and explains a lot.

I now have everything I need to convince the DBAs.

    Of course we will also make more tests in that direction.

    Thanks and Regards,

    Vince
