How to correctly optimize 20 indexes?

  • Hi all,

    I have a table containing cells that I can access by 4 coordinates (one for the field (YPID), plus CellX, CellY and CellZ for the 3 spatial coords), and each cell has 20 "quality" values (PvQlt1 through PvQlt20).

    Which means the structure looks like this:
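
    Roughly like the following (the datatypes shown here are only illustrative, not the real ones):

    CREATE TABLE dbo.ConfigCell
    (
        ID      int         NOT NULL,  -- surrogate key (see the discussion below)
        YPID    varchar(20) NOT NULL,  -- field identifier, e.g. 'PYpa_6' (assumed type)
        CellX   int         NOT NULL,
        CellY   int         NOT NULL,
        CellZ   int         NOT NULL,
        PvQlt1  float       NULL,      -- quality values (assumed type)
        -- PvQlt2 through PvQlt19 follow the same pattern
        PvQlt20 float       NULL
    );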

    The table has a fixed number of 808416 records. No entry will be added or removed at runtime; the Pv... values will be updated from somewhere else, but that is not what worries me right now.

    Now I can get to the reason for my question: I have a .NET application that runs this kind of query (about 6 times every 5 seconds) to retrieve some values and display them in graphics:

    SELECT AVG(CellX) AS CellX, AVG(CellY) AS CellY, AVG(CellZ) AS CellZ, AVG(PvQlt1) AS PvQlt1
    FROM ConfigCell
    WHERE CellX BETWEEN 1 AND 140 AND CellZ >= -2 AND CellZ < 14 AND YPID = 'PYpa_6'
    GROUP BY (((CellX - 1)/1) + 1), (CellY/1)
    ORDER BY CellX, CellY, CellZ

    The only things that change between requests are the values used in the WHERE clause (but always with these same columns) and the quality column requested in the SELECT part (here PvQlt1).

    When I ask SQL Server to show me the execution plan for this query, it suggests the following improvement:

    CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
    ON [dbo].[ConfigCell] ([YPID],[CellX],[CellZ])
    INCLUDE ([CellY],[PvQlt1])

    And of course, if I ask for the execution plan of the same query but for quality 3, the optimizer proposes the same index but with INCLUDE ([CellY],[PvQlt3]).

    What is the best way to index this table?

    Do I have to create 20 identical nonclustered indexes, each time including a different quality column, or is there a way to group all of them in the same index?
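
    For reference, the "grouped" version I have in mind would be something like this (the index name is just a placeholder):

    CREATE NONCLUSTERED INDEX IX_ConfigCell_AllQualities
        ON dbo.ConfigCell (YPID, CellX, CellZ)
        INCLUDE (CellY,
                 PvQlt1,  PvQlt2,  PvQlt3,  PvQlt4,  PvQlt5,
                 PvQlt6,  PvQlt7,  PvQlt8,  PvQlt9,  PvQlt10,
                 PvQlt11, PvQlt12, PvQlt13, PvQlt14, PvQlt15,
                 PvQlt16, PvQlt17, PvQlt18, PvQlt19, PvQlt20);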

    Thanks a lot for your time and answer,

    Best regards,

    Joël

    If you want to keep the table with the same denormalized structure, change the clustered index to be on CellX, CellZ, and YPID. Notice I didn't say change the PK to this, but you might be able to dump the PK column altogether, and you might not need a PK at all if you can make the clustered index unique, even if you have to add 1 more column (CellY?). Making the new clustered index unique would help tremendously with performance and resource usage.

    Like you said, the table is static, so it's not like you're going to have any page splits due to inserts or mods. Because you'd be changing the clustered index instead of adding a wad o' other indexes, there would be a huge savings in disk space. If you were to simply drop your PK column, the total index size on this table would only need to cover the B-Tree of the clustered index, which would be comparatively quite small.
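
    Something along these lines, maybe (the object names here are assumptions; use your actual PK constraint name):

    -- Drop the existing clustered PK (assumed name) so the clustered index can be moved.
    ALTER TABLE dbo.ConfigCell DROP CONSTRAINT PK_ConfigCell;

    -- Recreate the clustered index on the search columns.
    -- If this combination isn't unique, add CellY to the end of the key list.
    CREATE UNIQUE CLUSTERED INDEX CIX_ConfigCell
        ON dbo.ConfigCell (CellX, CellZ, YPID);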

    Also, what's up with the "divide by 1" thing in the GROUP BY? It looks to me like that's just burning clock cycles. In fact, it looks like the -1 and +1 are self-cancelling as well, and that the GROUP BY could be just GROUP BY CellX, CellY.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

    Thanks for the quick answer!

    I have several questions about your message, maybe also because I forgot to say that I'm quite a newbie to this whole SQL Server thing.

    First, why do you say my table has a denormalized structure?

    Then, regarding the clustered index and the PK, you are right: I could have avoided the use of an ID column, because each row can be found by a unique combination of YPID, CellX, CellY and CellZ. I used an ID only for easier reference as a foreign key in one other table, but I can easily use the 4 columns in that other table instead of the current cell ID.

    In fact, when I created the rows for this ConfigCell table, I generated the ID column based on these 4 others with something like CellX + 1000 * CellY + 100000 * CellZ + 10000000 * YPID.

    So, to sum up, would you recommend that I:

    1) drop the ID column?

    2) set the PK (clustered index) on YPID, CellX, CellY and CellZ?

    And finally, to answer your question about the divide by 1: it's true that in this case it is not relevant, but these are dynamic values, and in this particular case the value was 1.

    The VB.NET code used to generate the SELECT query is the following:

    ' Placeholders: {0} = table name, {1}-{3} = CellX/CellY/CellZ column names,
    ' {4} = quality column, {5}/{6} = CellX min/max, {7}/{8} = CellZ bounds,
    ' {9} = YPID column name, {10} = YPID value, {11}/{12} = grid X/Y divisors.
    sqlCommand = String.Format(
        "SELECT AVG([{1}]) AS {1}, AVG([{2}]) AS {2}, AVG([{3}]) AS {3}, AVG([{4}]) AS {4} " &
        "FROM [{0}] " &
        "WHERE [{1}] BETWEEN {5} AND {6} AND " &
        "[{3}] >= {7} AND [{3}] < {8} AND " &
        "[{9}] = {10} " &
        "GROUP BY ((([{1}] - {5})/{11}) + {5}), ([{2}]/{12}) " &
        "ORDER BY [{1}], [{2}], [{3}]",
        tConfigCell.TableName,
        tConfigCell.Fields.CellX.ToString, tConfigCell.Fields.CellY.ToString,
        tConfigCell.Fields.CellZ.ToString, qltField,
        subpile.CellDMin.CellX, subpile.CellDMax.CellX,
        CInt(layer.LowPos.Value), CInt(layer.HeighPos.Value),
        tConfigCell.Fields.YPID.ToString, tDBTable.PrepareStringForSQL(subpile.Parent.ID),
        grid.X, grid.Y)

    The values {11} and {12} used as divisors in the GROUP BY are grid.X and grid.Y, because I can choose a custom "grid" (1x1, 3x3, 5x5, ...) to display on the graphs.

    The value {5} used in the GROUP BY (((CellX - 1)/1) + 1) is the min value for CellX (also used as the first argument of the BETWEEN); here again it was 1 in this example, but on a different YardPart (YPID) it could be something else. For example, with a min of 1 and grid.X = 3, the integer division puts CellX values 1-3 in group 1, 4-6 in group 2, and so on.

    Joël

  • Working from the bottom up on your response...

    Understood on the "divide by one" thing. Parameterising that and using "1" for that particular situation makes your code a lot easier than possibly having multiple decisions as to which formula to use, if the value of "1" is used infrequently and you don't mind burning the occasional unnecessary clock cycles. Well done there! 🙂

    On the recommendations...

    1) drop the ID column?

    "It Depends". I'm right there with you on using an IDENTITY column to make an easy unique reference to any given row, even when there's an alternate unique key based on natural values, especially when such an AK is a composite. If you decide to keep it (and I probably would, based on what I just said), I wouldn't waste the clustered index on it in this mostly static table.

    2) set the PK (clustered index) on YPID, CellX, CellY and CellZ?

    Just to be clear, the PK doesn't have to be the clustered index. That's just the default for when you don't specify and the clustered index hasn't been created yet.
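
    For example, you can keep the PK on the ID column but declare it nonclustered, which leaves the clustered index free for other columns (the constraint name is just an assumption):

    ALTER TABLE dbo.ConfigCell
        ADD CONSTRAINT PK_ConfigCell PRIMARY KEY NONCLUSTERED (ID);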

    Also, yes, the clustered index should be moved to other columns, but maybe not in the order that you have them. If you look at parameters {1}, {3}, and {9} in the WHERE clause, they're column names, and I would order those columns in the clustered index from left to right by their most common order of selectivity, with the most selective on the left and the least selective on the right.
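
    A rough way to gauge that order (distinct counts are only a proxy for selectivity, but they're a decent starting point):

    SELECT COUNT(DISTINCT CellX) AS CellXValues,
           COUNT(DISTINCT CellZ) AS CellZValues,
           COUNT(DISTINCT YPID)  AS YPIDValues
    FROM dbo.ConfigCell;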

    According to your original WHERE clause that you posted...

    WHERE CellX BETWEEN 1 AND 140 AND CellZ >= -2 AND CellZ < 14 AND YPID = 'PYpa_6'

    ... the CellY column didn't even enter the mix. I only included CellY in my recommendation because I think it will be necessary to make each row in the index unique. So far as I'm concerned, because all of the other indexes carry the columns of the clustered index behind the scenes, having a non-unique clustered index can cause a world of hurt insofar as performance and disk storage go.

    Based on the original WHERE clause above, my recommendation for the clustered index would be CellX, CellZ, and YPID. If that's not a unique combination, then add CellY to the end of that. If that still doesn't make it unique, then post back and I'll tell you how to make it unique.
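
    Here's a quick way to check whether that combination is already unique (any rows returned mean it's not):

    SELECT CellX, CellZ, YPID, COUNT(*) AS Dupes
    FROM dbo.ConfigCell
    GROUP BY CellX, CellZ, YPID
    HAVING COUNT(*) > 1;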

    Shifting gears to your first question, your PvQlt columns are what make the table denormalized. They can have NULL values according to your table definition, and you can't search for a quality value unless you also know which quality column the value might be in. That also means that if you have to do searches for quality values, you might have to index all 20 of those columns to get the required performance out of your queries.

    While it looks more efficient storage-wise as you have the table now, remember that indexes are basically a sorted duplication of data and that the clustered index keys will also be added to each non-clustered index. Having an index on each PvQlt column could possibly consume much more space than if each PvQlt value were on its own row, even if all the other non-PvQlt values needed to be repeated. Of course, that would also be a denormalization, but it would make searches for PvQlt values a lot easier and faster.

    To bring this table to 3rd normal form, you'd have to split this table up into other tables quite a bit, and that might be serious overkill. But, again depending on what your searches will be, normalizing the PvQlt columns into a separate table using the IDENTITY column as the join value might be very worthwhile. It might help performance a bit, as well, because the base table would be able to fit more rows per page in the clustered index (which is actually the table data at the leaf level).
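
    For example, that separate quality table might look something like this (the names and datatypes here are just illustrative guesses):

    CREATE TABLE dbo.CellQuality
    (
        CellID    int     NOT NULL,  -- joins back to ConfigCell.ID (the IDENTITY column)
        QltNumber tinyint NOT NULL,  -- which quality: 1 through 20
        QltValue  float   NULL,      -- assumed datatype for the quality value
        CONSTRAINT PK_CellQuality PRIMARY KEY CLUSTERED (CellID, QltNumber)
    );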

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
