Clustered Index strategy

  • I have a single-table database. The table holds health claims information (service codes and dates, providers, consumers) from another agency. It is updated once a month by inserting ~100k to 200k new rows (it currently has 10 million). Existing rows are never updated. I am going to move it to another server and am trying to figure out the best clustered index.

    It seems that any column that would be useful for queries would cause massive page splits, since the data is so diverse for any given month. If I used the month/year paid, it would cause no splits, since that is the order in which the data is sent to us. We do query on (BETWEEN) service begin and end dates, and the claims do come in roughly in order by those dates.

    What does anyone suggest?

    TIA

    John

    John Deupree

  • If the table is inserted into only once a month, the inserts are not your bottleneck. Nevertheless, you could use a proper fill factor (say, 85%) to avoid page splits. After each load you would rebuild the indexes on the table and run other kinds of maintenance.

    Because SELECT is the most frequent and primary use of the table, I would put a clustered index on the service begin and end dates (maybe only the begin date?). I think testing will help you finalize the decision.
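    For illustration, a minimal sketch of that approach; the table and column names here are assumptions, not taken from the thread:

        -- Hypothetical claims table; adjust names and types to the real schema.
        CREATE CLUSTERED INDEX IX_Claims_ServiceDates
            ON dbo.Claims (ServiceBeginDate, ServiceEndDate)
            WITH FILLFACTOR = 85;  -- leave ~15% free per page for out-of-order inserts

        -- After each monthly load, rebuild to remove fragmentation and reapply
        -- the fill factor (SQL Server 2000 syntax; on later versions use
        -- ALTER INDEX ... REBUILD instead).
        DBCC DBREINDEX ('dbo.Claims', 'IX_Claims_ServiceDates', 85);

    A range query such as WHERE ServiceBeginDate BETWEEN @from AND @to can then seek directly into the clustered index instead of scanning the table.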

  • That sounds like a good strategy to me....

  • It would help to know the table structure, to see if anything else would be a better choice, and also which items are common in your selections. I do agree your fill factor is important.

    My thought is that you may want to use an IDENTITY column, though, and add a minimal number of non-clustered indexes for query coverage. The reason is that the clustered index will be based on the IDENTITY, so the FILLFACTOR can remain 100% (or 0), since rows are stored in insert order. Each non-clustered index will take the main search item (say you use the Social Security number) and store the corresponding IDENTITY values. Under the hood it looks like this (a code sketch follows the example below).

    The root and intermediate levels of the non-clustered index navigate down to the final leaf level, which holds:

        SSN   Corresponding IDENTITY
        a     1
        a     5
        a     500
        b     4
        b     200
        b     1000
        c     48
        c     52
        d     3

    in order, making searches quick.

    The search then takes the corresponding IDENTITY values and finds the rows in the clustered index.
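    A minimal sketch of that design; the table, column, and index names are assumptions for illustration:

        -- Hypothetical table: the IDENTITY column is the clustered key, kept 100% full.
        CREATE TABLE dbo.Claims (
            ClaimID      INT IDENTITY(1, 1) NOT NULL,
            SSN          CHAR(9)  NOT NULL,
            ServiceBegin DATETIME NOT NULL,
            ServiceEnd   DATETIME NOT NULL,
            CONSTRAINT PK_Claims PRIMARY KEY CLUSTERED (ClaimID) WITH FILLFACTOR = 100
        );

        -- Non-clustered index on the main search column; each leaf row carries the
        -- clustered key (ClaimID) as the bookmark back into the clustered index.
        CREATE NONCLUSTERED INDEX IX_Claims_SSN
            ON dbo.Claims (SSN) WITH FILLFACTOR = 100;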

    Now the big concern is the performance of the new inserts once a month. Drop all non-clustered indexes before the insert (keeping the CLUSTERED index at 100%, because inserts and page splits will always be at the last page); when the inserts complete, rebuild or defragment the clustered index to tighten up the end (see the explanation below); and finally re-add the non-clustered indexes. A sketch of the whole cycle follows the explanation.

    OK, now the reason why you leave the clustered index at 100% and then rebuild/defrag when done: IDENTITY values always insert at the end of the cluster. As a page fills it will split (50% stays on the old page, 50% goes to the new one), and the new page will then accept inserts until full and split again. Depending on the size of the table, this can cause page splits from the leaf pages up through the root; when the root splits, it becomes new intermediate pages and a new root is built. When the load is done you will have lots of partial pages, so do the cleanup, then add the non-clustered indexes back (100% full, as you are not worried about data changes until the next insert date), and you are good to go with good performance.
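    A sketch of that monthly cycle, using the assumed names from the table definition above; the file path and format options are placeholders:

        -- 1. Drop the non-clustered indexes before the monthly load.
        DROP INDEX dbo.Claims.IX_Claims_SSN;   -- SQL Server 2000 syntax

        -- 2. Load the monthly file; IDENTITY keeps every insert at the end of the cluster.
        BULK INSERT dbo.Claims
            FROM '\\fileserver\drops\claims_current.dat'   -- placeholder path
            WITH (FIELDTERMINATOR = '|', ROWTERMINATOR = '\n');

        -- 3. Rebuild the clustered index to tighten up the partial pages left by the load.
        DBCC DBREINDEX ('dbo.Claims', 'PK_Claims', 100);

        -- 4. Re-add the non-clustered indexes, 100% full until the next load.
        CREATE NONCLUSTERED INDEX IX_Claims_SSN
            ON dbo.Claims (SSN) WITH FILLFACTOR = 100;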

  • Well said.

  • Thanks for the reply. I'll try it.

    John

    John Deupree

  • We are actually working on the same thing. We have about 20 million claims records in the file, and we also get monthly data drops.

    Please contact me at sahmed@mangocorp.com. We may be able to share some war stories...

    Regards,

    Shabbir Ahmed

  • Why not use a partitioned view?

    If you are adding new records only once per month, a new table per month could be a good candidate. That way you could keep the insert and update times constant, and also speed up your selects as long as those selects hit a single month's period. A sketch follows below.

    And afterwards you could also think about distributed partitioned views.
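    A minimal sketch of a local partitioned view; the table names, the MonthPaid partitioning column, and the date ranges are assumptions. The CHECK constraint on each member table is what lets the optimizer skip the irrelevant months:

        -- One member table per month; the CHECK constraint defines the partition
        -- range, and the partitioning column must be part of the primary key.
        CREATE TABLE dbo.Claims_200301 (
            ClaimID   INT      NOT NULL,
            MonthPaid DATETIME NOT NULL
                CHECK (MonthPaid >= '20030101' AND MonthPaid < '20030201'),
            CONSTRAINT PK_Claims_200301 PRIMARY KEY (ClaimID, MonthPaid)
        );

        CREATE TABLE dbo.Claims_200302 (
            ClaimID   INT      NOT NULL,
            MonthPaid DATETIME NOT NULL
                CHECK (MonthPaid >= '20030201' AND MonthPaid < '20030301'),
            CONSTRAINT PK_Claims_200302 PRIMARY KEY (ClaimID, MonthPaid)
        );

        -- The view unions the monthly tables; a query filtered on MonthPaid
        -- touches only the matching member tables.
        CREATE VIEW dbo.ClaimsAll
        AS
            SELECT ClaimID, MonthPaid FROM dbo.Claims_200301
            UNION ALL
            SELECT ClaimID, MonthPaid FROM dbo.Claims_200302;

    Each monthly load then becomes a fast insert into (or creation of) a single new member table, leaving the older months untouched.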



    Bye
    Gabor
