DWH : Fact table : index creation.

  • Below is the schema of my FACT table

    Table: Fact.POLICY

    Size: 165 GB

    Policy_transactional_ID (surrogate key)

    Policy_ID

    Audit_Trail_ID (FK)

    Journalization_ID (FK)

    Program_ID (FK)

    Product_ID (FK)

    Producer_ID (FK)

    Surplus_Line_ID (FK)

    Policy_ID (FK)

    Account Number

    Acquisition Costs

    monthly cost

    total_gain_producer_amt

    Amount

    Attachment Point

    Base Amount

    ...

    starttime

    Endtime

    approach 1:

    Policy_transactional_ID is an identity column and is not used by any query; should I make it the PK with a nonclustered index?

    clustered index: Policy_ID + starttime

    and other nonclustered indexes on (every ID column + starttime)

    approach 2:

    Policy_transactional_ID: PK + clustered index

    and other nonclustered indexes on (every ID column + starttime)

    People say we should go with approach 1, but why physically align/sort the data on a column that is not used in any query? My take would be approach 2.

    Please guide.
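    Sketched as DDL, the two approaches look roughly like this (index names and the commented alternative are my own illustrations; the post does not give them):

    ```sql
    -- Approach 1: nonclustered PK on the identity, cluster on (Policy_ID, starttime).
    ALTER TABLE Fact.POLICY
        ADD CONSTRAINT PK_Policy PRIMARY KEY NONCLUSTERED (Policy_transactional_ID);
    CREATE CLUSTERED INDEX CIX_Policy ON Fact.POLICY (Policy_ID, starttime);

    -- Approach 2: clustered PK on the identity instead.
    -- ALTER TABLE Fact.POLICY
    --     ADD CONSTRAINT PK_Policy PRIMARY KEY CLUSTERED (Policy_transactional_ID);

    -- Under either approach, a supporting nonclustered index per ID column, e.g.:
    CREATE NONCLUSTERED INDEX IX_Policy_Product ON Fact.POLICY (Product_ID, starttime);
    ```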

    -------Bhuvnesh----------
    I work only to learn Sql Server...though my company pays me for getting their stuff done;-)

    It depends: what is the ETL process? Is it a snapshot fact (as-is), incremental (insert new, update existing), or contra/correction logic? That decision will have a big impact on which route you go down.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • Jason-299789 (2/15/2013)


    It depends: what is the ETL process? Is it a snapshot fact (as-is), incremental (insert new, update existing), or contra/correction logic? That decision will have a big impact on which route you go down.

    Agreed +1

    Putting the clustered index on the starttime column could speed up your SELECT statements, provided they filter on starttime, but it will also slow down your inserts. In addition, I would look at the queries that are hitting this table and ask whether you really need all those nonclustered indexes. If yes, fair enough, but it's worth doing the check just to see. 🙂

    If you are doing a nightly load rather than an incremental one, you could test your ETL by dropping all of your indexes before the load and re-creating them afterwards, then measure the impact to see which approach is more efficient.
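    One way to sanity-check which indexes are actually being used is the index-usage DMV; a sketch (interpret with care, as these stats reset on instance restart):

    ```sql
    -- Rough check of which indexes on Fact.POLICY are read vs. only maintained.
    SELECT  i.name,
            us.user_seeks, us.user_scans, us.user_lookups,  -- read activity
            us.user_updates                                  -- write maintenance cost
    FROM sys.indexes AS i
    LEFT JOIN sys.dm_db_index_usage_stats AS us
           ON us.object_id   = i.object_id
          AND us.index_id    = i.index_id
          AND us.database_id = DB_ID()
    WHERE i.object_id = OBJECT_ID('Fact.POLICY');
    ```

    An index with high user_updates and no seeks/scans/lookups over a representative workload is a candidate for removal.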

    Cheers

    Andy

    ==========================================================================================================================
    A computer lets you make more mistakes faster than any invention in human history - with the possible exceptions of handguns and tequila. Mitch Ratcliffe

  • Jason-299789 (2/15/2013)


    It depends: what is the ETL process? Is it a snapshot fact (as-is) or incremental (insert new, update existing)?

    It will be "incremental with insert new, update existing".

    And how much will an identity column as the surrogate key help the other nonclustered indexes, given that the SK will not be part of any query?


  • Andy Hyslop (2/15/2013)


    could test your ETL by dropping all of your indexes and re-creating them, and determining the impact to see which is more efficient

    Is it cost-intensive if we drop the indexes and rebuild them every time we load the data?


  • Bhuvnesh (2/18/2013)


    Andy Hyslop (2/15/2013)


    could test your ETL by dropping all of your indexes and re-creating them, and determining the impact to see which is more efficient

    Is it cost-intensive if we drop the indexes and rebuild them every time we load the data?

    Sorry for this answer, but it does depend on the volume of data.

    I assume you were referring to the nightly loads (dropping indexes for an incremental load will carry more overhead than it's worth).

    For a nightly load, yes, dropping and re-creating the indexes will have an impact, but you will need to assess both strategies and see which performs better, i.e. perform two loads, one keeping the indexes in place and one dropping and re-creating them, and compare.

    In my experience dropping and re-creating generally works faster and has the added benefit of generating fresh, unfragmented indexes.
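    As a sketch, the nightly pattern would look something like this (the index name is illustrative):

    ```sql
    -- Before the nightly load: drop the nonclustered indexes.
    DROP INDEX IX_Policy_Product ON Fact.POLICY;

    -- ...bulk load into Fact.POLICY here...

    -- After the load: re-create them, freshly built and unfragmented.
    CREATE NONCLUSTERED INDEX IX_Policy_Product
        ON Fact.POLICY (Product_ID, starttime);
    ```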

    Andy


  • Andy Hyslop (2/18/2013)


    In my experience dropping and re-creating generally works faster and has the added benefit of generating fresh, unfragmented indexes.

    +1, but I would add that there are two indexes I wouldn't drop: the first is the clustered index, the other is the covering business-key index for the ETL process, especially on substantial tables.

    But I would agree it's very much a case of suck it and see which works best for you: try the load without dropping indexes, then try the same load with the indexes dropped, and see which gives you the best solution.


  • Bhuvnesh (2/18/2013)


    Andy Hyslop (2/15/2013)


    could test your ETL by dropping all of your indexes and re-creating them, and determining the impact to see which is more efficient

    Is it cost-intensive if we drop the indexes and rebuild them every time we load the data?

    Dropping the indexes during ETL would mean miserable performance for any users querying the table while the load runs, and it also adds overhead when building the indexes back; the former means a potential performance impact for other queries.

    Regarding indexing strategy, here are my two cents:

    A clustered unique index supporting your PK.

    Non-clustered, non-unique indexes on every FK.

    Other non-clustered, non-unique indexes depending on your actual queries.

    Hope this helps.
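    As DDL, that strategy would look roughly like this (index names are illustrative, and only two of the FKs are shown):

    ```sql
    -- Clustered unique index supporting the PK.
    ALTER TABLE Fact.POLICY
        ADD CONSTRAINT PK_Policy PRIMARY KEY CLUSTERED (Policy_transactional_ID);

    -- Nonclustered, non-unique index on each FK:
    CREATE NONCLUSTERED INDEX IX_Policy_Product  ON Fact.POLICY (Product_ID);
    CREATE NONCLUSTERED INDEX IX_Policy_Producer ON Fact.POLICY (Producer_ID);
    -- ...and so on for the remaining FKs, plus query-driven indexes as needed.
    ```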

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • Under no circumstances should you cluster this table on Policy_transactional_ID.

    Cluster the table by the most common WHERE conditions; make the clustered key unique by including Policy_transactional_ID at the end of the clustered key, but only if necessary.

    [You can add a separate unique nonclustered PK if you absolutely need it, although I don't think you should need it in this type of situation.]

    Besides, the dates added should be ascending anyway, aside from corrections. Sort the new data into the clustering-key sequence before inserting it.

    Keep in mind: a given row is INSERTed only once, but SELECTed thousands (or millions!) of times.
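    A sketch of that layout, assuming (starttime, Policy_ID) can contain duplicates so the identity is appended only to make the key unique:

    ```sql
    -- Cluster on the common WHERE columns; identity appended only for uniqueness.
    CREATE UNIQUE CLUSTERED INDEX CIX_Policy
        ON Fact.POLICY (starttime, Policy_ID, Policy_transactional_ID);

    -- Optional: a separate nonclustered PK, only if something truly requires one.
    -- ALTER TABLE Fact.POLICY
    --     ADD CONSTRAINT PK_Policy PRIMARY KEY NONCLUSTERED (Policy_transactional_ID);
    ```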

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher (2/18/2013)


    Under no circumstances should you cluster this table on Policy_transactional_ID.

    Cluster the table by the most common WHERE conditions; make the clustered key unique by including Policy_transactional_ID at the end of the clustered key, but only if necessary.

    Let's take a scenario:

    I have the PK on Policy_transactional_ID, and most of the queries use starttime and policy_number in the WHERE clause.

    Which option will work better for the clustered index?

    approach 1: starttime + policy_number (SQL Server itself adds a hidden uniquifier to make the records unique)

    approach 2: starttime + policy_number + Policy_transactional_ID

    And which approach will have more overhead from a space and resource perspective?


  • Scott,

    I would disagree with you on the "under no circumstances" comment, for a couple of reasons:

    1) The risk of fragmentation is significant on the clustered index, especially if it uses an alphanumeric key.

    2) On DWs you don't run targeted queries such as "show me policy X"; you run "show me all policies written in the last timeframe".


  • Bhuvnesh (2/18/2013)



    Let's take a scenario:

    I have the PK on Policy_transactional_ID, and most of the queries use starttime and policy_number in the WHERE clause.

    Which option will work better for the clustered index?

    approach 1: starttime + policy_number (SQL Server itself adds a hidden uniquifier to make the records unique)

    approach 2: starttime + policy_number + Policy_transactional_ID

    And which approach will have more overhead from a space and resource perspective?

    I would think that starttime + policy_number would inherently be unique. If it is unique, just define the index as UNIQUE yourself and SQL won't add a "uniquifier".

    If it's not unique, either 1 or 2 will be OK; most people would probably prefer 2 just because it's clearer what is happening, since you can explicitly specify UNIQUE in the index.

    The combined clustered index will take slightly more disk space for the non-leaf clustered index levels on ( starttime, policy_number [, Policy_transactional_ID] ) than on just ( Policy_transactional_ID ). However, if the proper clustered index prevents you from having to create a second, nonclustered index on ( starttime, policy_number ), then you've saved disk space anyway and, much more importantly, you've saved additional I/O every time a row is deleted, inserted or updated.
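    The two variants side by side, as a sketch (column names as in the question above):

    ```sql
    -- Variant 1: non-unique; SQL Server adds a hidden 4-byte uniquifier
    -- to any duplicate (starttime, policy_number) keys.
    -- CREATE CLUSTERED INDEX CIX_Policy ON Fact.POLICY (starttime, policy_number);

    -- Variant 2: explicitly unique; no uniquifier is ever added.
    CREATE UNIQUE CLUSTERED INDEX CIX_Policy
        ON Fact.POLICY (starttime, policy_number, Policy_transactional_ID);
    ```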


  • Jason-299789 (2/19/2013)


    Scott,

    I would disagree with you on the "under no circumstances" comment, for a couple of reasons:

    1) The risk of fragmentation is significant on the clustered index, especially if it uses an alphanumeric key.

    2) On DWs you don't run targeted queries such as "show me policy X"; you run "show me all policies written in the last timeframe".

    You're simply wrong.

    "Show me all Policies written in the last timeframe"

    Exactly! So the table should be clustered by timeframe in that case, not some dopey meaningless identity column.

