indexing help for staging table

  • Any help is appreciated, I'm relatively new to this.

    I am moving data from a large warehouse (hundreds of billions of rows) that I will be using for later analysis.

    I am grabbing all the data for a given day from the warehouse, moving it to a staging table, selecting only the data in the staging table for the customers I'm interested in, and then merging this into a final table that has the data I need. Each day I pull from the warehouse has about 20 million records. I then select the data for the 3 million customers I need and move that into the final table. I then repeat the loop by truncating the staging table and then pulling the next day's data from the warehouse.

    The reason I'm pulling the entire day's data (rather than just the customers I'm interested in is because this runs much faster given the way the warehouse is organized (i.e., by date rather than customer)).

    My question is, how should I be indexing the staging and final tables to make this quicker? Any suggestions? I suspect my question might be a little vague and I'd be happy to fill in details if necessary.

    Thanks in advance!

    B

  • brucecphillips (2/5/2011)


    Any help is appreciated, I'm relatively new to this.

    I am moving data from a large warehouse (hundreds of billions of rows) that I will be using for later analysis.

    I am grabbing all the data for a given day from the warehouse, moving it to a staging table, selecting only the data in the staging table for the customers I'm interested in, and then merging this into a final table that has the data I need. Each day I pull from the warehouse has about 20 million records. I then select the data for the 3 million customers I need and move that into the final table. I then repeat the loop by truncating the staging table and then pulling the next day's data from the warehouse.

    The reason I'm pulling the entire day's data (rather than just the customers I'm interested in is because this runs much faster given the way the warehouse is organized (i.e., by date rather than customer)).

    My question is, how should I be indexing the staging and final tables to make this quicker? Any suggestions? I suspect my question might be a little vague and I'd be happy to fill in details if necessary.

    When you say data warehouse is organized by "date", does it means particular table is partitioned by "date"?

    Any reason not to filter by Date AND Customer when you extract from the Data Warehouse?

    Assuming you cannot (for any clear or obscure reason) creating an index on your staging table mostly depends on cardinality meaning how many rows you are interested in on such a staging table, if the number is greather than 15% of the rows forget about an index and let the system do a full table scan, if the number is less than 15% you may indeed benefit from an index built on "customer" column.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.

Viewing 2 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply