Can't Import Data Fast Enough - Optimize or rethink the design?

  • Hey Jeff, as always thanks for your advice.

    Sounds like I need to create a new table with the right schema and move all the existing data into it as trying to make these changes with the existing data will take hrs (days?) becaue there are so many rows.

    The problem with waiting until the log file rolls is that some servers will take 6+ hrs to see enough activity to roll the log file whereas some server are rolling every 30-90 meetings, plus many times we need the data available within 15-30min after it happened. I like the idea of adding a merge flag, but wouldn't I still have to search the main table to see if exists to set the merge flag, i.e. I would still pay the cost.

    also, what SQL statements do you use to move the data from staging to main table? I do an INSERT INTO with a nested SELECT, but I have no idea if that is the most efficient way to do it.

  • winterminute (9/27/2009)


    Hey Jeff, as always thanks for your advice.

    Sounds like I need to create a new table with the right schema and move all the existing data into it as trying to make these changes with the existing data will take hrs (days?) becaue there are so many rows.

    The problem with waiting until the log file rolls is that some servers will take 6+ hrs to see enough activity to roll the log file whereas some server are rolling every 30-90 meetings, plus many times we need the data available within 15-30min after it happened. I like the idea of adding a merge flag, but wouldn't I still have to search the main table to see if exists to set the merge flag, i.e. I would still pay the cost.

    also, what SQL statements do you use to move the data from staging to main table? I do an INSERT INTO with a nested SELECT, but I have no idea if that is the most efficient way to do it.

    There's no way to move data without INSERT INTO SELECT. The key to efficiency will be what the clustered index is and whether the data being inserted will cause any page splits.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I don't create an indexes on my staging table, should I? Once I implement a merge flag, then it would make sense to add an index for that one, but right now I just import in the same order they are in the log file and then do a INSERT INTO/SELECT from the staging table without any order clause.

    What do logic do you use to set your Merge flag? i.e. You've just imported x rows into your staging, now you need to figure out if they already exist, and I assume you set a column that you then key off later.

    It was really intersting to see the execution plan in action. I prototyped your suggestion on an empty table. However, without all the data from the other table, I think the perf numbers are misleading so I can't really compare the prototype execution plan to the existing one. For example, on the new table, when I insert one log file the PrimaryKey cost becomes 84%. But still fun to compare different keys/indexes/

    our idea (IdColumn=Clustered Index, 3 Column composite PK=NonClustered) = ClusterdIndexInsert cost 84%

    My idea, (IdColumn=PK=ClusteredINdex) and the ClusterdIndexInsert cost was 87%.

    So, you definately know your stuff 🙂

  • Dupe Post - Not sure why...

  • winterminute (9/27/2009)


    I don't create an indexes on my staging table, should I? Once I implement a merge flag, then it would make sense to add an index for that one, but right now I just import in the same order they are in the log file and then do a INSERT INTO/SELECT from the staging table without any order clause.

    What do logic do you use to set your Merge flag? i.e. You've just imported x rows into your staging, now you need to figure out if they already exist, and I assume you set a column that you then key off later.

    It was really intersting to see the execution plan in action. I prototyped your suggestion on an empty table. However, without all the data from the other table, I think the perf numbers are misleading so I can't really compare the prototype execution plan to the existing one. For example, on the new table, when I insert one log file the PrimaryKey cost becomes 84%. But still fun to compare different keys/indexes/

    our idea (IdColumn=Clustered Index, 3 Column composite PK=NonClustered) = ClusterdIndexInsert cost 84%

    My idea, (IdColumn=PK=ClusteredINdex) and the ClusterdIndexInsert cost was 87%.

    So, you definately know your stuff 🙂

    If all you have is a clustered index on the staging table, guess where all of the costs on an INSERT will be? Most of that cost stuff means nothing in relation to actual performance. I can show you where something that shows up as 0 cost takes all of the time and something that shows up as 100 percent of the cost takes no time.

    The merge flag is simple... you simply do an update on the staging table based on a WHERE EXISTS just like you would if you were doing the actual insert to the final table. The key here is that the temp table is much smaller and preconditioning the data in this manner will be much faster.

    So far as indexes go on the merge flag... just about useless because of the low cardinality. You can try it if you want but I think it may actually slow things down.

    So far as other indexes on the temp table goes, you need to play with whether they actually help or not. Indexes on columns for joins to the main table to determine what the merge column will contain may help performance a bit... or not. A 50K row temp table almost doesn't need an index because it'll usually fit in memory (it's only 10MB). But, like I said, play with it. Just don't add the indexes to the temp table until AFTER the data has been imported.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 5 posts - 16 through 20 (of 20 total)

You must be logged in to reply to this topic. Login to reply