SSIS Data Flow "bulk insert" VS. INSERT INTO

  • We are currently pulling in 1+ million rows of data for a monthly process, all records processed in the prior month. This data comes from other sources such as DB2. We take this data from the source and use an SSIS data flow task to put it into a 'staging' table. We then do some data 'fixing' and converting on that table in a SQL step. At the end of this SQL step we do an INSERT INTO from the 'staging' table to our monthly table that contains history. I'm told that this process is creating a large amount of logging because of the INSERT INTO, and we are being advised to do any data 'fixing' or converting in the SSIS data flow task and then insert the records directly into the monthly table instead of the 'staging' table.

    Current flow:

    Source table --> SSIS Data flow(copy data) --> 'Staging' table --> SQL(data fix) and INSERT INTO monthly table

    Proposed flow:

    Source table --> SSIS Data flow(data fix) --> monthly table

    My concerns are:

    1) Doing it this way will make errors or lost data harder to detect.

    2) Problems will take longer to fix.

    3) Any new 'fixes' will take longer, or be harder, to apply.

    4) The process may take longer to run.

    Is the issue of the amount of logging a huge problem from a DBA standpoint? (I'm not a DBA.) I would think you would want this type of data logged. Why wouldn't the data flow task log this? How are others handling this type of scenario? Any other ways to handle this?

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

  • The logging they are referring to is transaction logging. It is eating up disk space and making your backups larger.

    You could keep the process the same, but break the INSERT INTO down into, say, 10,000 rows at a time with a WHILE loop wrapped around a BEGIN TRAN and COMMIT TRAN.
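
    Roughly something like this (table, column, and key names are made up, you would use your own; committing in batches lets the log space for each committed batch be reused or backed up instead of one giant open transaction holding the log):

    DECLARE @BatchSize INT = 10000,
            @RowsMoved INT = 1;

    WHILE @RowsMoved > 0
    BEGIN
        BEGIN TRAN;

        -- Move the next batch of rows that are not yet in the monthly table.
        INSERT INTO dbo.monthly_table (pNum, pMod)
        SELECT TOP (@BatchSize) s.pNum, s.pMod
        FROM   dbo.staging_table AS s
        WHERE  NOT EXISTS (SELECT 1
                           FROM   dbo.monthly_table AS m
                           WHERE  m.pNum = s.pNum);   -- assumes pNum identifies a row

        SET @RowsMoved = @@ROWCOUNT;

        COMMIT TRAN;
    END;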

  • below86 (3/13/2014)


    We are currently pulling in 1+ million rows of data for a monthly process, all records processed in the prior month. This data comes from other sources such as DB2. We take this data from the source and use an SSIS data flow task to put it into a 'staging' table. We then do some data 'fixing' and converting on that table in a SQL step. At the end of this SQL step we do an INSERT INTO from the 'staging' table to our monthly table that contains history. I'm told that this process is creating a large amount of logging because of the INSERT INTO, and we are being advised to do any data 'fixing' or converting in the SSIS data flow task and then insert the records directly into the monthly table instead of the 'staging' table.

    Current flow:

    Source table --> SSIS Data flow(copy data) --> 'Staging' table --> SQL(data fix) and INSERT INTO monthly table

    Proposed flow:

    Source table --> SSIS Data flow(data fix) --> monthly table

    My concerns are:

    1) Doing it this way will make errors or lost data harder to detect.

    2) Problems will take longer to fix.

    3) Any new 'fixes' will take longer, or be harder, to apply.

    4) The process may take longer to run.

    Is the issue of the amount of logging a huge problem from a DBA standpoint? (I'm not a DBA.) I would think you would want this type of data logged. Why wouldn't the data flow task log this? How are others handling this type of scenario? Any other ways to handle this?

    Create a "staging" database in the SIMPLE recovery model and tell the DBAs not to worry about backing it up.
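
    Something along these lines (the database name is just a placeholder):

    -- Staging database kept in SIMPLE recovery so the log truncates on
    -- checkpoint and no log backups are needed for it.
    CREATE DATABASE StagingWork;
    ALTER DATABASE StagingWork SET RECOVERY SIMPLE;

    -- Land and clean the raw rows here, then do the final INSERT into the
    -- monthly history table in the production database.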

    As a bit of a sidebar, as a DBA I frown heavily on anyone that inserts from a file straight into the final table. There should be (must be) a data validation step in a staging table somewhere, so far as I'm concerned.

    As for a "large amount of logging", ostensibly the INSERT INTO is only doing a million rows and the rows have already been cleaned'n'gleened. If the database can't handle such a relatively small number of rows, there's something seriously wrong with the box or the table. For example, if the table has 137 columns and there's an index on each column, or several wide covering indexes on many columns, that could be a partial cause of all the "extra" logging.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for the reply Jeff, I was thinking that putting the data to a 'staging' table made more sense to me too.

    Some stats: the table has about 75 columns. There are 17 default constraints on columns; I did not know this until now. There are 23 indexes set up on this one table, one clustered and the rest nonclustered. We used to have only 9 indexes; the latest DBA added the other 14, and he also set up the new ones and changed the old ones to have 'included columns'. One index has 20 columns included, and at least 6 others have at least 10 columns included. Would this cause extra logging?
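
    For reference, here's the kind of query I can run to list the indexes on the table and their key/included column counts (the table name here is just a placeholder):

    SELECT i.name AS index_name,
           i.type_desc,
           SUM(CASE WHEN ic.is_included_column = 0 THEN 1 ELSE 0 END) AS key_columns,
           SUM(CASE WHEN ic.is_included_column = 1 THEN 1 ELSE 0 END) AS included_columns
    FROM   sys.indexes AS i
           INNER JOIN sys.index_columns AS ic
               ON  ic.object_id = i.object_id
               AND ic.index_id  = i.index_id
    WHERE  i.object_id = OBJECT_ID('dbo.monthly_table')   -- placeholder name
    GROUP BY i.name, i.type_desc
    ORDER BY included_columns DESC;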

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

    I should have mentioned that it's not just the example above; we have a lot more tables where we follow the same steps: extract data from a source data repository, use an SSIS data flow to move the data to a staging table, and then do a SQL INSERT INTO the final tables after the data has gone through any cleaning or modification in the staging table. The concern I'm hearing is the overall size of the transaction log, and that doing the 'insert' into these tables using the SSIS data flow does not add to the transaction log whereas the INSERT INTO SQL statement does.

    Again, my fear is adding development time by using the SSIS data flow, plus additional time to fix any production issues. To me (not a DBA) it would seem easier and more effective to deal with the transaction log issues directly.

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

  • below86 (7/15/2014)


    The concern I'm hearing is the overall size of the transaction log, and that doing the 'insert' into these tables using the SSIS data flow does not add to the transaction log whereas the INSERT INTO SQL statement does.

    Whoever told you this doesn't understand how logging works.

    The destination, since it already contains data, can't get a minimally logged insert (the kind an empty staging table usually gets). That's where the idea comes from that logging is always minimal from SSIS, and it's untrue.

    The INSERT from the SSIS OLE DB Destination is equivalent to an INSERT INTO; the only difference is that the data comes from a memory stream instead of another table. It will still log exactly the same amount.

    You can take the staging table out of the target database, and you can take the load off that server if you do the transformations in SSIS on a different one. But you can't take the logging load off the INSERT unless you swap the recovery model on the database, which I personally don't recommend without fully understanding the impact of switching back and forth between FULL and BULK_LOGGED. Even then, it may not help.

    Your DBAs have to understand that the INSERT is Cost of Business. There's no avoiding it with SSIS, SSRS, SSAS, Quantum theory, Slytherin magic, or the power of faith. You either don't log (AKA: Mess with the recovery model), or you eat the log.
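
    If you want to prove it to whoever told you otherwise, capture log usage before and after each style of load and compare; the standard commands for that (nothing specific to your tables) look like this:

    -- Log file size and percent used for every database on the instance.
    DBCC SQLPERF(LOGSPACE);

    -- Or, per database on SQL Server 2012 and later:
    SELECT total_log_size_in_bytes / 1048576.0 AS log_size_mb,
           used_log_space_in_bytes / 1048576.0 AS log_used_mb
    FROM   sys.dm_db_log_space_usage;   -- current database only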


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Evil Kraig F (7/15/2014)


    below86 (7/15/2014)


    The concern I'm hearing is the overall size of the transaction log, and that doing the 'insert' into these tables using the SSIS data flow does not add to the transaction log whereas the INSERT INTO SQL statement does.

    Whoever told you this doesn't understand how logging works.

    The destination, since it already contains data, can't get a minimally logged insert (the kind an empty staging table usually gets). That's where the idea comes from that logging is always minimal from SSIS, and it's untrue.

    The INSERT from the SSIS OLE DB Destination is equivalent to an INSERT INTO; the only difference is that the data comes from a memory stream instead of another table. It will still log exactly the same amount.

    You can take the staging table out of the target database, and you can take the load off that server if you do the transformations in SSIS on a different one. But you can't take the logging load off the INSERT unless you swap the recovery model on the database, which I personally don't recommend without fully understanding the impact of switching back and forth between FULL and BULK_LOGGED. Even then, it may not help.

    Your DBAs have to understand that the INSERT is Cost of Business. There's no avoiding it with SSIS, SSRS, SSAS, Quantum theory, Slytherin magic, or the power of faith. You either don't log (AKA: Mess with the recovery model), or you eat the log.

    To be sure, and to add to what Craig laid out above: unless you can meet all of the minimal-logging requirements, even shifting to the SIMPLE recovery model won't help. It will still log all of the individual row inserts in the transaction log, even if it drops them from the log after the transaction commits.

    My recommendation is similar to what Craig has recommended... do your ETL in a separate database in the SIMPLE recovery model. Once the data is "right" in a staging table, then do the final INSERT (or whatever) into the final production database.
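
    For reference, those minimal-logging requirements boil down to things like the recovery model (SIMPLE or BULK_LOGGED), a TABLOCK hint on the target, and the state of the target table's data and indexes; whether a particular insert actually qualifies also depends on the SQL Server version. Roughly, with made-up names, the shape of an insert that can qualify is:

    -- Under SIMPLE or BULK_LOGGED recovery, an INSERT...SELECT into a heap
    -- with a TABLOCK hint can be minimally logged; with a populated
    -- clustered index on the target it generally falls back to full logging.
    INSERT INTO staging_db.dbo.monthly_stage WITH (TABLOCK)
    SELECT pNum, pMod
    FROM   source_db.dbo.source_table;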

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Evil Kraig F (7/15/2014)


    Your DBAs have to understand that the INSERT is Cost of Business.

    Currently we don't have one. The one that suggested this quit.

    Thanks Craig and Jeff. You both are confirming the way I thought this worked, I just didn't have a way to prove it.

    One of the other developers on my team said he tested this theory out on a 'test' server: when he ran the INSERT INTO statement he saw the log grow, then he shrank the log and did the SSIS load, with no increase in the log. I have no idea what server he used (I'm pretty sure it was SQL Server 2012), what database or table(s) he used, or how they were set up. I'll see if I can get some more details on what was done.

    I know all the tables we are going to INSERT INTO will already have data in them, no empty tables; the 'staging' tables will be empty. I'm guessing the 'prod' tables all have indexes set up, most with only nonclustered indexes; the staging tables do not.

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

    When he did his test, did he use the same source data? And if he did, did he remove the new data (or changes) from the target table first? I assume your staging-to-real components do some form of delta comparison. If the second load is exactly the same as the first, your delta compare would have nothing to write, thus, no logging.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions | Forum Netiquette
    For index/tuning help, follow these directions. | Tally Tables

    Twitter: @AnyWayDBA

  • Evil Kraig F (7/16/2014)


    When he did his test, did he use the same source data? And if he did, did he remove the new data (or changes) from the target table first? I assume your staging-to-real components do some form of delta comparison. If the second load is exactly the same as the first, your delta compare would have nothing to write, thus, no logging.

    I would assume he cleared the data and then reloaded with the same data, but I will need to talk to him to find out; he's not here today.

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

    I got a little more detail today on how this was tested. The first step was to create a new database and a new empty table in that database. Then SQL was run to show the amount of space the new table took up and the size of the log. Next, a simple SQL statement was executed from an SSIS package to do an INSERT INTO the table created earlier, adding approx. 3 million rows. The SQL to show the file and log sizes was run again, and the log file had increased a lot, say from 1024 to 8500 (I don't remember the exact number). SQL was then run to drop the new table and database, the create database and table steps were run again, and the size SQL for the table and log files showed it was back down to 1024. Then, from the same SSIS package, a data flow task was run that used the same SQL to select the data but 'inserted' the data into the same table through the data flow. After the data flow task ran, the SQL to get the size of the table and log files was run again; the log file was unchanged this time, or just up to 1025 I think.

    So if it didn't log this information here, meaning on this database with this table, did it log it somewhere else? Or is it true then that no logging takes place?

    I'm going to try and get the SQL of everything that was ran and I can post it later.

    Thanks for your help.

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

  • From MSDN,

    "For a database under the full recovery model, all row-insert operations that are performed by bulk import are fully logged in the transaction log. "

    http://msdn.microsoft.com/en-us/library/ms190422(v=sql.105).aspx

    A bulk insert is what the data flow's OLE DB Destination uses when the fast-load option is enabled, if that is what your team is using.
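
    Since that quote is specifically about the full recovery model, it is worth confirming which recovery model each database involved is actually using:

    -- Shows the recovery model for every database on the instance.
    SELECT name, recovery_model_desc
    FROM   sys.databases
    ORDER BY name;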

    ----------------------------------------------------

  • I'm having issues again posting SQL code here.

    But thanks for the link. This tells me that for most of what we do some logging is still going to occur, since the tables have existing data and indexes, and that full logging will occur where we insert into a table with a clustered index.

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

    SQL to create the database and table:

    DROP DATABASE db_work;

    CREATE DATABASE db_work;

    USE db_work;

    CREATE TABLE target_table
    (
        pNum CHAR(100),
        pMod INT
    );

    The Insert Into statement:

    INSERT INTO db_work.dbo.target_table
    SELECT pNum,
           pMod
    FROM   source_db.dbo.source_table;

    The log file size check SQL:

    SELECT GETDATE() AS checkDate,
           a.name,
           a.database_id,
           a.create_date,
           b.type_desc AS file_type,
           b.name AS db_file_name,
           b.size AS db_file_size   -- size is reported in 8-KB pages
    FROM   sys.databases AS a
           INNER JOIN sys.master_files AS b
               ON a.database_id = b.database_id
    WHERE  a.name = 'db_work';

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.

  • Results after creating table and database:

    checkDate                name     database_id  create_date              file_type  db_file_name  db_file_size
    2014-11-19 09:42:22.460  db_work  115          2014-11-19 09:41:46.597  ROWS       db_work       4096
    2014-11-19 09:42:22.460  db_work  115          2014-11-19 09:41:46.597  LOG        db_work_log   1024

    Results after insert into:

    checkDate                name     database_id  create_date              file_type  db_file_name  db_file_size
    2014-11-19 09:43:58.207  db_work  115          2014-11-19 09:41:46.597  ROWS       db_work       19456
    2014-11-19 09:43:58.207  db_work  115          2014-11-19 09:41:46.597  LOG        db_work_log   76216

    I then deleted the table and database, then recreated them. Ran the SQL again to get log size, same results as first results shown above.

    I ran this same log size SQL multiple times while the data flow task was running and I did not see the db_work_log size grow, stayed at 1024.

    Here are the results after the data flow:

    checkDate                name     database_id  create_date              file_type  db_file_name  db_file_size
    2014-11-19 09:49:12.830  db_work  115          2014-11-19 09:45:01.010  ROWS       db_work       17280
    2014-11-19 09:49:12.830  db_work  115          2014-11-19 09:45:01.010  LOG        db_work_log   1024

    Just wanted to give some examples of what was done and the results.

    My conclusion, based on what others have said and on the Microsoft documentation, is this: if the table is empty with no indexes, then using the data flow will not add to the log size, but the INSERT INTO will. If the table has data or has indexes, then some logging will occur, and if there is a clustered index, then full logging will occur.

    I ran the data flow task again without clearing the data; the log file did not grow, still 1024. I thought it would.

    I added a nonclustered index and did not run any load; the log file grew to 3936.

    Without clearing the data, I ran the data flow again; the log file size is still 3936.

    Deleted the index; log size 3936.

    Added a clustered index; log size now 4776.

    Ran the data flow again, data not cleared beforehand; log size still 4776. I was expecting this to grow and I don't understand why it did not. Anyone have any ideas as to why it didn't? Based on what was in the link to Microsoft I was expecting it to still grow. Would it not grow because I just kept adding the same data? I never changed the source table in any of the tests.
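
    One other thing I realized while testing (please correct me if I'm off base): the size query above reports the allocated log file size, which only changes when the file autogrows or shrinks, so inserts could still be fully logged without the file growing if there was already free space inside the log. A more direct (if undocumented) way to see what actually got logged after a load is to count the log records, for example:

    -- Run in the test database right after a load; fn_dblog is undocumented
    -- but handy on a test box for counting log records by operation.
    SELECT Operation, Context, COUNT(*) AS log_record_count
    FROM   sys.fn_dblog(NULL, NULL)
    GROUP BY Operation, Context
    ORDER BY log_record_count DESC;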

    -------------------------------------------------------------
    we travel not to escape life but for life not to escape us
    Don't fear failure, fear regret.
