Inserting rows one at a time instead of in batches - RBAR is faster???

  • I was speaking with a work colleague and the topic of inserts came up.

    I had suggested doing the inserts in batches. We were taking data from an OLTP table on prod and copying it to a temp working/staging table for another process in another database. (Not SSIS or BCP, just a stored procedure that was doing some work)

    Anyway, the colleague said, "No, do it row by row, don't grab batches of 10K or so rows, row by row is faster".

    I explained that I had never experienced that, and they described various scenarios from past systems where they had found it to be the case.

    Now, the batch insert was faster for us in this case, but I'm trying to imagine where row by row could possibly be faster?

    Maybe updates and inserts... MAYBE because it might lock out parts of a busy table... but row by row inserts?

    Any thoughts on what would make RBAR faster than batches?

    Thoughts I had:

    1. extreme pressure on the log file so it has trouble logging 10K rows but is able to squeeze in a few rows here or there between other operations?

    2. very active table where any sort of lock is causing issues or the inserts are being inserted "in the middle" instead of "at the end" (say new rows for last quarter not new orders for today) so locking or blocking is occurring?

    3. Heavy indexing where the new rows are causing index splits or maybe page splits???

    4. Database is autogrowing and batch inserts are triggering autogrows more rapidly (maybe the setting is something odd, like a 1 MB autogrow??), so they can do 100 single rows before an autogrow, whereas a single 10K batch forces many autogrow events???

    Really can't imagine why that would ever be the case?

    I always operated under the assumption that RBAR is bad and that batches or set-based operations are always better (save for massive data loads, in which case you batch to avoid locking out an entire table or blocking other OLTP operations, since a single insert of all million/billion rows takes a long time, etc.).
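
    For concreteness, here's a minimal T-SQL sketch of the two patterns under discussion; all object and column names (dbo.SourceOrders, dbo.StagingOrders) are hypothetical:

    -- Set-based: one INSERT...SELECT moves all the rows in a single statement.
    INSERT INTO dbo.StagingOrders (OrderID, OrderDate, Amount)
    SELECT OrderID, OrderDate, Amount
    FROM dbo.SourceOrders;

    -- RBAR: a cursor issues one INSERT (one statement, one autocommit
    -- transaction, one log flush) per row.
    DECLARE @OrderID int, @OrderDate datetime, @Amount money;

    DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
        SELECT OrderID, OrderDate, Amount FROM dbo.SourceOrders;

    OPEN cur;
    FETCH NEXT FROM cur INTO @OrderID, @OrderDate, @Amount;

    WHILE @@FETCH_STATUS = 0
    BEGIN
        INSERT INTO dbo.StagingOrders (OrderID, OrderDate, Amount)
        VALUES (@OrderID, @OrderDate, @Amount);

        FETCH NEXT FROM cur INTO @OrderID, @OrderDate, @Amount;
    END;

    CLOSE cur;
    DEALLOCATE cur;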

  • Batching DML operations will always be faster than doing it one row at a time. However, keep in mind that a batch is a single transaction, and splitting up a large batch (e.g. 1,000,000+ rows) into smaller batches (e.g. 100,000 rows) will typically improve performance greatly by minimizing pressure on the transaction log.
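
    As a rough sketch of that middle ground (object names and the ascending key are hypothetical), each chunk below commits as its own autocommit transaction, so the log gets a chance to clear between chunks:

    DECLARE @BatchSize int = 100000,
            @LastID    bigint = 0,
            @RowCnt    int = 1;

    WHILE @RowCnt > 0
    BEGIN
        -- Each INSERT autocommits, keeping individual transactions small.
        INSERT INTO dbo.Target (ID, Col1, Col2)
        SELECT TOP (@BatchSize) ID, Col1, Col2
        FROM dbo.Source
        WHERE ID > @LastID
        ORDER BY ID;   -- the ascending key keeps the chunks deterministic

        SET @RowCnt = @@ROWCOUNT;

        IF @RowCnt > 0
            SELECT @LastID = MAX(ID) FROM dbo.Target;  -- assumes Target is loaded only by this loop
    END;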

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (9/23/2015)


    Batching DML operations will always be faster than doing it one row at a time. However, keep in mind that a batch is a single transaction, and splitting up a large batch (e.g. 1,000,000+ rows) into smaller batches (e.g. 100,000 rows) will typically improve performance greatly by minimizing pressure on the transaction log.

    Which makes good sense; I can't think of an example of the opposite, where a single row done 10,000 times is faster than one batch of 10,000 (or whatever batch size).

  • Maxer (9/23/2015)



    Which makes good sense; I can't think of an example of the opposite, where a single row done 10,000 times is faster than one batch of 10,000 (or whatever batch size).

    If you're dealing with an appropriately well-tuned server, you shouldn't find those often at all. I have seen a few rare cases where the batch version might trigger auto-growth (e.g. a large batch that has to dump a bunch of stuff into tempDB), which makes it *look* like the batch version is slower.
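
    When that's the culprit, the usual fix is to pre-size the files and set a fixed growth increment instead of relying on tiny default autogrow steps; a sketch, with hypothetical database and logical file names:

    -- Pre-size the log and use a sane growth increment so one large batch
    -- doesn't stall on a string of small autogrow events.
    ALTER DATABASE StagingDB
        MODIFY FILE (NAME = StagingDB_log, SIZE = 4GB, FILEGROWTH = 512MB);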

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part... unless you're my manager... or a director and above... or a really loud-spoken end-user... All right - what was my emergency again?

  • Matt Miller (#4) (9/23/2015)



    If you're dealing with an appropriately well-tuned server, you shouldn't find those often at all. I have seen a few rare cases where the batch version might trigger auto-growth (e.g. a large batch that has to dump a bunch of stuff into tempDB), which makes it *look* like the batch version is slower.

    Actually mega inserts aren't anywhere near as problematic as mega updates and deletes. If you kick off a script on Friday afternoon to delete 100,000,000 rows from a billion row table, then by Monday morning your Inbox will be full of server outage alerts, the script will still be running (or rolling back), and you'll be crying under your desk.
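
    Which is exactly why big deletes usually get chunked. A minimal sketch of the standard pattern (the table name and predicate are hypothetical):

    -- Delete in 100K-row chunks; each chunk is a short autocommit
    -- transaction, so the log can clear and blocking stays brief.
    WHILE 1 = 1
    BEGIN
        DELETE TOP (100000)
        FROM dbo.BigTable
        WHERE OrderDate < '20100101';

        IF @@ROWCOUNT = 0 BREAK;
    END;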

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Maxer (9/23/2015)


    I had suggested doing the inserts in batches. [...] Any thoughts on what would make RBAR faster than batches? [...] Really can't imagine why that would ever be the case?

    There IS one place where it looks like RBAR is faster than "Set Based", and that's when you compare runs of Recursive CTEs with well-written WHILE loops that use an explicit transaction around the WHILE loop. Keep in mind that Recursive CTEs that "count" or do something to each and every row aren't actually Set Based. They just look like it because there's no explicit WHILE.
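
    A hedged illustration of that comparison, counting to a million both ways (temp table names are made up; MAXRECURSION 0 lifts the default 100-level cap):

    CREATE TABLE #Numbers  (N int NOT NULL);
    CREATE TABLE #Numbers2 (N int NOT NULL);

    -- "Set based" in appearance only: each row comes from another
    -- recursion level, i.e. hidden RBAR.
    WITH Counter (N) AS
    (
        SELECT 1
        UNION ALL
        SELECT N + 1 FROM Counter WHERE N < 1000000
    )
    INSERT INTO #Numbers (N)
    SELECT N FROM Counter
    OPTION (MAXRECURSION 0);

    -- Explicit RBAR, but one transaction means one log flush at COMMIT
    -- instead of a million of them, which is why it often wins the race.
    DECLARE @N int = 1;
    BEGIN TRANSACTION;
    WHILE @N <= 1000000
    BEGIN
        INSERT INTO #Numbers2 (N) VALUES (@N);
        SET @N += 1;
    END;
    COMMIT;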

    Another area might be where TRIANGULAR joins are involved. For example, a WHILE loop will blow the doors off of most running-balance attempts that use TRIANGULAR joins, giving, once again, the illusion that RBAR is faster than true Set Based code.
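
    For anyone who hasn't run into the term, a triangular-join running balance looks something like the first query below (dbo.Ledger is hypothetical); each row re-aggregates every prior row, so the work grows roughly as N squared over 2. On SQL Server 2012 and later, a windowed SUM is the genuinely set-based fix:

    -- Triangular join: roughly N*N/2 row touches for N rows.
    SELECT  a.TransactionID,
            a.Amount,
            RunningBalance = SUM(b.Amount)
    FROM    dbo.Ledger a
    JOIN    dbo.Ledger b ON b.TransactionID <= a.TransactionID
    GROUP BY a.TransactionID, a.Amount;

    -- True set-based running balance (SQL Server 2012+): one pass.
    SELECT  TransactionID,
            Amount,
            RunningBalance = SUM(Amount) OVER (ORDER BY TransactionID
                                               ROWS UNBOUNDED PRECEDING)
    FROM dbo.Ledger;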

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Piling on a bit: there is no shortage of examples where set-based batch operations are faster than row-by-row operations, given that "everything" is fine within the system. In the much rarer cases where the procedural/row-by-row methods are faster, one normally finds anomalies in the system, such as missing indexes, a low parallelism threshold, etc.

    😎

  • The proof is in the pudding, always. Try the set-based version and measure. Then try the RBAR version and measure. Especially when the advice sounds odd.
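
    For anyone following along, a quick way to do that measuring (wrap each candidate the same way and compare the numbers):

    -- Elapsed time and I/O per statement show up in the Messages tab.
    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;

    -- ... run the set-based version here, then the RBAR version ...

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;

    -- Or bracket each candidate with a timer:
    DECLARE @t0 datetime2 = SYSDATETIME();
    -- ... candidate code ...
    SELECT DATEDIFF(millisecond, @t0, SYSDATETIME()) AS elapsed_ms;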

    Catch-all queries done right
    Gail Shaw's Performance Blog

  • MadAdmin (12/5/2015)


    The proof is in the pudding, always. Try the set-based version and measure. Then try the RBAR version and measure. Especially when the advice sounds odd.

    Make the person who made the claim prove it. It saves time on your part and gives you a chance to see what they think is good code. That way you're also testing the person and not just the method.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
