A few reflections on RBAR by a weary application developer

  • rstone (12/11/2013)


    Another issue I've seen is RBR -> RBAR: if one step someplace genuinely requires processing a row at a time, then the whole job ends up being done that way. For example, sending email. All the pre-processing can be set-based until it's time to send the mail.

    Correct. I actually used such an example in that long writeup I made above.
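
    A minimal sketch of that split, with hypothetical queue, table, and mail-profile names: everything up to the send is one set-based pass, and the loop exists only for the per-row call that Database Mail requires.

        -- Build the outbound queue in a single set-based pass (names are hypothetical).
        INSERT INTO dbo.EmailQueue (Recipient, Subject, Body)
        SELECT c.Email,
               'Your monthly statement',
               'Balance due: ' + CONVERT(varchar(20), c.Balance)
          FROM dbo.Customer c
         WHERE c.StatementDue = 1;

        -- The send itself is row-by-row by nature, so loop only here.
        DECLARE @Recipient varchar(256), @Subject varchar(256), @Body varchar(max);
        DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
            SELECT Recipient, Subject, Body FROM dbo.EmailQueue;
        OPEN cur;
        FETCH NEXT FROM cur INTO @Recipient, @Subject, @Body;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            EXEC msdb.dbo.sp_send_dbmail
                 @profile_name = 'DefaultProfile',   -- hypothetical mail profile
                 @recipients   = @Recipient,
                 @subject      = @Subject,
                 @body         = @Body;
            FETCH NEXT FROM cur INTO @Recipient, @Subject, @Body;
        END;
        CLOSE cur;
        DEALLOCATE cur;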

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Marius.D (12/11/2013)


    Someone said something like "Always do what makes sense". I vote for that!

    It never stops amazing me how "solutions" are discussed as if they would work universally. Most of them don't! Most of the time, nobody even asks whether we're talking about an OLTP environment or a data warehouse.

    RBAR SSIS may work for you, or not. Let's say you have a data warehouse that you update at night and that is only used during the day for custom reporting. What's wrong with SSIS (RBAR) during the night in this case? Nothing. Quite the opposite: when it's done right (as part of a framework) and you come in in the morning to find something went wrong, you know what the error was (it's in your email), you know how many rows were written before the abort, and, after fixing it, restarting the package will automatically delete the rows written before the error, start over, and run the subsequent packages that are waiting.

    What's wrong with RBAR at night? The answer is that it takes extra time and resources to execute. Nightly runs have to finish before business opens for the day, and there's only so much time and so many resources available.

    As for the suggestion that RBAR is the only way to know what went wrong at the row level, nothing could be further from the truth. Even BULK INSERT has switches that can be set to allow an unlimited number of errors and to sequester the bad rows, with a column-by-column explanation of why each row failed, while still allowing the rest of the run to complete. And SSRS, SSIS, and the other four-letter words in SQL Server don't need to be RBAR just to report what the errors are.
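
    For instance, a sketch of those switches (the table name and file paths are hypothetical; see BULK INSERT in Books Online for the full list of options):

        -- Raise the error ceiling so bad rows don't stop the run, and sequester
        -- the rejects. ERRORFILE writes the bad rows to one file and a companion
        -- .Error.Txt file that explains, column by column, why each row failed.
        BULK INSERT dbo.ImportStaging              -- hypothetical target table
        FROM 'C:\Import\DailyFeed.csv'             -- hypothetical data file
        WITH (
              FIELDTERMINATOR = ',',
              ROWTERMINATOR   = '\n',
              MAXERRORS       = 100000,            -- high ceiling, effectively unlimited here
              ERRORFILE       = 'C:\Import\DailyFeed.rejects'
             );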

    To tell the truth, though, there should be zero failures even when the data is bad, because the data should be imported into a staging table, prevalidated, and marked as either good or bad along with the reason. A failure shouldn't ever find you. You should find the failure before it happens, especially when it comes to the act of loading data.
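
    A minimal sketch of that prevalidation pass, with hypothetical staging table, columns, and rules (TRY_CONVERT needs SQL Server 2012 or later):

        -- Assume rows land in dbo.ImportStaging with RowStatus/RowError columns.
        -- Apply one validation rule per set-based pass and record the reason.
        UPDATE s
           SET RowStatus = 'BAD',
               RowError  = 'Missing account number'
          FROM dbo.ImportStaging s
         WHERE s.AccountNumber IS NULL;

        UPDATE s
           SET RowStatus = 'BAD',
               RowError  = 'Amount is not numeric'
          FROM dbo.ImportStaging s
         WHERE s.RowStatus IS NULL
           AND TRY_CONVERT(decimal(18,2), s.Amount) IS NULL;

        -- Whatever survives every rule is safe to load.
        UPDATE s
           SET RowStatus = 'GOOD'
          FROM dbo.ImportStaging s
         WHERE s.RowStatus IS NULL;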

    This is all a part of what I've been talking about. People don't know what they don't know, and they resort to RBAR because that's what they know. People in the business of doing the things you mention should buckle down and learn the tools so that their processes run in a timely and resource-conscious manner. And you can't make a process run 12,000% faster by throwing hardware at it. It takes knowledge and practice. I recommend people spend some time getting a whole lot more of both before succumbing to RBAR and calling it "done". Without such knowledge, you don't actually have the skills to determine what actually "makes sense". 😉

    --Jeff Moden



  • Jeff Moden (12/11/2013)


    What's wrong with RBAR at night? The answer is that it takes extra time and resources to execute. Nightly runs have to finish before business opens for the day, and there's only so much time and so many resources available.

    As for the suggestion that RBAR is the only way to know what went wrong at the row level, nothing could be further from the truth. Even BULK INSERT has switches that can be set to allow an unlimited number of errors and to sequester the bad rows, with a column-by-column explanation of why each row failed, while still allowing the rest of the run to complete. ...

    I didn't mean to imply that you should use SSIS because it's the only way to find errors; I meant that SSIS has very good ways to deal with errors. In our case, all the processing takes about 1.5 hours, so there's plenty of time until the morning. I only see errors once in a great while, when someone has changed something (DDL, unfortunately, yes, it happens). Totally agree with everything you said, and very nice post. Thank you!

  • Jeff asks, "What's wrong with RBAR at night?" An even shorter answer is 24-hour processing. What night? My night is one client's early morning.

    ATB, Charles Kincaid

  • Part of the issue is definitely people falling back on what they know. It's conceptually easy for someone coming from writing C, for instance, to find the cursor, say "hey, a looping structure, I understand that", and apply it where it shouldn't be.

    And some are loath to change that once it's set in. That's when I have to tell people, "Just because you can, doesn't mean that you should."
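
    The classic shape of that habit, next to its set-based replacement (table and columns are hypothetical; both are O(N), but the loop pays statement and fetch overhead once per row):

        -- The looping structure a C programmer reaches for: one UPDATE per row.
        DECLARE @ID int;
        DECLARE cur CURSOR LOCAL FAST_FORWARD FOR SELECT ID FROM dbo.Orders;
        OPEN cur;
        FETCH NEXT FROM cur INTO @ID;
        WHILE @@FETCH_STATUS = 0
        BEGIN
            UPDATE dbo.Orders SET Total = Quantity * Price WHERE ID = @ID;
            FETCH NEXT FROM cur INTO @ID;
        END;
        CLOSE cur;
        DEALLOCATE cur;

        -- The set-based equivalent: one statement, one pass.
        UPDATE dbo.Orders SET Total = Quantity * Price;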

    And I think it should be pronounced AhrBahr. 😉

  • john.moreno (12/9/2013)


    RBAR is slow for the same reason a bubble sort is slow -- it does exponentially more work than is necessary.

    PHYData DBA (12/9/2013)


    Do this in any language or program and you will reduce the optimal execution rate exponentially according to the size of the dataset being inspected.

    PHYData DBA (12/9/2013)


    The idea is that any DataSet (notice the word SET) update in any language can be optimized with Set based code and slowed down exponentially with RBAR code.

    Would you please stop abusing the word "exponential"?

    It really wrecks the credibility of your otherwise valid comments when you get this concept so fundamentally wrong.

    The fact is that updating N rows is an O(N) operation using either RBAR or set-based techniques.

    This means they differ only by a constant factor.

    Yes, a rather large constant factor - but still constant.

    (It might help if the difference truly were exponential, because then RBAR solutions would start falling flat long before 100 rows and there would be no argument.)

  • Jeff Moden (12/11/2013)


    To tell the truth, though, there should be zero failures even when the data is bad, because the data should be imported into a staging table, prevalidated, and marked as either good or bad along with the reason. A failure shouldn't ever find you. You should find the failure before it happens, especially when it comes to the act of loading data.

    One import error I think we may have experienced occasionally is the insertion of a row whose primary key duplicates one already in a database table. This is not a case of bad data if you look at the import in isolation (it would pass validation), and theoretically, even if you tested the import data in a staging table, you might still get an error if a row with a duplicate key was entered in your database before the import run completed.

    Identifying these duplicate rows is therefore a requirement, and your later post on bulk insert switches and error file output should be most helpful (I must have been searching on the wrong terms in Books Online).

  • craig 81366 (12/13/2013)


    The fact is that updating N rows is an O(N) operation using either RBAR or set-based techniques. This means they differ only by a constant factor. ...

    (It might help if the difference truly were exponential, because then RBAR solutions would start falling flat long before 100 rows and there would be no argument.)

    With "Hidden RBAR", such as a Triangular Join in an aggregated correlated subquery to calculate running totals, the difference actually IS exponential. Please see the following article on the subject. It's one of the first I ever wrote.

    http://www.sqlservercentral.com/articles/T-SQL/61539/

    That being said, I do agree with you. RBAR is normally "just" O(N).
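
    To make the shape concrete (hypothetical ledger table): the correlated subquery below rescans i rows for row i, so it touches 1 + 2 + ... + N = N(N+1)/2 rows in total -- O(N^2), not exponential, but far worse than one pass. The windowed form (SQL Server 2012 and later) reads the table once.

        -- Hidden RBAR: a triangular join for running totals.
        SELECT a.TranID,
               (SELECT SUM(b.Amount)
                  FROM dbo.Ledger b
                 WHERE b.TranID <= a.TranID) AS RunningTotal
          FROM dbo.Ledger a;

        -- Set-based running total: a single ordered pass.
        SELECT TranID,
               SUM(Amount) OVER (ORDER BY TranID
                                 ROWS UNBOUNDED PRECEDING) AS RunningTotal
          FROM dbo.Ledger;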

    --Jeff Moden



  • Tavis Reddick (1/9/2014)


    ... Identifying these duplicate rows is therefore a requirement, and your later post on bulk insert switches and error file output should be most helpful.

    Thanks for the feedback, Tavis.

    --Jeff Moden



  • I see that you do not have to do RBAR to get row error details for BULK INSERT from a data file.

    Is there a set-based/declarative/T-SQL way of getting row error details for an ordinary INSERT? In other words, when moving data within SQL Server?

  • Tavis Reddick (1/9/2014)

    One import error I think we may have experienced occasionally is the insertion of a row whose primary key duplicates one already in a database table. ... Even if you tested the import data in a staging table, you might still get an error if a row with a duplicate key was entered in your database before the import run completed.

    If your data import procedure locks the target table just prior to checking the staging table for keyed rows already in the target, and holds the lock during the import of the "good" rows, the error won't occur. Of course, that comes at the expense of reduced concurrency, which arguably isn't going to be that great anyway if you're inserting a large number of rows.
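
    A sketch of that sequence, with hypothetical names; the TABLOCKX and HOLDLOCK hints together hold an exclusive table lock until the transaction ends:

        BEGIN TRANSACTION;

        -- Take and hold an exclusive lock on the target so nobody can slip a
        -- duplicate key in between the check and the insert.
        SELECT TOP (1) 1
          FROM dbo.Target WITH (TABLOCKX, HOLDLOCK);

        -- Flag staged rows whose keys already exist in the target.
        UPDATE s
           SET RowStatus = 'DUPLICATE'
          FROM dbo.ImportStaging s
         WHERE EXISTS (SELECT 1
                         FROM dbo.Target t
                        WHERE t.BusinessKey = s.BusinessKey);

        -- Load only the clean rows; the lock is released at COMMIT.
        INSERT INTO dbo.Target (BusinessKey, Amount)
        SELECT s.BusinessKey, s.Amount
          FROM dbo.ImportStaging s
         WHERE s.RowStatus IS NULL;

        COMMIT TRANSACTION;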

  • Tavis Reddick (1/9/2014)


    Is there a set-based/declarative/T-SQL way of getting row error details for an ordinary INSERT? In other words, when moving data within SQL Server?

    Kind of... if you're just trying to prevent duplicates based on some combination of columns, you could create a unique index on those columns with the IGNORE_DUP_KEY option turned on.
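
    A sketch of that option, with a hypothetical table and key:

        -- With IGNORE_DUP_KEY on, rows that would duplicate the key are discarded
        -- with the warning "Duplicate key was ignored." instead of failing the INSERT.
        CREATE UNIQUE NONCLUSTERED INDEX UQ_Target_BusinessKey
            ON dbo.Target (BusinessKey)
          WITH (IGNORE_DUP_KEY = ON);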

    Other than that, you could use the method that I use. With very rare exception, I load data into a staging table first, validate the rows, mark them for INSERT, UPDATE, error, etc., and then move the data to the final table with the appropriate WHERE clause to catch any last-second changes to the target table.

    --Jeff Moden


