A Case FOR Cursors...

  • g.britton (6/1/2015)


    I think you've proved the main point. Using a CURSOR is usually the wrong way to go. Now, if the goal is to decrease the impact on other users, that's where SNAPSHOT_ISOLATION can help.

    I'm afraid you've lost me. I went back to this test, altered the database to allow snapshot isolation, and did this:

    set transaction isolation level snapshot

    update dbo.foobar set foo = foo - 1

    It took longer and still gave me a pretty big blip. So... how could I do the update set-based in a way that would not impact my example user queries so badly?

    Or did you mean that my example user queries should have been using snapshot isolation? For sake of argument, what if they were coming from a third-party app I couldn't control?

    Or did you mean something else entirely?

  • ericksrm (6/1/2015)


    Alan.B (6/1/2015)


    Amen. It always comes down to performance.

    Alan, totally agree with you that "it always comes down to performance". I threw together a quick and dirty example to illustrate a case where an iterative method is faster.

    First create the working dataset:

    CREATE TABLE dbo.TestData ( Id BIGINT IDENTITY(1,1) PRIMARY KEY , TestCol BIT )

    INSERT dbo.TestData ( TestCol ) SELECT 0

    WHILE ( SELECT COUNT(1) FROM dbo.TestData ) <= 500000000

    INSERT dbo.TestData (TestCol) SELECT TestCol FROM dbo.TestData

    CREATE TABLE dbo.TestData_History ( StartTime DATETIME2, EndTime DATETIME2, RowsUpdated INT )

    (Originally I had aspirations to wait for 500-million records... but halted this insert script after about 68-million)

    Now let's perform one massive update and track the execution time:

    DECLARE @newVal BIT = 1, @startTime DATETIME2

    SET @startTime = GETDATE()

    UPDATE TestData SET TestCol = @newVal

    INSERT dbo.TestData_History ( StartTime, EndTime, RowsUpdated ) SELECT @startTime, GETDATE(), @@ROWCOUNT

    Locally this 68-million-record update took 153 seconds (just over 2.5 minutes).

    Now let's perform a series of smaller updates, each limited to 10,000 records:

    DECLARE @newVal BIT = 0, @startTime DATETIME2, @min INT, @max INT, @maxID INT

    SELECT @startTime = GETDATE(), @min = 1, @max = 10001, @maxID = MAX(Id) FROM dbo.TestData

    WHILE (@min < @maxID)

    BEGIN

    UPDATE dbo.TestData SET TestCol = @newVal WHERE Id >= @min AND Id < @max

    INSERT dbo.TestData_History ( StartTime, EndTime, RowsUpdated ) SELECT @startTime, GETDATE(), @@ROWCOUNT

    SELECT @min = @max

    ,@max = @max + 10000

    ,@startTime = GETDATE()

    END

    The entire process took 113 seconds (LESS than 2 minutes).

    But there are more important things to note. How do you define "performance"? Working with very-large OLTP databases, it is even more important to me that (1) end-users are not impacted by blocking, and (2) downstream systems, including transactional replication, are not affected. Performing the single large update resulted in my TestData table blocking all other update operations for over 2 minutes. Even scarier, this single update caused my transaction log to swell to over 21 GB, which is going to cause massive issues for the replication architecture downstream. Basically: (1) I'll be lucky if I have enough disk to cover my 21-GB swell, (2) I'll be lucky if I have enough disk to cover the swell that the Distribution database is about to be subjected to, (3) I'll be lucky if replication doesn't fall so far behind that I need to reinitialize my subscribers (thus bringing down my read-only copies), (4) I'll be lucky if my customers are not impacted by my Publisher database, which is completely blocked, AND my Subscriber databases, which are out of sync, and (5) I'll be lucky if I have a job tomorrow!!!

    And this was a simple update statement. Can you imagine if my update involved some sort of GROUP BY, which would then throw tempdb into the mix? 🙂

    On the other hand, updating in small batches meant that each update of 10,000 records finished in 10-20 milliseconds. This allows concurrent transactions to execute in between the smaller updates. My transaction log (after shrinking it back to the original 2-GB once the massive update finished) did not have to grow a single time. The log reader job for transactional replication is happy because it works better with smaller transactions. And maybe I can take a vacation this summer because I'm still employed!

    In my opinion, "blanket statements" suck. They are never 100% applicable to everything. Cursors are not THE DEVIL. They are improperly used in many cases, but sometimes (especially in very-large OLTP datasets) they are the best way to go. I wish I only had to worry about 10-million record tables........ unfortunately many of our tables are around the 500-million mark, and I find more and more proper (and necessary) uses of cursors as the size continues to grow.

    Hope you all enjoy this example... shoot me a message if you have any suggestions.

    Thank you for this example, it is very good and is actually how I would handle a large update or insert, including using a variable or parameter for the batch size (as the best batch size may vary depending on the data you are dealing with and the system you are using). Your example clearly illustrates where procedural code in SQL Server can be used to increase performance.
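
    For instance, something along these lines (just a rough sketch of the idea, not ericksrm's exact code; the @BatchSize variable, the TOP-based loop and the TestCol filter are my own illustration):

    DECLARE @BatchSize INT = 10000, @newVal BIT = 0

    WHILE 1 = 1
    BEGIN
        -- TOP caps each statement at @BatchSize rows; the WHERE clause keeps us
        -- from re-touching rows that already hold the new value.
        UPDATE TOP (@BatchSize) dbo.TestData
        SET    TestCol = @newVal
        WHERE  TestCol <> @newVal

        IF @@ROWCOUNT = 0 BREAK  -- nothing left to update
    END

    One trade-off versus the Id-range loop above: TOP without an ORDER BY picks an arbitrary batch of qualifying rows each time, which is fine for a blanket update like this but not if the batches must be processed in a particular order.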

    I don't think that we are talking about the same thing here, though. By iterative or RBAR I am talking about processing each row, one at a time. Note my example a few comments back showing how to create test data.

    I wanted to reply to the rest of your comment but I'm under the gun here 😉

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Have you enabled snapshot isolation on the db?

    ALTER DATABASE MyDatabase

    SET ALLOW_SNAPSHOT_ISOLATION ON

    more here: Snapshot Isolation in SQL Server

    Oh, I see you are!

    Anyway, what this does is allow continued good performance for other users, since their reads are not blocked by locks on the table you are updating.
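
    For example, a reader session can opt in like this (a rough sketch; dbo.foobar is from Brian's test script, the literal value is arbitrary, and the READ_COMMITTED_SNAPSHOT line is one option for the case where the readers are a third-party app you can't change):

    -- Reader session: under SNAPSHOT isolation this SELECT reads the last committed
    -- row versions (kept in tempdb) instead of waiting on the updater's locks.
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

    BEGIN TRAN;
        SELECT COUNT(*) FROM dbo.foobar WHERE foo = 123456;  -- not blocked by the big UPDATE
    COMMIT;

    -- If you can't touch the readers' code at all, READ_COMMITTED_SNAPSHOT switches the
    -- default READ COMMITTED behaviour to row versioning for the whole database
    -- (the ALTER needs no other active connections, or WITH ROLLBACK IMMEDIATE).
    ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON;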

    Gerald Britton, Pluralsight courses

  • Thank you for writing this. I used to ask a cursor question when interviewing candidates and I got the same answer every time: “you shouldn’t use cursors, they are bad”. I found it to be extremely frustrating. Developers use that as an excuse to never learn how to write cursors or when to use them. I have seen people write very unmaintainable, convoluted code (that might not always work correctly), or resort to manual processing in places where a cursor is an easy, clean and elegant solution.

  • Lenochka (6/1/2015)


    Thank you for writing this. I used to ask a cursor question when interviewing candidates and I got the same answer every time: “you shouldn’t use cursors, they are bad”. I found it to be extremely frustrating. Developers use that as an excuse to never learn how to write cursors or when to use them. I have seen people write very unmaintainable, convoluted code (that might not always work correctly), or resort to manual processing in places where a cursor is an easy, clean and elegant solution.

    Sounds like the question needs to be reworded rather than thrown away. Turn it into a discussion question about the appropriate times to use a cursor. If the candidate says they are never appropriate, that is an indication they might not be very strong in database maintenance. That doesn't mean the candidate should be excluded, but it would give you a better understanding of their expertise. If they answered that cursors are the best option for many database maintenance tasks but should be avoided for typical DML operations, you may have somebody who understands their true usefulness. 😉

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • g.britton (6/1/2015)


    Anyway, what this does is allow continued good performance for other users, since their reads are not blocked by locks on the table you are updating.

    Brian J. Parker (6/1/2015)


    Or did you mean that my example user queries should have been using snapshot isolation? For sake of argument, what if they were coming from a third-party app I couldn't control?

    It might seem tortured, but you did want an example.

  • This also got me thinking of another real-world example I've faced. Let's say you have a third-party product that includes a really complicated stored procedure that does a lot of hard-to-understand wizardry. The product will call it from its GUI, passing in a couple of values. Imagine:

    EXEC spComplicatedMagic @Acct='3478623', @Amt='7.13', @Magic='FOO'

    Your employer can save a ton of work, it realizes, by importing a bunch of (Acct, Amt, Magic) rows in from a flat file (sent over by a customer) and calling the stored procedure for each set of values... otherwise some poor shmoe has to manually type them into the GUI, one by one, probably making mistakes.

    I think you would have a hard time doing a set-based operation that worked better than importing that flat file to a table and using a cursor to call the inscrutable stored procedure for each row. On each call of the procedure you catch errors and log them to the table, maybe setting a flag as to its disposition.
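
    Something along these lines (a rough sketch only; dbo.ImportedRows, its columns and the Disposition flag are made-up names for illustration, and spComplicatedMagic is the imaginary third-party proc from above):

    -- Hypothetical staging table loaded from the customer's flat file:
    -- CREATE TABLE dbo.ImportedRows
    --     (RowId int IDENTITY PRIMARY KEY, Acct varchar(20), Amt money,
    --      Magic varchar(10), Disposition nvarchar(4000) NULL)

    DECLARE @RowId int, @Acct varchar(20), @Amt money, @Magic varchar(10)

    DECLARE curImport CURSOR LOCAL FAST_FORWARD FOR
        SELECT RowId, Acct, Amt, Magic
        FROM   dbo.ImportedRows
        WHERE  Disposition IS NULL

    OPEN curImport
    FETCH NEXT FROM curImport INTO @RowId, @Acct, @Amt, @Magic

    WHILE @@FETCH_STATUS = 0
    BEGIN
        BEGIN TRY
            EXEC dbo.spComplicatedMagic @Acct = @Acct, @Amt = @Amt, @Magic = @Magic
            UPDATE dbo.ImportedRows SET Disposition = 'OK' WHERE RowId = @RowId
        END TRY
        BEGIN CATCH
            -- Record the error against the row and keep going with the next one.
            UPDATE dbo.ImportedRows SET Disposition = ERROR_MESSAGE() WHERE RowId = @RowId
        END CATCH

        FETCH NEXT FROM curImport INTO @RowId, @Acct, @Amt, @Magic
    END

    CLOSE curImport
    DEALLOCATE curImport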

    Again, perhaps a tortured example, but it's pretty close to what I've dealt with. It's not common but it does provide a rare example where a cursor makes sense.

  • James Stephens (6/1/2015)

    I think you grossly misunderstood my point. I was not saying that all set based operations should be replaced by cursor-loops. My point could be distilled to "Just because you can, doesn't mean you should".

    There are certain things that are simply more easily understood by most people (even Sql-Gods) than overly complex sql statements containing complex business logic. I happen to work in the real world, and yes, I've seen the other comments about re-evaluating your hiring practices, making sure you don't have a "befuddled" successor, etc. Great. Wish I lived in that world where everyone was as perfect as you seem to think you are (on second thought, no I don't).

    I'd rather have a project that's easily maintainable and less prone to a mistake in a business-rule change by "the new guy" and have it cost a few more nanoseconds than one that is only maintainable by the highest paid top sql-jockeys (to throw in another term for your Monday amusement).

    There are simply some things that fall into the category I mentioned and to refuse to acknowledge that is arrogant. Those of us down here in the trenches of the real world recognize that, and some of us actually know what we're doing. Sorry you disagree.

    Jim

    When you say "Just because you can, doesn't mean you should" are you implying that SQL Developers initially think in terms of loops and then arbitrarily decide to write set-based "just because they can"?

    What are the "certain things" you claim are easier to understand then "overly complex sql statements containing complex business logic"? Overly complex loops containing complex business logic?

    For the sake of argument, let's say looping code is easier to understand and maintain (I disagree). Do you think it's responsible to write inefficient code in the hopes it is less likely to be broken by someone who doesn't know what they are doing?

    As I said before, there are lots of things you can and should do to make your code more readable/maintainable. For example, I like to put business rules in comments near the code that supports them. If that business rule changes, the code that needs to be changed is more readily found.
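
    For instance, something like this made-up fragment (the table, columns and rule number are invented purely to show the idea):

    -- BUSINESS RULE BR-017 (last changed 2015-04-02): orders over $500.00 from customers
    -- with less than 3 months of history must be flagged for manual review.
    UPDATE o
    SET    o.NeedsReview = 1
    FROM   dbo.Orders o
    JOIN   dbo.Customers c ON c.CustomerId = o.CustomerId
    WHERE  o.OrderTotal > 500.00                             -- BR-017 threshold
      AND  c.FirstOrderDate > DATEADD(MONTH, -3, GETDATE())  -- BR-017 "new customer" window

    If the $500 threshold ever changes, a search for "BR-017" finds every piece of code that enforces it.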

    I'm in the trenches, too, often refactoring poorly written cursor-based code to be set-based because of problems it is causing in the "real world". The irony is that often most of the work has already been done! There's a select statement right there in the cursor definition. Look-up codes and business rules are all right there, too.

    I wish I had the luxury of refactoring all of the cursors out of that third-party application, but every change needs to be tested. Spending time re-coding and testing production code that's not causing trouble can be a hard sell to management. I suspect that might be one of the reasons Microsoft still has cursors in SQL Server itself.

    Regarding hiring practices, I wrote test questions for SQL developer job applicants. One of the questions asked how to improve the performance of example code that used a cursor loop. Applicants who couldn't write set-based code didn't get hired. The applicants that were hired weren't highly paid sql-jockeys, either.

    Don't recall saying I thought I was perfect. I've learned a lot over the years from this site though and will always remember the time I posted an example where I was convinced a cursor had to be used. I learned about set-based approaches using tally tables that day and also learned about what a wonderful resource this site is. There's a lot to be gained from the "SQL-God/Peacock/Jockeys" who post regularly here!

  • Brian J. Parker (6/1/2015)


    g.britton (6/1/2015)


    Actually, I challenge that claim. Please post some actual results.

    Alan.B (6/1/2015)


    Until someone, anyone, can post an example where a cursor or other iterative method was faster than a well-written set-based solution I will remain convinced that the set-based solution is the way to go.

    Just for kicks, I thought I'd respond to these as a thought exercise.

    First, create a sample table with a bunch of random data:

    set nocount on

    create table dbo.foobar (foo int)

    declare @x int = 1

    while @x <= 100000 begin

    insert into dbo.foobar (foo) values (round(rand()*1000000,0))

    set @x = @x + 1

    end

    Now, let's create a little activity on the system. This simulates imaginary users who are querying this ugly table on a regular basis and it will keep running until you stop it:

    declare @now time, @now_string char(12), @before time, @delay int, @dummy int

    while 1=1 begin

    select @dummy = count(*) from dbo.foobar where foo = (round(rand()*1000000,0))

    set @now = GetDate(); set @now_string = cast(@now AS char(12)); set @delay = datediff(ms, @before, @now)-2000

    RAISERROR ('It is now %s, query took approximately %i ms', 0, 1, @now_string, @delay) WITH NOWAIT

    set @before = @now

    waitfor delay '00:00:02'

    end

    I get this started on my test server and see a bunch of delays in the 15-30 ms range.

    Now, I open a second window, and do a big set-based update of all rows:

    update dbo.foobar set foo = foo - 1

    The update itself is almost instant; but when I look back at the first window, I see a blip, a delay of 1314 ms. Our users noticed a little hiccup there!

    Now I do the same thing with a cursor:

    declare @dummy int

    declare curfoo cursor local forward_only

    for select foo from dbo.foobar

    for update of foo

    open curfoo; fetch next from curfoo into @dummy

    while @@fetch_status = 0 begin

    update dbo.foobar set foo = foo - 1 where current of curfoo

    fetch next from curfoo into @dummy

    end

    close curfoo; deallocate curfoo

    This is way slower, it takes, like, 30 seconds to run; however, over in my original window, the delays go up only slightly: they're about 25 ms to 40 ms.

    I would conclude from this what has been intuitively suggested: although a cursor is way less efficient at doing a given operation, it might have lower impact on others. I guess whether you would call this "faster" is debatable, though... faster for others on the system, not for the cursor operation. And I imagine someone here might be able to suggest a better set-based solution, or point out that I should be using a different isolation level...? (Seriously! I mean, this could be a learning experience for me.)

    So I had some time to run your test, and the results were similar: the update was nearly instantaneous (723 ms for 100K rows) using the ol' fashioned update, whereas the cursor took ~5 seconds. I also saw a blip from 30 ms to 300 ms while the regular update ran, while there was no noticeable impact from the cursor.

    So what's the takeaway here? It's better for an update to run for several extra seconds to prevent a several hundred ms blip for a single user? Rhetorical question.

    Here's what I found to be interesting: the cursor update ran forever (I had to cancel it) the first time I ran it because I did not SET NOCOUNT ON. I ran the test a second time with 10K rows; the regular update runs in 50 ms and returns "(10000 row(s) affected)" once, whereas the cursor update spits out "(1 row(s) affected)" 10,000 times. Just another reason that I prefer set-based over cursors and loops.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (6/1/2015)


    So what's the takeaway here? It's better for an update to run for several extra seconds to prevent a several hundred ms blip for a single user? Rhetorical question.

    Here's what I found to be interesting: the cursor update ran forever (I had to cancel it) the first time I ran it because I did not SET NOCOUNT ON.

    It was a quick illustrative example. If that was 10 million rows, and my sample user queries were several hundred per second instead of one every two seconds, then you might see numbers that made a better case to hurt the update in favor of the users.

    Since I ran both updates in the same window with SET NOCOUNT ON, I didn't run into the problem you did. Sorry I neglected to include it in all my sample code.

  • Brian J. Parker (6/1/2015)


    Alan.B (6/1/2015)


    So what's the takeaway here? It's better for an update to run for several extra seconds to prevent a several hundred ms blip for a single user? Rhetorical question.

    Here's what I found to be interesting: the cursor update ran forever (I had to cancel it) the first time I ran it because I did not SET NOCOUNT ON.

    It was a quick illustrative example. If that was 10 million rows, and my sample user queries were several hundred per second instead of one every two seconds, then you might see numbers that made a better case to hurt the update in favor of the users.

    Since I ran both updates in the same window with SET NOCOUNT ON, I didn't run into the problem you did. Sorry I neglected to include it in all my sample code.

    I hear you, and you make an interesting point. Though the cursor dramatically slowed down the update, I do see how your example would improve the user experience, provided there was a requirement to update, say, 10 million rows. What you demonstrated was new to me. I never ran that kind of test and think it's something worth investigating further.

    Let me ask a follow-up question, and please don't take this as argumentative (you make a good point regardless): have you ever had to do a 10,000,000-row update on a transactional system? I do that kind of thing all the time in a data warehouse, where I don't have to worry about slowing down end users, but I can't think of where I would insert or update that many rows on an OLTP system. Not saying it does not happen; I just can't think of a real-life example.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • There's a pretty good chance on most systems that 1) 100,000 rows would not cause a 1.3-second blip (it would normally take an order of magnitude more rows) and 2) on most systems I've seen, a 1.3-second blip would not be noticed because even a simple screen change takes more than 5 seconds. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Alan.B (6/1/2015)


    Let me ask a follow-up question, and please don't take this as argumentative (you make a good point regardless): have you ever had to do a 10,000,000-row update on a transactional system? I do that kind of thing all the time in a data warehouse, where I don't have to worry about slowing down end users, but I can't think of where I would insert or update that many rows on an OLTP system. Not saying it does not happen; I just can't think of a real-life example.

    Well, in a way, I'm the one being argumentative, inasmuch as I read "never" and try to find counterexamples. Realistically, though?... in answer to your question, I can only think of once or twice in fifteen years, in one case cleaning denormalized data that was calculated wrong. But that's not "never"! 😉

    My "run a stored procedure for every row in a table" example is more realistic, to me, at least as far as things that actually happen in my career. I'm used to supporting complex products where a lot of magic I don't understand is happening inside a procedure (usually written by a third party) and somebody wants to automate things. Sometimes I might safely be able to rewrite, but given a stored procedure of sufficient complexity, it would be risky and/or prohibitively expensive. Especially if it doesn't need to perform that well.

    Both are rare examples, and I don't think people should make excuses to not learn to do things better. (For example, the set-based way of populating my sample table should have been intuitive to me.)
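
    For the record, something like this would have loaded dbo.foobar without the loop (a rough sketch; the sys.all_objects cross join is just a quick row source, and CHECKSUM(NEWID()) is used because RAND() evaluates once per statement and would give every row the same value):

    -- Set-based version of the 100,000-row load.
    INSERT INTO dbo.foobar (foo)
    SELECT TOP (100000) ABS(CHECKSUM(NEWID())) % 1000000  -- per-row pseudo-random int, 0 to 999999
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b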

  • Brian J. Parker (6/1/2015)


    This also got me thinking of another real-world example I've faced. Let's say you have a third-party product that includes a really complicated stored procedure that does a lot of hard-to-understand wizardry. The product will call it from its GUI, passing in a couple of values. Imagine:

    EXEC spComplicatedMagic @Acct='3478623', @Amt='7.13', @Magic='FOO'

    Your employer can save a ton of work, it realizes, by importing a bunch of (Acct, Amt, Magic) rows in from a flat file (sent over by a customer) and calling the stored procedure for each set of values... otherwise some poor shmoe has to manually type them into the GUI, one by one, probably making mistakes.

    I think you would have a hard time doing a set-based operation that worked better than importing that flat file to a table and using a cursor to call the inscrutable stored procedure for each row. On each call of the procedure you catch errors and log them to the table, maybe setting a flag as to its disposition.

    Again, perhaps a tortured example, but it's pretty close to what I've dealt with. It's not common but it does provide a rare example where a cursor makes sense.

    Totally agree! I hit this very situation (minus the typing) on a recent contract. I hated myself for it, but a cursor-based approach was the only real viable option.

    Well, actually, there was another option. I could have built an nvarchar from all the exec calls and run it as dynamic SQL. It might have even run better, but time pressures did not allow me to try it.

    Gerald Britton, Pluralsight courses

  • g.britton (6/2/2015)


    Brian J. Parker (6/1/2015)


    This also got me thinking of another real-world example I've faced. Let's say you have a third-party product that includes a really complicated stored procedure that does a lot of hard-to-understand wizardry. The product will call it from its GUI, passing in a couple of values. Imagine:

    EXEC spComplicatedMagic @Acct='3478623', @Amt='7.13', @Magic='FOO'

    Your employer can save a ton of work, it realizes, by importing a bunch of (Acct, Amt, Magic) rows in from a flat file (sent over by a customer) and calling the stored procedure for each set of values... otherwise some poor shmoe has to manually type them into the GUI, one by one, probably making mistakes.

    I think you would have a hard time doing a set-based operation that worked better than importing that flat file to a table and using a cursor to call the inscrutable stored procedure for each row. On each call of the procedure you catch errors and log them to the table, maybe setting a flag as to its disposition.

    Again, perhaps a tortured example, but it's pretty close to what I've dealt with. It's not common but it does provide a rare example where a cursor makes sense.

    Totally agree! I hit this very situation (minus the typing) on a recent contract. I hated myself for it, but a cursor-based approach was the only real viable option.

    Well, actually, there was another option. I could have built an nvarchar from all the exec calls and run it as dynamic SQL. It might have even run better, but time pressures did not allow me to try it.

    I have a sneaking suspicion that repeatedly calling a stored procedure, even from one batch with the stacked-up calls, will incur a penalty similar to row-by-row processing, because you still repeatedly load the stored procedure invocations into the interpreter, and I'm wondering how fast T-SQL's call stack is. That would be an interesting one to test!
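
    If anyone does want to test it, the idea would be roughly this (a sketch only; dbo.ImportedRows is the same made-up staging table as in the cursor sketch earlier in the thread, and real code would also need to escape any embedded quotes in the values):

    -- Build one big batch of EXEC calls and run it once with sp_executesql.
    DECLARE @sql nvarchar(max)

    SELECT @sql =
        (SELECT 'EXEC dbo.spComplicatedMagic @Acct = ''' + Acct + ''', '
              + '@Amt = ''' + CONVERT(varchar(20), Amt) + ''', '
              + '@Magic = ''' + Magic + ''';' + CHAR(13) + CHAR(10)
         FROM dbo.ImportedRows
         FOR XML PATH(''), TYPE).value('.', 'nvarchar(max)')

    EXEC sys.sp_executesql @sql

    One thing to keep in mind: unlike the cursor version, this gives up the per-row TRY/CATCH logging, so a failure partway through is harder to attribute to a specific input row.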
