ADO recordsets and scaling

  • I am trying to improve some legacy code that is used on tables of up to around 20 million rows.  It uses VB6 and ADO to update rows roughly like this:

        Open editable ADO server-side recordset with next batch of records

            Process each record and update the row (using rs.Update, not rs.UpdateBatch)

            Repeat for each row

        Repeat for each batch

    (Don't suggest moving the process to the server...the client processing involves some buggy third-party tools that aren't getting anywhere near my SQL servers.)
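
    For reference, the old pattern boils down to something like this (a minimal sketch only; the connection string, table and column names, and the ProcessRow helper are placeholders, not the real code):

        Dim cn As New ADODB.Connection
        Dim rs As New ADODB.Recordset

        cn.Open "Provider=SQLOLEDB;Data Source=MyServer;Initial Catalog=MyDb;Integrated Security=SSPI"

        ' Editable server-side recordset over the next batch of rows
        rs.CursorLocation = adUseServer
        rs.Open "SELECT pk, in1, in2, out1 FROM dbo.MyTable WHERE pk BETWEEN 1 AND 10000", _
                cn, adOpenKeyset, adLockOptimistic

        Do Until rs.EOF
            rs!out1 = ProcessRow(rs!in1, rs!in2)   ' heavy client-side work (third-party DLLs, COM, API calls)
            rs.Update                              ' one round trip per row, no UpdateBatch
            rs.MoveNext
        Loop

        rs.Close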

    This program has worked tolerably well for years, but I'm trying to improve it and re-architect it for conversion to .NET.  I have changed it to:

        Open detached client-side ADO recordset with next batch of records (input fields and key only)

            Process each record and create UPDATE T-SQL command for output fields

            Use asynchronous Connection.Execute to update the row

            Repeat for each row

        Repeat for each batch
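
    The reworked loop looks roughly like this (again only a sketch; cn is the same connection as above, and ProcessRow / BuildUpdateSql are hypothetical stand-ins for the real client logic):

        Dim rs As New ADODB.Recordset
        Dim sql As String

        ' Detached, client-side recordset: input fields and key only
        rs.CursorLocation = adUseClient
        rs.Open "SELECT pk, in1, in2 FROM dbo.MyTable WHERE pk BETWEEN 1 AND 10000", _
                cn, adOpenStatic, adLockReadOnly
        Set rs.ActiveConnection = Nothing            ' disconnect the recordset

        Do Until rs.EOF
            sql = BuildUpdateSql(rs!pk, ProcessRow(rs!in1, rs!in2))

            ' Wait for any previous asynchronous UPDATE to finish before issuing the next one
            Do While (cn.State And adStateExecuting) <> 0
                DoEvents
            Loop
            cn.Execute sql, , adAsyncExecute + adExecuteNoRecords

            rs.MoveNext
        Loop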

    The new version is much faster than the old one (350,000 rows per hour vs 200,000).  But the old one did not overload the CPU, and we could run up to 4 copies on a single CPU client (more on dual CPU boxes) and still get 200,000 RPH out of each (if the server wasn't busy), or run copies on multiple clients and get over 1,000,000 RPH.

    The new version actually slows down when a second copy runs on the same box (even with dual CPUs), and shows only a marginal improvement (slightly over 400,000 RPH) when run from two clients.  More than one instance running brings the client to its knees.

    One obvious improvement is to spool the UPDATE commands and send them in batches (around 4000-10000 bytes apiece?).  But the recordset version isn't batching updates, so that would be an unfair comparison.  What is ADO doing that makes it so much more scalable?  Is it creating an updatable cursor on the server?  The recordset is opened with optimistic locking, but is there a difference in locking behavior?

  • Tough question to answer without the nitty-gritty details, but here is what I think.

    1. Because batch update is not being used, I would assume that some parameterization is happening under the ADO covers, improving performance and maximizing plan reuse. To check that, you have to take a Profiler trace and look at the prepare commands to see if that assumption is correct.

    I don't know why the .NET version is not doing it, though.

    2. If you insist on doing it one record at a time, create a stored procedure that performs the update (USE PARAMETERS ON THE CALL) and that will work around the reuse problem (see the sketch below).

    3. Why not "spool" on the server, as you said? I know I favor server-side processing (can't blame a DBA for that), but in cases like this I have seen massive gains when dealing with a non-trivial number of records.
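
    For point 2, the stored procedure could be as simple as this (table and column names are only placeholders):

        CREATE PROCEDURE dbo.usp_UpdateRow
            @pk   int,
            @out1 varchar(50)
        AS
            UPDATE dbo.MyTable
            SET    out1 = @out1
            WHERE  pk = @pk
        GO

    Called from the client through a Command object with Parameters, the plan is compiled once and reused for every row.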

    HTH

    * Noel

  • Just to clarify, I haven't migrated the code to .NET yet but I've rewritten it to work like a DataReader and SQLCommand so migration will be easier when the time comes.  (And also because I expected better performance.)

    The next step I had in mind was to concatenate the UPDATE commands in batches of 4k or 5k and send them with an asynchronous Connection.Execute.  I just wanted to make sure I had the best approach before I went any further.  I can't use a predefined stored proc because the program might be used on any table in any db on any server, with a variable set of columns.  I could create a temporary custom stored proc, but again I want to make sure I'm doing the right thing before I do more tweaking.
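
    The batching step I have in mind would look something like this (sketch only; BuildUpdateSql and ProcessRow are the same hypothetical helpers as above, and the 4 KB threshold is just a starting point):

        Const BATCH_BYTES As Long = 4096
        Dim buf As String

        Do Until rs.EOF
            buf = buf & BuildUpdateSql(rs!pk, ProcessRow(rs!in1, rs!in2)) & ";" & vbCrLf
            If Len(buf) >= BATCH_BYTES Then
                cn.Execute buf, , adExecuteNoRecords   ' send the whole batch in one round trip
                buf = ""
            End If
            rs.MoveNext
        Loop

        If Len(buf) > 0 Then cn.Execute buf, , adExecuteNoRecords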

    I have been indoctrinated that CURSORS ARE EVIL!, but I'm wondering if it would be better to open an updatable cursor and use "UPDATE ... WHERE CURRENT OF" rather than "UPDATE ... WHERE <pk> = x".  I would still have the option of spooling commands and/or using a temporary stored proc.  I have no experience with updatable cursors, but is that the secret of ADO's better scalability?
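
    For reference, the cursor variant I'm picturing looks like this in T-SQL (a sketch with placeholder names; in the real program the FETCH/UPDATE pairs would be driven from the client, with the output value coming from the client-side processing):

        DECLARE @pk int, @in1 int, @in2 int, @out1 varchar(50)

        DECLARE batch_cur CURSOR FOR
            SELECT pk, in1, in2, out1
            FROM   dbo.MyTable
            WHERE  pk BETWEEN 1 AND 10000
            FOR UPDATE OF out1

        OPEN batch_cur
        FETCH NEXT FROM batch_cur INTO @pk, @in1, @in2, @out1

        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- @out1 would be computed on the client for this row
            UPDATE dbo.MyTable
            SET    out1 = @out1
            WHERE CURRENT OF batch_cur

            FETCH NEXT FROM batch_cur INTO @pk, @in1, @in2, @out1
        END

        CLOSE batch_cur
        DEALLOCATE batch_cur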

  • Is the logic of the update very complicated?

    Maybe you could do something like

        UPDATE table SET ... WHERE pk BETWEEN X AND Y

    in the client loop. Then X becomes Y + 1 (the next row anyway), you jump 5k or 10k ahead in the recordset (using the bookmark property), reset Y and resend the query. That would avoid transferring a lot of data on the network and sending huge queries that the server would have to parse. And since a lot of rows would be updated at the same time, you might save some time there too.
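
    Roughly, the client loop might look like this (a sketch only; it assumes the same values can be applied to every row in a range, and uses rs.Move instead of bookmarks for brevity):

        Dim x As Long, y As Long
        Dim sql As String

        rs.MoveFirst
        Do Until rs.EOF
            x = rs!pk
            rs.Move 4999                       ' jump ahead in the keyset
            If rs.EOF Then rs.MoveLast
            y = rs!pk
            sql = "UPDATE dbo.MyTable SET out1 = 1 " & _
                  "WHERE pk BETWEEN " & x & " AND " & y
            cn.Execute sql, , adExecuteNoRecords
            rs.MoveNext                        ' start of the next range
        Loop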

  • There is a lot of VB code used to determine the update values, and they're unique for each row.  Some of the processing uses third-party DLLs, some COM, and some plain old API calls.  These DLLs are not going on the servers, both for licensing and stability reasons.

  • "both for licensing and stability reasons."

    And sanity. This seems to be very complex. Are you sure it's not possible to rewrite that with CASE expressions to set the values? I think that's really the next boost you can get (although you would move some of the work onto the server... which could make the whole thing slow down in the end).

    Have you considered making a big text file and using bcp to pump the data back in?
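
    For example, the processed rows could be written out to a flat file and loaded into a staging table with something like this (server, database, table and file names are placeholders):

        bcp MyDb.dbo.Staging in results.txt -c -S MyServer -T

    followed by a set-based UPDATE joining the staging table to the target on the primary key.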

  • Ok back from lunch.

    1. >>I can't use a predefined stored proc because the program might be used on any table in any db on any server, with a variable set of columns.<<

    BAD, BAD, BAD... No application should be designed that way. It will hurt badly when it comes to maintenance!

    2. >>I have been indoctrinated that CURSORS ARE EVIL<< ABSOLUTELY CORRECT (am I a DBA or what!)

    3. You should follow the sp idea, but if you still insist that an sp is not for you, then listen to this:

    Because you are directly updating tables on the server, you may as well move ALL of your output set into a temp table on the server and then update chunk by chunk, something like:

        UPDATE d
        SET    d.out1 = t.out1             -- output column(s) computed on the client
        FROM   destination d
               JOIN #YourTempTable t ON t.pkey = d.pkey
        WHERE  t.pkey BETWEEN @min_chunk_id AND @max_chunk_id
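
    One (hypothetical) way to drive those chunks, assuming an integer key:

        DECLARE @min_id int, @max_id int, @chunk int
        SET @chunk = 10000
        SELECT @min_id = MIN(pkey), @max_id = MAX(pkey) FROM #YourTempTable

        WHILE @min_id <= @max_id
        BEGIN
            UPDATE d
            SET    d.out1 = t.out1
            FROM   destination d
                   JOIN #YourTempTable t ON t.pkey = d.pkey
            WHERE  t.pkey BETWEEN @min_id AND @min_id + @chunk - 1

            SET @min_id = @min_id + @chunk
        END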

    Hth

    * Noel

  • BAD, BAD, BAD... No application should be designed that way. It will hurt badly when it comes to maintenance!

    He's already feeling that one.

    CURSORS ARE EVIL. OK, I'm thinking like a DBA.

    Update by chunk... almost the same thing I proposed, but you'd have to drop the actual table once the update is over... not a big problem.

  • >> Update by chunk... almost the same thing I proposed, but you'd have to drop the actual table once the update is over... not a big problem. <<

    Really, the issue is that the INSERT happens on a TEMP table, so the bandwidth is consumed once and on something that does not block anyone. I mentioned the chunks because, depending on the architecture, you may need them; but if this is a one-time thing (run at off-peak hours) you don't need them and it can be done in ONE shot.

    Dropping a temporary table is extremely fast; I didn't even think of that as being a problem at all.

    * Noel

  • I never said it was a problem... Just an extra step to know about.

  • So, is it OK if I assume you are agreeing with me?

    * Noel

  • That word is banned from my dictionary until next week.

  • Remi - forget it, this will never be rewritten in SQL.  There is a lot of processing on the client for each row.  I would rather tie up a couple of free desktop boxes with that processing than bog the server down for days.  And with the instability of one of the (unfortunately necessary) DLLs, I won't even consider writing it as an extended stored procedure.  It's not going to run in SQL Server's address space, and it's not even going to be installed on the server.

    Noel - We don't have a single defined process, we do a lot of custom work in batches on different servers.  The program is a standard application for us but it is used for various projects that all have their own databases.  It is better to point the application to different tables than to force everyone to move their data in and out of a standard table.  And I'm not insisting that sp's are not the way to go, I'm trying to make sure I use the best application design before I write the sp.

    My fundamental question is, what does ADO do differently that scales almost linearly (until you swamp the server or network)?  Has anyone else had this problem with ADO recordset-based batch process apps migrated to .NET using DataReader & SQLCommand?

    ADO is not moving client code to the server, or playing games with temp tables.  When it does a Recordset.Update, the program waits until the update has been completed on the server.  But somehow it manages to do the updates and keep the server-side cursor synced with the client recordset with low enough overhead that four or more processes can be doing the same thing at the same time.  I think we've had as many as 16 copies running at once (on multiple clients) updating the same table (separate row ranges).  It doesn't keep scaling linearly, but the extra clients add to the aggregate rows per minute.  The new program doesn't even scale well to 2 clients.

  • Have you run a trace to see what calls are made on the server with each version of the code? Maybe there are also settings that you missed.

  • I said before that you need a PROFILER TRACE to know the difference, and that my guess is that it is PARAMETERIZING the UPDATES. It is difficult to know what is really going on without the code or the trace.
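
    If that guess is right, the trace would show something along these lines for each row (illustrative only; the statement and parameter names are made up):

        exec sp_executesql N'UPDATE dbo.MyTable SET out1 = @P1 WHERE pk = @P2',
                           N'@P1 varchar(50), @P2 int', @P1 = 'abc', @P2 = 42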

    BTW: That temp table "play" is the way to go if you want to scale!!

    * Noel
