Performance Tuning Large update statement--HELP

I have a 1.2 TB database in which I am updating a couple of columns across 11 tables.

T1 (F_Name, L_Name, Address)

T2 (User_Id, Email_Address)

    ..etc

All tables have different columns. I have written a WHILE loop that builds the UPDATE statement for each table and then executes it against that table.

The UPDATE statement looks like this:

USE DBNAME;

UPDATE T1
SET T1.F_Name = fData(F_Name),
    T1.L_Name = fData(L_Name),
    T1.Address = fData(Address);

fData is a function that takes the column value in and obfuscates it, so the data can be used in our DEV environment without putting sensitive production data in developers' hands. I am not concerned about table locks in this instance, as the process runs on my personal testbox where the database sits.

The issue is that I have run this statement on, say, T1 (~1.2 million rows) and it takes about 2-3 minutes. Running it on a table like T6 (~250 million rows), with similar column types, is taking over 12 hours. Extrapolating from the smaller tables (roughly 208 times the rows), it should take closer to 8 or 9 hours, not 12-plus. I'm wondering what kind of performance enhancements I can make to this process.

Things I'm considering:

- Setting the database to SIMPLE recovery to cut logging overhead.
- Completing the updates in batches (not needed for locking purposes, but would SQL Server retrieving and updating smaller sets be faster than updating the whole table at once?).
- Driving the update off the table's indexes.
- Disabling the table's indexes during the update.
- Setting the transaction isolation level.

Any other thoughts? Or can you expound on how each of these would help a LARGE table update, disregarding any blocking needs?
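For the recovery-model and index items, this is roughly what I mean (DBNAME and the index name are placeholders for my actual objects):

-- Cut logging overhead; note that a single huge UPDATE is still one
-- transaction, so the log can still grow large within that one statement
ALTER DATABASE DBNAME SET RECOVERY SIMPLE;

-- Disable nonclustered indexes on the target table first, rebuild after
ALTER INDEX IX_T6_Email ON dbo.T6 DISABLE;
-- ...run the UPDATE...
ALTER INDEX IX_T6_Email ON dbo.T6 REBUILD;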

Testbox specs: dual quad-core processors, 96 GB RAM

SQL specs: SQL Server 2008 R2

Many of the things you listed as possibilities are likely to help, but probably the biggest performance woe is calling that same scalar function 11 times per row across 250 million rows; that is 2.75 billion calls to the same scalar function. :w00t: Depending on the complexity of the function, turning it into an iTVF (inline table-valued function) is likely to be the biggest help. I am afraid, though, that being a data obfuscator, it may not be expressible as a single inline statement.
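For illustration, here is the shell of an iTVF and how it would be called; the REVERSE is only a placeholder, since we have not seen what fData actually does:

-- Shell of an inline table-valued function (iTVF)
CREATE FUNCTION dbo.itvf_fData (@Value varchar(8000))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN
    SELECT ObfuscatedValue = REVERSE(@Value); -- placeholder logic only
GO

-- Used via CROSS APPLY so the optimizer can inline the logic into the plan:
UPDATE t
SET    t.F_Name = f1.ObfuscatedValue,
       t.L_Name = f2.ObfuscatedValue
FROM   dbo.T1 t
CROSS APPLY dbo.itvf_fData(t.F_Name) f1
CROSS APPLY dbo.itvf_fData(t.L_Name) f2;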

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (7/15/2013)


Many of the things you listed as possibilities are likely to help, but probably the biggest performance woe is calling that same scalar function 11 times per row across 250 million rows; that is 2.75 billion calls to the same scalar function. :w00t: Depending on the complexity of the function, turning it into an iTVF (inline table-valued function) is likely to be the biggest help. I am afraid, though, that being a data obfuscator, it may not be expressible as a single inline statement.

Agree. You might also be able to in-line the logic into a CROSS APPLY, which could also drastically improve performance vs. a called scalar function.

Finally, explicitly specify an exclusive table lock to prevent SQL Server from possibly having to escalate locks:

...
UPDATE dbo.tablename WITH (TABLOCKX)
...
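Putting the two together, an in-lined version might look something like this; the REVERSE calls are just stand-ins for the real obfuscation expressions:

-- Obfuscation logic in-lined via CROSS APPLY (VALUES ...), with an
-- exclusive table lock taken up front
UPDATE t
SET    t.F_Name = ca.F_Name_Obf,
       t.L_Name = ca.L_Name_Obf
FROM   dbo.T1 t WITH (TABLOCKX)
CROSS APPLY ( VALUES
    ( REVERSE(t.F_Name), REVERSE(t.L_Name) ) -- stand-in expressions
) ca (F_Name_Obf, L_Name_Obf);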

    SQL DBA,SQL Server MVP(07, 08, 09) "It's a dog-eat-dog world, and I'm wearing Milk-Bone underwear." "Norm", on "Cheers". Also from "Cheers", from "Carla": "You need to know 3 things about Tortelli men: Tortelli men draw women like flies; Tortelli men treat women like flies; Tortelli men's brains are in their flies".

  • Also any thoughts on this approach?

SELECT * INTO T1_Obfuscated FROM T1 (obfuscating only the columns that need it), then dropping T1 and renaming T1_Obfuscated to T1. I'm not really worried about space; it's more about time to complete and performance, as this process pretty much ran all weekend and did not finish. I have a requirement for it to complete within 24 hours.
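Roughly what I have in mind, with fData applied only where needed (SELECT ... INTO is minimally logged under SIMPLE or BULK_LOGGED recovery, but indexes, constraints, and permissions on T1 would have to be recreated afterward):

SELECT fData(F_Name)  AS F_Name,
       fData(L_Name)  AS L_Name,
       fData(Address) AS Address
       -- ...plus any non-sensitive columns passed through as-is
INTO   dbo.T1_Obfuscated
FROM   dbo.T1;

DROP TABLE dbo.T1;
EXEC sp_rename 'dbo.T1_Obfuscated', 'T1';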

  • dspink (7/15/2013)


    Also any thoughts on this approach?

SELECT * INTO T1_Obfuscated FROM T1 (obfuscating only the columns that need it), then dropping T1 and renaming T1_Obfuscated to T1. I'm not really worried about space; it's more about time to complete and performance, as this process pretty much ran all weekend and did not finish. I have a requirement for it to complete within 24 hours.

Can you post your function, and maybe a sample table similar to the real one?


  • dspink (7/15/2013)


    Also any thoughts on this approach?

SELECT * INTO T1_Obfuscated FROM T1 (obfuscating only the columns that need it), then dropping T1 and renaming T1_Obfuscated to T1. I'm not really worried about space; it's more about time to complete and performance, as this process pretty much ran all weekend and did not finish. I have a requirement for it to complete within 24 hours.

    I would think that would be far more overhead than the UPDATE.

    Some other important things I left out:

    You should pre-allocate enough log space to handle the UPDATE.
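For example (the logical log file name and target size are placeholders; size it to your expected log volume):

-- Grow the log once, up front, instead of paying for repeated autogrow events
ALTER DATABASE DBNAME
MODIFY FILE (NAME = DBNAME_log, SIZE = 200GB);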

    And, since this could be a huge amount of log data, you might want to break the UPDATE into batches, based on the cluster key.


Smaller table after obfuscation: (see attachment). Before obfuscation, a row may have looked like FirstName = JOHN, MiddleName = NULL, LastName = Williams. The function calls a .dll assembly that takes in a string (e.g., FirstName = 'JOHN') and, for each character, uses a pre-determined algorithm to assign a new value (very much "ABCD will always become GHMD"), applied to all uppercase letters, lowercase letters, and the digits 0-9. The resulting string is returned as the value shown in the table. Seems very RBAR-like, I know, but it is what the developer asked me to implement.
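For illustration only, a set-based version of that kind of per-character substitution could look something like this (the mapping strings are made up, only uppercase is shown, dbo.Tally is assumed to be a numbers table, and XML-sensitive characters such as & would need extra handling):

DECLARE @src char(26) = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'; -- made-up mapping
DECLARE @dst char(26) = 'GHMDQZXWVKLPNBYRAJCFTEUOSI';

SELECT s.FirstName,
       Obfuscated =
       ( SELECT CASE WHEN CHARINDEX(SUBSTRING(s.FirstName, n.N, 1),
                                    @src COLLATE Latin1_General_BIN) > 0
                     THEN SUBSTRING(@dst,
                                    CHARINDEX(SUBSTRING(s.FirstName, n.N, 1),
                                              @src COLLATE Latin1_General_BIN), 1)
                     ELSE SUBSTRING(s.FirstName, n.N, 1) -- unmapped characters pass through
                END
         FROM   dbo.Tally n
         WHERE  n.N <= LEN(s.FirstName)
         ORDER BY n.N
         FOR XML PATH('') )
FROM dbo.SampleTable s; -- placeholder table name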

One of the things most people never run into is the "Tipping Point" that each update has, and when they do run into it, they don't know what's happening. To make a much longer story shorter, you've simply given the server too much to do for the memory and cache resources it has.

For example, on a test table on my 11-year-old desktop box, it takes just 3 seconds to update a million rows. I know... it's slow, but it's an old box. As expected, it only takes twice as long (6 seconds) to update twice as many rows (2 million). When I try to update 3 million rows, it takes over 2 hours.

The fix I used was to do my updates in a loop where each iteration updated just one million rows. This isn't a RBAR loop... it's a process-control loop, and it adds virtually no extra time to the process. When I applied it to the 3 million rows, it took only 3 times as long (9 seconds) as a million rows did, instead of the previous 2+ hours.
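In sketch form, the control loop looks something like this (table, key column, and batch size are placeholders; it assumes a reasonably dense integer clustered key, and in SIMPLE recovery the log space can be reused between batches):

-- Process-control loop: one key-range batch per iteration
DECLARE @BatchSize INT = 1000000,
        @MinID     INT = 0,
        @MaxID     INT;

SELECT @MaxID = MAX(ID) FROM dbo.T6;

WHILE @MinID <= @MaxID
BEGIN
    UPDATE t
    SET    t.F_Name = fData(t.F_Name) -- same column logic as before
    FROM   dbo.T6 t
    WHERE  t.ID >  @MinID
    AND    t.ID <= @MinID + @BatchSize;

    SET @MinID = @MinID + @BatchSize; -- advance to the next key range
END;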

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Thanks for that. I had written off batching since I didn't need it for locking purposes, and was wondering whether smaller retrieval sizes would speed things up at all.
