UPDATE when the values are the same

  • L' Eomot Inversé (9/16/2012)


    Do you really imagine that the query processor's interface to the storage engine operates on pages as opposed to on rows or attributes?

How does it matter what I imagine?

Anyway, you've already made it clear when responding to Gail that you aren't interested in looking into any references provided for you, so why do you expect me to waste my time providing them?

    Did you open that link yourself?

    I read the book Gail referred to from start to finish.

    And did not find a single reference in there relevant to this topic.

    Waste of time.

You clearly don't consider posting personal attacks a waste of time, unlike posting professional comments.

    You made it clear that you lack professionalism and don't have anything to say on the matter.

Would you mind sparing us your rubbish input?

    _____________
    Code for TallyGenerator

  • SQL Kiwi (9/16/2012)


    Sergiy, I should add that the fun in the example was in working out *why* that particular case gave the performance results it did. With a long string of a suitable collation, it is more expensive to do the collation-aware comparison than it is to decide not to perform the update based on a binary comparison in the storage engine. This was a point you refused to accept earlier, so I thought you might find it educational.

If you thought about it, you'd most probably find it educational for yourself.

    🙂

    I figured out why you built the test this way and I have no argument about the point:

comparing the binary content of pages is, and always will be, more effective than comparing strings row by row (usually several different comparisons over the data in the same page).

    I quickly added another sample test to prove the point:

UPDATE dbo.Test WITH (TABLOCKX)
SET Name = CASE WHEN Name = @Value THEN @Value ELSE Name END;

@Value is the string from your example.

On my machine this changed version of the "Non-updating" UPDATE usually takes about twice as long as the "Filtered" one.

But I must admit the results vary quite significantly from run to run. Once I even got "CASE Non-updating" to finish faster than the simple "Non-updating".
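One way to smooth that variance - a rough harness sketch, assuming the same dbo.Test table and @Value as above (the loop and @timings table variable are illustrative, not part of the original test) - is to repeat the run several times and keep every timing:

DECLARE @i int = 1,
        @start datetime2,
        @Value char(200) = REPLICATE('X', 200);
DECLARE @timings TABLE (run int, time_ms int);

WHILE @i <= 5
BEGIN
    -- Level the playing field before each run, as in the main script
    DBCC FREEPROCCACHE;
    CHECKPOINT;

    SET @start = SYSDATETIME();

    UPDATE dbo.Test WITH (TABLOCKX)
    SET Name = CASE WHEN Name = @Value THEN @Value ELSE Name END;

    INSERT @timings (run, time_ms)
    VALUES (@i, DATEDIFF(MILLISECOND, @start, SYSDATETIME()));

    SET @i += 1;
END;

-- Inspect all five timings rather than trusting a single run
SELECT run, time_ms FROM @timings;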

But at the end of the day, as follows from your example:

    - the query submits affected rows and creates updated versions of the pages;

- the storage engine compares old and new versions and decides if further action is required.

Some may still argue that it's rows, not newer versions of pages, that are compared by the storage engine.

But so far I'm aware only of page-level activity performed by the storage engine.

    _____________
    Code for TallyGenerator

  • Sergiy (9/16/2012)


Actually, you did. The trigger is executed after the UPDATE. If SQL Server figures out that the values are the same and does not proceed with writing to the pages, then the trigger has no source for the "inserted" table.

You misunderstood. If the reading side of an update plan produces no rows (i.e. they are filtered out by e.g. a WHERE clause predicate) then no rows arrive at the update operator in the plan, so there is nothing in the inserted or deleted tables.

I was not talking about the overheads caused by a trigger. I was asking whether SQL Server will do something different for a table with a trigger in order to have the "inserted" and "deleted" tables populated.

    That is asking about the overheads caused by a trigger. Yes, SQL Server does something different for a table with a trigger: it versions the changed rows (non-updating or otherwise). I demonstrated this in my recent blog post http://bit.ly/ForwardGhost

Still not clear why you think so. If SQL Server performs the same actions as in the case with a WHERE clause, and binary comparison of pages is faster than collation-based comparison of strings - why do you advocate the worse of these two options?

    This is just the same point you made in one of your first replies to this thread (the Event table update). In general, avoiding a non-updating update is the better option. I provided my counter-example to show an interesting (and fun) exception.

  • SQL Kiwi (9/16/2012)


    Sergiy (9/16/2012)


Actually, you did. The trigger is executed after the UPDATE. If SQL Server figures out that the values are the same and does not proceed with writing to the pages, then the trigger has no source for the "inserted" table.

You misunderstood. If the reading side of an update plan produces no rows (i.e. they are filtered out by e.g. a WHERE clause predicate) then no rows arrive at the update operator in the plan, so there is nothing in the inserted or deleted tables.

    I don't think so.

Even if no rows are updated, the tables are still created.

    SQL Server still allocates pages for the tables, names all the columns, etc.

    They have nothing in them - well, that means the source of the records for the tables contained no records.

    But it does not mean the source of the "nothing" was not there.

    _____________
    Code for TallyGenerator

I did a little experiment using string values of different lengths.

    Here is the base script:

USE tempdb;
SET NOCOUNT ON;
GO
CREATE TABLE dbo.Test
(
    Name char(200)
        COLLATE Latin1_General_CI_AI NULL
        --COLLATE Latin1_General_BIN NULL
);
GO
-- Load 123,456 identical fixed-length strings
INSERT dbo.Test
SELECT TOP (123456) REPLICATE('X', 200)
FROM sys.columns AS c
CROSS JOIN sys.columns AS c2;
GO
DBCC SHOWCONTIG ('dbo.Test') WITH all_indexes, TABLERESULTS;
DBCC FREEPROCCACHE;
CHECKPOINT;
GO
-- Test 1: filter the non-updating rows out in the WHERE clause
DECLARE @start datetime2 = SYSDATETIME(), @Value char(200);
SET @Value = REPLICATE('X', 200);
SET STATISTICS IO ON;
UPDATE dbo.Test WITH (TABLOCKX)
SET Name = @Value
WHERE NOT Name = @Value;
SET STATISTICS IO OFF;
SELECT test = 'Filtered', time_ms = DATEDIFF(MILLISECOND, @start, SYSDATETIME());
GO
DBCC FREEPROCCACHE;
CHECKPOINT;
GO
-- Test 2: non-updating update of every row
DECLARE @start datetime2 = SYSDATETIME(), @Value char(200);
SET @Value = REPLICATE('X', 200);
SET STATISTICS IO ON;
UPDATE dbo.Test WITH (TABLOCKX)
SET Name = Name;
SET STATISTICS IO OFF;
SELECT test = 'Non-updating', time_ms = DATEDIFF(MILLISECOND, @start, SYSDATETIME());
GO
DBCC FREEPROCCACHE;
CHECKPOINT;
GO
-- Test 3: non-updating update via a CASE expression
DECLARE @start datetime2 = SYSDATETIME(), @Value char(200);
SET @Value = REPLICATE('X', 200);
SET STATISTICS IO ON;
UPDATE dbo.Test WITH (TABLOCKX)
SET Name = CASE WHEN Name = @Value THEN @Value ELSE Name END;
SET STATISTICS IO OFF;
SELECT test = 'Non-updating CASE', time_ms = DATEDIFF(MILLISECOND, @start, SYSDATETIME());
GO
DBCC FREEPROCCACHE;
CHECKPOINT;
GO
-- Test 4: genuine update of every row
DECLARE @start datetime2 = SYSDATETIME(), @Value char(200);
SET @Value = REPLICATE('Y', 200);
SET STATISTICS IO ON;
UPDATE dbo.Test WITH (TABLOCKX)
SET Name = @Value;
SET STATISTICS IO OFF;
SELECT test = 'Updating', time_ms = DATEDIFF(MILLISECOND, @start, SYSDATETIME());
GO
DROP TABLE dbo.Test;

Basically no difference from yours, Paul, except that I used REPLICATE so I could easily and consistently switch between char(200), char(50), or char(8000).

I recorded the times and calculated each method's effectiveness relative to "Filtered".

Here is the outcome:

char(50) COLLATE Latin1_General_CI_AI
Filtered: 87 ms
Non-updating CASE: 248 ms (285%)
Non-updating: 114 ms (131%)
Updating: 206 ms (237%)

char(200) COLLATE Latin1_General_CI_AI
Filtered: 295 ms
Non-updating CASE: 472 ms (160%)
Non-updating: 137 ms (46%)
Updating: 275 ms (93%)

char(500) COLLATE Latin1_General_CI_AI
Filtered: 716 ms
Non-updating CASE: 945 ms (132%)
Non-updating: 194 ms (27%)
Updating: 527 ms (74%)

char(1000) COLLATE Latin1_General_CI_AI
Filtered: 1421 ms
Non-updating CASE: 1702 ms (120%)
Non-updating: 267 ms (19%)
Updating: 1010 ms (71%)

char(8000) COLLATE Latin1_General_CI_AI
Filtered: 10997 ms
Non-updating CASE: 12796 ms (116%)
Non-updating: 1321 ms (12%)
Updating: 9091 ms (83%)

    No surprise so far.

Longer strings mean longer comparison operations, so the winning margin for "Non-updating" grows.

Then I decided to minimise the impact of collation rules by choosing the least expensive collation.

    Here is the outcome:

char(50) COLLATE Latin1_General_BIN
Filtered: 13 ms
Non-updating CASE: 151 ms (1162%)
Non-updating: 113 ms (869%)
Updating: 203 ms (1562%)

char(200) COLLATE Latin1_General_BIN
Filtered: 17 ms
Non-updating CASE: 178 ms (1047%)
Non-updating: 138 ms (812%)
Updating: 252 ms (1482%)

char(500) COLLATE Latin1_General_BIN
Filtered: 31 ms
Non-updating CASE: 240 ms (774%)
Non-updating: 191 ms (616%)
Updating: 548 ms (1768%)

char(1000) COLLATE Latin1_General_BIN
Filtered: 44 ms
Non-updating CASE: 322 ms (732%)
Non-updating: 269 ms (611%)
Updating: 1217 ms (2766%)

char(8000) COLLATE Latin1_General_BIN
Filtered: 264 ms
Non-updating CASE: 1708 ms (647%)
Non-updating: 1328 ms (503%)
Updating: 10785 ms (4085%)

    "Filtered" solution wins clearly.

    With the winning rate which does not leave any doubt: overhead of opening/closing transaction, applying different kind of locks (actually we have the same table lock forced in all cases), etc. cannot bring such a difference.

    So, what do you think SQL Server is doing during those 1.1 seconds which is the difference between "filtered" and "Non-updating"?

    Interpretations are welcome.

    _____________
    Code for TallyGenerator

  • Sergiy (9/16/2012)


    I don't think so.

Even if no rows are updated, the tables are still created.

    SQL Server still allocates pages for the tables, names all the columns, etc.

    They have nothing in them - well, that means the source of the records for the tables contained no records.

    But it does not mean the source of the "nothing" was not there.

The inserted and deleted tables are not separate real tables; SQL Server does not "allocate pages, name all the columns, etc." The inserted and deleted virtual tables are a 'view' of version store records in SQL Server 2005 and later. Before SQL Server 2005, a backward scan of the transaction log records was used.

    Without a trigger, no versions are generated. With a trigger, versions may be generated. The script below demonstrates this:

CREATE TABLE dbo.Test (col1 varchar(max) NULL);
GO
INSERT dbo.Test (col1) VALUES ('Hello');
GO
-- No trigger, nothing in the version store
BEGIN TRANSACTION;
UPDATE dbo.Test SET col1 = 'Hello';
SELECT * FROM sys.dm_tran_version_store AS dtvs;
COMMIT TRANSACTION;
GO
-- A trigger that does nothing
CREATE TRIGGER trg
ON dbo.Test AFTER UPDATE
AS RETURN;
GO
CHECKPOINT;
GO
-- Same non-updating update
BEGIN TRANSACTION;
UPDATE dbo.Test SET col1 = 'Hello';
-- Now we have two version store records
SELECT
    decoded = CONVERT(char(5), SUBSTRING(dtvs.record_image_first_part, 12, 5)),
    *
FROM sys.dm_tran_version_store AS dtvs;
COMMIT TRANSACTION;

  • Sergiy (9/16/2012)


    I figured out why you built the test this way and I have no argument about the point:

comparing the binary content of pages is, and always will be, more effective than comparing strings row by row (usually several different comparisons over the data in the same page).

    You're almost there, but the point is the binary comparisons in the storage engine are per row, not per page. It is a shame you are not in a position to run the debugger session yourself, because it shows the row-by-row processing very clearly:

    - The update operator asks for a row of update data from the table scan operator.

    - The storage engine performs the 'update' for this row (quotes because a write might not actually occur).

    - The update operator asks for the next row of information from the table scan operator.

    - The process repeats until no more rows are available from the table scan.

    A 'binary comparison of pages' simply does not occur.
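For anyone without a debugger to hand, a rough way to see the same row-by-row flow (a sketch, assuming the dbo.Test table from the test script earlier in this thread) is to look at the per-operator row counts that SET STATISTICS PROFILE reports:

-- Runs the statement and returns the actual plan with per-operator
-- Rows and Executes columns appended.
SET STATISTICS PROFILE ON;

UPDATE dbo.Test WITH (TABLOCKX)
SET Name = Name;

SET STATISTICS PROFILE OFF;
-- The Rows column shows all 123,456 rows flowing from the Table Scan
-- to the Table Update operator, even though nothing changes on disk.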

But at the end of the day, as follows from your example:

    - the query submits affected rows and creates updated versions of the pages;

- the storage engine compares old and new versions and decides if further action is required.

Some may still argue that it's rows, not newer versions of pages, that are compared by the storage engine.

But so far I'm aware only of page-level activity performed by the storage engine.

    This is the basic misunderstanding - processing occurs row by row. This 'iterator model' is common knowledge and documented e.g. in Microsoft Press internals books, by people who wrote the SQL Server code.

    It would be horribly inefficient for the storage engine to compare pages (not rows) in the general case - we would need to copy the whole page before every row update, perform the update, and then compare the whole new page with the whole old page. Update information is not guaranteed to arrive at the update operator in page order (it depends e.g. on query plan shape) so it could mean copying and comparing a page hundreds of times over, once for each updated row on the page. That would be madness. It is much more efficient not to copy the page at all, and simply compare the binary value passed in for update with the binary value already in the row - no expensive page copy & compare needed.

  • Sergiy (9/16/2012)


Basically no difference from yours, Paul, except that I used REPLICATE so I could easily and consistently switch between char(200), char(50), or char(8000).

    I chose a literal string rather than REPLICATE for complicated reasons, and to be fair to all the methods (see my post http://bit.ly/ComputeScalar if you are interested). To avoid debates about which method might evaluate REPLICATE more times than another, I used a literal string.

    "Filtered" solution wins clearly.

    Which is why I say filtering is generally the best option (except in cases like my carefully-constructed example).

    With the winning rate which does not leave any doubt: overhead of opening/closing transaction, applying different kind of locks (actually we have the same table lock forced in all cases), etc. cannot bring such a difference. So, what do you think SQL Server is doing during those 1.1 seconds which is the difference between "filtered" and "Non-updating"? Interpretations are welcome.

    It's all about how many rows are produced by the Table Scan. When filtered, no rows are produced, so no update processing happens. The main cost is evaluating 'WHERE NOT Name = @Value' 123,456 times. With a binary collation, this comparison is fast; with a more complex collation it can be quite slow (slower than not filtering at all, and just passing all the rows to the update operator for binary comparison, row by row, in the storage engine).

    In your CASE test, all 123,456 rows are produced one at a time from the table scan and processed one at a time by the update operator. We now have the overhead of passing rows one at a time between the scan and update operators, and we have to compare the result of the CASE expression with the value already present in the row, for each row. This explains the difference in execution time.
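As a rough check of that explanation (a sketch, assuming the dbo.Test table from the earlier script), the cost of the comparison alone can be isolated in a plain SELECT with no update machinery involved; with the CI_AI collation the count should run noticeably slower than with the BIN one:

DECLARE @start datetime2,
        @Value char(200) = REPLICATE('X', 200),
        @matches int;

-- Collation-aware comparison, evaluated once per row
SET @start = SYSDATETIME();
SELECT @matches = COUNT(*)
FROM dbo.Test
WHERE Name = @Value COLLATE Latin1_General_CI_AI;
SELECT test = 'CI_AI compare', time_ms = DATEDIFF(MILLISECOND, @start, SYSDATETIME());

-- The same comparison forced to a binary collation
SET @start = SYSDATETIME();
SELECT @matches = COUNT(*)
FROM dbo.Test
WHERE Name = @Value COLLATE Latin1_General_BIN;
SELECT test = 'BIN compare', time_ms = DATEDIFF(MILLISECOND, @start, SYSDATETIME());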

  • L' Eomot Inversé (9/16/2012)


If the query processor figures out that it is not going to update the row (decided on the basis of the WHERE clause and ON clause), that row will not be in the inserted or deleted tables. If the query passes the row update to the storage engine and the storage engine decides not to update the row in its buffered page (because the update describes the same binary value for each field as is already in the buffer), the row will be in the inserted and deleted tables. These are two completely different occurrences, involving different queries.

It doesn't perform the same actions in the two cases. In one case the storage engine is invoked (but doesn't modify the page) and the non-updating update can cause a trigger to fire; in the other case the query processor doesn't do the update because the WHERE clause excludes it, the storage engine is not invoked, and there is no non-updating update to cause a trigger to be invoked. This seems to me to be a big and clear difference.

    That matches my understanding, Tom.

For those who don't have debugger skills and want to look at the internals of this, I wrote a blog post that shows how to dig in using Extended Events and materialized callstacks for each of the events fired when the UPDATE tests are run.

    http://sqlskills.com/blogs/jonathan/post/Looking-at-Non-Updating-Updates-with-Extended-Events.aspx

The materialized callstacks validate everything Paul has been saying in this thread and definitively refute the argument that SQL Server operates on pages, not rows. They also validate that row versions are generated, even when nothing changes, if a trigger exists.
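For reference, a skeletal sketch of that kind of event session (the exact events and setup are in the linked post; the session name, event choice, and session_id filter here are illustrative, and the callstack action needs trace flag 3656 plus public symbols to resolve into readable names):

-- Capture the callstack each time the test session acquires a lock
CREATE EVENT SESSION [NonUpdatingUpdate] ON SERVER
ADD EVENT sqlserver.lock_acquired
(
    ACTION (package0.callstack)
    WHERE (sqlserver.session_id = 53)  -- hypothetical test session id
)
ADD TARGET package0.ring_buffer;
GO
ALTER EVENT SESSION [NonUpdatingUpdate] ON SERVER STATE = START;
-- ... run the UPDATE tests, read the ring buffer, then stop the session
ALTER EVENT SESSION [NonUpdatingUpdate] ON SERVER STATE = STOP;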

    Jonathan Kehayias | Principal Consultant | MCM: SQL Server 2008
    My Blog | Twitter | MVP Profile
    Training | Consulting | Become a SQLskills Insider
Troubleshooting SQL Server: A Guide for Accidental DBAs
