Advice on BINARY_CHECKSUM or Alternatives

  • Hi:

    I inherited a process that compares about 90 fields from an import staging table to the corresponding fields in our production database. We determine which records have updates and plug them into a change audit table. The process currently has a very, very long WHERE statement that looks like this:

    WHERE

    Staging.Column1 <> Prod.Column1

    OR Staging.Column2 <> Prod.Column2

    OR Staging.Column3 <> Prod.Column3

    ...and so on for all 90 fields. I had what I thought was the brilliant idea of using a BINARY_CHECKSUM comparison to clean up the process a bit, but I didn't take into account the fact that all of the fields in the staging table are varchar, which is not the case in the production table. The differing data types end up resulting in differing BINARY_CHECKSUM values even if the actual values of the fields are the same.

    My question is this: Do you think I should use a CONVERT on every field in my BINARY_CHECKSUM to ensure that the data types are the same? Or would it be better to just keep the giant string of ORs and forget BINARY_CHECKSUM? Or is there another alternative I haven't tried that would work better?

    Thanks in advance!

  • can you create a view on top of the staging table tat would type the data into the same matching types as the production data?

    then you could binary checksum the view to compare agaisnt production, right?

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Oh, good call! A view didn't even occur to me for some reason. Thanks! 🙂

  • Binary checksums are ok but they have thier limitations, asan Example if a Decimal number changes by a factor of 10m, eg 100.00 to 10.00 you will get the same checksum.

    Proof

    Declare @T1 as Decimal(18,6)

    Declare @T2 as Decimal(18,6)

    Set @T1=100.00

    Set @T2=@T1/10

    Print @T1

    Print Binary_Checksum(@T1)

    Print @T2

    Print Binary_Checksum(@T2)

    To show that it isnt the same if the Datatype is an Int

    Declare @T3 as int

    Declare @T4 as int

    Set @T3=100

    Set @T4=@T3/10

    Print @T3

    Print Binary_Checksum(@T3)

    Print @T4

    Print Binary_Checksum(@T4)

    You might also want to look at HASHBYTES but it has additional overheads.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • Thanks for the info! I did do some research into the limitations of BINARY_CHECKSUM, and I ended up landing on it because the data I'm dealing with won't be anything that requires a great degree of mathematical precision or calculation. It's really just demographic data with a few money fields.

    I'm actually thinking that instead of building the view on the staging table, I might build it on the production table so I can make everything varchar and compare that way. I know comparing strings is less efficient than comparing integers, but this approach would have two advantages: One, I wouldn't have to go back and change the view if a production field's datatype happened to change (which is something over which I have no control); and two, it would eliminate any issues like the example you provided. What do you think? This is my first time doing something like this, so I want to make sure my reasoning is sound.

  • Have you thought about creating a calculated column using the BINARY_CHECKSUM on the table with to save on creating an additional view.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • As an alternative, you could use NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) to compare rows. Some code to play with:

    DECLARE @Stag TABLE (Column1 VARCHAR(100), Column2 VARCHAR(100), Column3 VARCHAR(100))

    DECLARE @Prod TABLE (Column1 INT, Column2 DATETIME, Column3 VARCHAR(100))

    INSERT INTO @Stag

    VALUES

    ('1', '2012-12-01', 'Some Value'),

    ('2', '2011-11-06', 'Some Text'),

    ('3', '2012-01-01', 'More text too'),

    ('4', '2012-07-31', 'Any value'),

    ('5', '2012-08-30', 'Any Value')

    INSERT INTO @Prod

    VALUES

    (1, '2012-12-01', 'Some Value'),

    (2, '2011-11-05', 'Some Text'),

    (3, '2012-01-01', 'More text'),

    (4, '2012-07-31', 'Any value'),

    (6, '2012-07-31', 'Text')

    SELECT

    CASE

    WHEN [@Stag].Column1 IS NULL THEN 'Delete'

    WHEN [@Prod].Column1 IS NULL THEN 'Insert'

    WHEN NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) THEN 'Update'

    END AS Op,

    *

    FROM

    @Stag

    FULL OUTER JOIN

    @Prod ON [@Stag].Column1 = [@Prod].Column1

    Edit: also note that BINARY_CHECKSUM is not always realible. From BOL:

    BINARY_CHECKSUM(*) will return a different value for most, but not all, changes to the row, and can be used to detect most row modifications.

  • Peter Brinkhaus (2/7/2012)


    As an alternative, you could use NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) to compare rows.

    Using EXCEPT should give you the same results, is probably more efficient, and is also a lot clearer.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen (2/8/2012)


    Peter Brinkhaus (2/7/2012)


    As an alternative, you could use NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) to compare rows.

    Using EXCEPT should give you the same results, is probably more efficient, and is also a lot clearer.

    Different semantic and results.

    The NOT EXISTS ( ... INTERSECT ...) pattern is designed to entirely replace the long lists of expressions like (column1 <> column2) OR (column1 IS NULL and column2 IS NOT NULL) OR (column1 IS NOT NULL AND column2 IS NULL). See http://sqlblog.com/blogs/paul_white/archive/2011/06/22/undocumented-query-plans-equality-comparisons.aspx

  • SQL Kiwi (2/8/2012)


    drew.allen (2/8/2012)


    Peter Brinkhaus (2/7/2012)


    As an alternative, you could use NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) to compare rows.

    Using EXCEPT should give you the same results, is probably more efficient, and is also a lot clearer.

    Different semantic and results.

    I don't see it. As far as I can tell these two queries produce exactly the same results and the execution plans are almost identical.

    SELECT

    CASE

    WHEN [@Stag].Column1 IS NULL THEN 'Delete'

    WHEN [@Prod].Column1 IS NULL THEN 'Insert'

    WHEN NOT EXISTS (SELECT [@Stag].* INTERSECT SELECT [@Prod].*) THEN 'Update'

    END AS Op,

    *

    FROM

    @Stag

    FULL OUTER JOIN

    @Prod ON [@Stag].Column1 = [@Prod].Column1

    SELECT

    CASE

    WHEN [@Stag].Column1 IS NULL THEN 'Delete'

    WHEN [@Prod].Column1 IS NULL THEN 'Insert'

    WHEN EXISTS (SELECT [@Stag].* EXCEPT SELECT [@Prod].*) THEN 'Update'

    END AS Op,

    *

    FROM

    @Stag

    FULL OUTER JOIN

    @Prod ON [@Stag].Column1 = [@Prod].Column1

    Admittedly, I haven't studied the execution plan in detail to see exactly how different they are, but a cursory overview indicates that EXCEPT has one fewer filters on it.

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • drew.allen (2/8/2012)


    I don't see it. As far as I can tell these two queries produce exactly the same results and the execution plans are almost identical.

    I read your original brief statement as suggesting the whole thing could be replaced by EXCEPT. Now you have added the detail, your meaning is clear.

    Admittedly, I haven't studied the execution plan in detail to see exactly how different they are, but a cursory overview indicates that EXCEPT has one fewer filters on it.

    GBN from dba.se asked the same question recently on my blog. The answer I gave was:

    In my experience the INTERSECT form tends to give produce 'better' plans, because the optimizer more readily simplifies the logic into a simple anti-semi-join.

    Using the test script, I do get a slightly different plan for the EXCEPT form (an extra anti-semi-join, constant scan, and a start-up filter). This may be version-dependent (I used R2 build 2789).

    Finally, I personally find the INTERSECT form to be more intuitive, but I can quite see that others might prefer to think of the problem the other way around.

    Those comments refer to the example on my blog, not here, of course. The apparently more complex plan for INTERSECT in Peter's example is a consequence of the mismatched types. If the data types for @Stag and @Prod match, the usual INTERSECT plan appears.

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply