Using EXCEPT to Determine Row Differences

  • I had to do a similar job.

    I used triggers to keep the primary key of rows that changed into a log table.

    Then I had a process that processed those rows into the target and emptied the log table on a regular basis.

    Of course, it’s a bit less simple then what I described, but not that much.

    With this approach, I didn't have to write thousands of columns manually.

    Also, this approach avoids comparing each row over and over again between the two databases which can be very expensive when table have millions of rows.

    In my case, the two databases weren't close to each other and the link between them was made through internet.

    So, my solution couldn't afford to use too much bandwidth.

    That is it,

    Please note that my solution doesn't give you row difference. So it's irrelevant to the subject topic.

    I posted this comment to let people know another possibility to achieve synchronisation.

  • brymen (3/25/2014)


    In other words, the article shows how to use [font="Courier New"]EXCEPT[/font] to do something really cool. Compare records with matching primary keys to see if they are identical. But the example doesn't demonstrate the cool part. OK, I get it.

    Not exactly. The article shows how to use EXCEPT to compare records looking for changes. And it includes a technique useful in order to distinguish "Insertable" rows from "Updatable" rows, which is described but not exploited by the example.

    The cool part is the lack of something like:

    ...

    WHERE ISNULL(T1.C1, 'NULLVALUE') <> ISNULL(T2.C1, 'NULLVALUE')

    OR ISNULL(T1.C2, 'NULLVALUE') <> ISNULL(T2.C2, 'NULLVALUE')

    ...

    OR ISNULL(T1.C99, 'NULLVALUE') <> ISNULL(T2.C99, 'NULLVALUE')

    On a side note, regarding other comments: if you're synching data which has no primary key, do yourself a favor and create a synthetic one or regret later about not having one.

  • a.guillaume (3/25/2014)


    hello,

    The join between #source and #Target in the SELECT .... EXCEPT is not essential

    You can obtain the same result with a best cost

    SELECT .[ID]

    , .[Item]

    , .[Price]

    , .[OrderDate]

    , .[Units]

    , .[ShipmentDate]

    FROM [Source]

    EXCEPT

    SELECT [T].[ID]

    , [T].[Item]

    , [T].[Price]

    , [T].[OrderDate]

    , [T].[Units]

    , [T].[ShipmentDate]

    FROM [Target] [T]

    ;

    Am I wrong?

    Arnaud

    By looking at the actual execution plans (on my machine, your results may vary slightly) with the inner joins and without, the cost of the code WITH the joins has a cost (.0072033) of less than half the cost of your example without joins (.0155953). In your example SQL Server must compare every column of each row to every column of every other row. In a real world case, there would probably be some indexed columns that it would use, but the process would probably still be pretty costly. But also in a real world case, you could be comparing millions of rows and many more columns. Furthermore the joined columns would most likely be indexed. In a real world case, the cost difference between the two queries would most likely be much more disparate than in this small example. The example with the joins would most likely far outperform the example without the joins.

    Darren

  • Thank you for demonstrating a simple solution to a common problem.

  • Thanks Douglas, simply brilliant. Now, this is what set based logic is all about!

    The more you are prepared, the less you need it.

  • I've found the EXCEPT command to be also helpful inserting rows. I compare thousands of incoming phone numbers against millions that have already been entered. All I want are the new ones. This works very fast to add only the new ones.

    I've also found out the hard way that for tables with no primary keys, even if completely identical, the EXCEPT command produces incorrect results. Let it be a reminder to key all tables, or else lose a valuable tool. Not that there aren't lots of other reasons to use KEYS.

  • Is there any performance gain for large tables in using except outside of a merge statement?

  • Sorry, see post later

  • I'm ok with Darren

    To understand, I tried this 3 SQL in a lot and obtain this results

    set statistics io on

    ----- SQL1 ------

    SELECT .[ID],.[Item], .[Price], .[OrderDate], .[Units], .[ShipmentDate]

    FROM [Source] INNER JOIN [Target] [T] ON .[ID] = [T].[ID]

    EXCEPT SELECT [T].[ID] , [T].[Item], [T].[Price], [T].[OrderDate] , [T].[Units], [T].[ShipmentDate]

    FROM [Target] [T] INNER JOIN [Source] ON [T].[ID] = .[ID];

    ----- SQL2 ------

    SELECT .[ID],.[Item], .[Price], .[OrderDate], .[Units], .[ShipmentDate] FROM [Source]

    EXCEPT SELECT [T].[ID] , [T].[Item], [T].[Price], [T].[OrderDate] , [T].[Units], [T].[ShipmentDate]

    FROM [Target] [T]

    ----- SQL3 ------

    SELECT .[ID],.[Item], .[Price], .[OrderDate], .[Units], .[ShipmentDate] FROM [Source]

    EXCEPT SELECT [T].[ID] , [T].[Item], [T].[Price], [T].[OrderDate] , [T].[Units], [T].[ShipmentDate]

    FROM [Target] [T] INNER JOIN [Source] ON [T].[ID] = .[ID];

    --The RESULTS

    -- Messages

    ----- SQL1 ------

    (3 row(s) affected)

    Table 'Target'. Nombre d'analyses 0, lectures logiques 24...

    Table 'Source'. Nombre d'analyses 1, lectures logiques 14...

    ----- SQL2 ------

    (3 row(s) affected)

    Table 'Target'. Nombre d'analyses 1, lectures logiques 13...

    Table 'Source'. Nombre d'analyses 1, lectures logiques 2...

    ----- SQL3 ------

    (3 row(s) affected)

    Table 'Target'. Nombre d'analyses 0, lectures logiques 12...

    Table 'Source'. Nombre d'analyses 1, lectures logiques 14...

    --The execution plan

    Query 1 : Query cost (relative to the batch) : 45%

    Query 2 : Query cost (relative to the batch) : 21%

    Query 3: Query cost (relative to the batch) : 34%

    So what about this results ?

  • a.guillaume (3/25/2014)


    I'm ok with Darren

    To understand, I tried this 3 SQL in a lot and obtain this results

    set statistics io on

    ----- SQL1 ------

    SELECT .[ID],.[Item], .[Price], .[OrderDate], .[Units], .[ShipmentDate]

    FROM [Source] INNER JOIN [Target] [T] ON .[ID] = [T].[ID]

    EXCEPT SELECT [T].[ID] , [T].[Item], [T].[Price], [T].[OrderDate] , [T].[Units], [T].[ShipmentDate]

    FROM [Target] [T] INNER JOIN [Source] ON [T].[ID] = .[ID];

    ----- SQL2 ------

    SELECT .[ID],.[Item], .[Price], .[OrderDate], .[Units], .[ShipmentDate] FROM [Source]

    EXCEPT SELECT [T].[ID] , [T].[Item], [T].[Price], [T].[OrderDate] , [T].[Units], [T].[ShipmentDate]

    FROM [Target] [T]

    ----- SQL3 ------

    SELECT .[ID],.[Item], .[Price], .[OrderDate], .[Units], .[ShipmentDate] FROM [Source]

    EXCEPT SELECT [T].[ID] , [T].[Item], [T].[Price], [T].[OrderDate] , [T].[Units], [T].[ShipmentDate]

    FROM [Target] [T] INNER JOIN [Source] ON [T].[ID] = .[ID];

    --The RESULTS

    -- Messages

    ----- SQL1 ------

    (3 row(s) affected)

    Table 'Target'. Nombre d'analyses 0, lectures logiques 24...

    Table 'Source'. Nombre d'analyses 1, lectures logiques 14...

    ----- SQL2 ------

    (3 row(s) affected)

    Table 'Target'. Nombre d'analyses 1, lectures logiques 13...

    Table 'Source'. Nombre d'analyses 1, lectures logiques 2...

    ----- SQL3 ------

    (3 row(s) affected)

    Table 'Target'. Nombre d'analyses 0, lectures logiques 12...

    Table 'Source'. Nombre d'analyses 1, lectures logiques 14...

    --The execution plan

    Query 1 : Query cost (relative to the batch) : 45%

    Query 2 : Query cost (relative to the batch) : 21%

    Query 3: Query cost (relative to the batch) : 34%

    So what about this results ?

    the only problem with this test is that the three queries will return different results if you revise the source and target tables to cover the full range of scenarios. You need rows in Source that are not in Target, and you need rows in Target that are not in Source. The goal is to find only the rows that are in both, and have differences. We don't want to find the new rows in Source, and we don't want to find the rows in Target that are not in the new Source data.

  • "the only problem with this test is that the three queries will return different results if you revise the source and target tables to cover the full range of scenarios. You need rows in Source that are not in Target, and you need rows in Target that are not in Source. The goal is to find only the rows that are in both, and have differences. We don't want to find the new rows in Source, and we don't want to find the rows in Target that are not in the new Source data."

    Ok, sorry. I now understand the subtlety

    I think that "My English is not as rich as my tailor"

  • I get the same result doing this simple query:

    select *

    from #source

    except

    select *

    from #target

    Am I missing the purpose of this demonstration?

  • Run the script below in a query analyzer on a test database

    I changed the names from #source and #target to ExceptTestSource and ExceptTestTarget, but everything else is the same.

    Please note the addition of the INTERSECT example as well. It may help provide some clarity around why the joins are necessary.

    It is also important to pay close attention to the definition of EXCEPT (and INTERSECT for that matter)

    EXCEPT returns any distinct values from the left query that are not also found on the right query.

    INTERSECT returns any distinct values that are returned by both the query on the left and right sides of the INTERSECT operand.

    NOTE: Be sure and take the author's advice and add one more row into the source table and then the different results will be more obvious when using the joins.

    CREATE TABLE [ExceptTestSource]

    (

    [ID] INT NOT NULL

    PRIMARY KEY

    IDENTITY(1, 1)

    , [Item] VARCHAR(100) NOT NULL

    UNIQUE

    , [Price] MONEY NOT NULL

    , [OrderDate] DATE NOT NULL

    , [Units] NUMERIC(10, 4) NULL

    , [ShipmentDate] DATE NULL

    );

    CREATE TABLE [ExceptTestTarget]

    (

    [ID] INT NOT NULL

    PRIMARY KEY

    , [Item] VARCHAR(100) NOT NULL

    UNIQUE

    , [Price] MONEY NOT NULL

    , [OrderDate] DATE NOT NULL

    , [Units] NUMERIC(10, 4) NULL

    , [ShipmentDate] DATE NULL

    );

    INSERT INTO [ExceptTestSource]

    ( [Item], [Price], [OrderDate], [Units], [ShipmentDate] )

    VALUES ( 'Apple', 2.49, '1/1/2001', NULL, '1/02/2001' )

    , ( 'Coconut', 0.99, '3/3/2003', 1.35, '3/4/2003' )

    , ( 'Eggplant', 1.19, '5/5/2005', NULL, '5/6/2005' )

    , ( 'Fig', 0.49, '6/6/2006', NULL, '6/7/2006' )

    , ( 'Kiwi', 0.69, '11/11/2011', NULL, '11/12/2011' )

    , ( 'Lychee', 0.29, '12/12/2012', NULL, '12/14/2012' );

    INSERT INTO [ExceptTestTarget]

    ( [ID]

    , [Item]

    , [Price]

    , [OrderDate]

    , [Units]

    , [ShipmentDate]

    )

    SELECT .[ID]

    , .[Item]

    , .[Price]

    , .[OrderDate]

    , .[Units]

    , .[ShipmentDate]

    FROM [ExceptTestSource] ;

    UPDATE ExceptTestSource

    SET [ShipmentDate] = NULL

    WHERE [Item] = 'Coconut';

    UPDATE ExceptTestSource

    SET [Price] = 22.44

    WHERE [Item] = 'Eggplant';

    UPDATE ExceptTestSource

    SET [Units] = 1000.55

    WHERE [Item] = 'Lychee';

    INSERT EXTRA ROW INTO SOURCE TABLE HERE FOR AN ID THAT IS NOT IN TARGET TABLE

    SELECT .[ID]

    , .[Item]

    , .[Price]

    , .[OrderDate]

    , .[Units]

    , .[ShipmentDate]

    FROM ExceptTestSource

    INNER JOIN ExceptTestTarget [T]

    ON .[ID] = [T].[ID]

    EXCEPT

    SELECT [T].[ID]

    , [T].[Item]

    , [T].[Price]

    , [T].[OrderDate]

    , [T].[Units]

    , [T].[ShipmentDate]

    FROM ExceptTestTarget [T]

    INNER JOIN ExceptTestSource

    ON [T].[ID] = .[ID];

    SELECT .[ID]

    , .[Item]

    , .[Price]

    , .[OrderDate]

    , .[Units]

    , .[ShipmentDate]

    FROM ExceptTestSource

    INNER JOIN ExceptTestTarget [T]

    ON .[ID] = [T].[ID]

    INTERSECT

    SELECT [T].[ID]

    , [T].[Item]

    , [T].[Price]

    , [T].[OrderDate]

    , [T].[Units]

    , [T].[ShipmentDate]

    FROM ExceptTestTarget [T]

    INNER JOIN ExceptTestSource

    ON [T].[ID] = .[ID];

    SELECT .[ID]

    , .[Item]

    , .[Price]

    , .[OrderDate]

    , .[Units]

    , .[ShipmentDate]

    FROM ExceptTestSource

    EXCEPT

    SELECT [T].[ID]

    , [T].[Item]

    , [T].[Price]

    , [T].[OrderDate]

    , [T].[Units]

    , [T].[ShipmentDate]

    FROM ExceptTestTarget [T]

  • Yes, young Grasshopper. The example doesn't demonstrate the purpose of the article. If new records were inserted into one of the tables, or records deleted from either table, then the example query would be different than just normal EXCEPT usage.

Viewing 15 posts - 16 through 30 (of 72 total)

You must be logged in to reply to this topic. Login to reply