Using EXCEPT to Determine Row Differences

  • FootyRef (3/25/2014)


    I get the same result doing this simple query:

    select *

    from #source

    except

    select *

    from #target

    Am I missing the purpose of this demonstration?

    Yes. When comparing very large data sets, your example would perform poorly. See my earlier post for more details.

  • I guess I am not looking at the performance but looking at the results. Why dowe need to do the join when either query produces the same results?

    SELECT .[ID]

    , .[Item]

    , .[Price]

    , .[OrderDate]

    , .[Units]

    , .[ShipmentDate]

    FROM [#Source]

    INNER JOIN [#Target] [T]

    ON .[ID] = [T].[ID]

    EXCEPT

    SELECT [T].[ID]

    , [T].[Item]

    , [T].[Price]

    , [T].[OrderDate]

    , [T].[Units]

    , [T].[ShipmentDate]

    FROM [#Target] [T]

    INNER JOIN [#Source]

    ON [T].[ID] = .[ID];

    -- OR --

    select *

    from #source

    except

    select *

    from #target

    IDItemPriceOrderDateUnitsShipmentDate

    2Coconut0.992003-03-031.3500NULL

    3Eggplant22.442005-05-05NULL2005-05-06

    6Lychee0.292012-12-121000.55002012-12-14

  • FootyRef (3/25/2014)


    I guess I am not looking at the performance but looking at the results. Why dowe need to do the join when either query produces the same results?

    SELECT .[ID]

    , .[Item]

    , .[Price]

    , .[OrderDate]

    , .[Units]

    , .[ShipmentDate]

    FROM [#Source]

    INNER JOIN [#Target] [T]

    ON .[ID] = [T].[ID]

    EXCEPT

    SELECT [T].[ID]

    , [T].[Item]

    , [T].[Price]

    , [T].[OrderDate]

    , [T].[Units]

    , [T].[ShipmentDate]

    FROM [#Target] [T]

    INNER JOIN [#Source]

    ON [T].[ID] = .[ID];

    -- OR --

    select *

    from #source

    except

    select *

    from #target

    IDItemPriceOrderDateUnitsShipmentDate

    2Coconut0.992003-03-031.3500NULL

    3Eggplant22.442005-05-05NULL2005-05-06

    6Lychee0.292012-12-121000.55002012-12-14

    The example doesn't demonstrate the purpose of the article. If new records were inserted into one of the tables, or records deleted from either table, then the example query would be different than the EXCEPT usage you have shown...

    select *

    from #source

    except

    select *

    from #target

  • Yes. When comparing very large data sets, your example would perform poorly. See my earlier post for more details.

    I personally have found this method to be very fast. Admittedly I haven't compared them, though I didn't have tables to join. Are the temp tables the issue? I was comparing from a staging table to a dimension table as part of a warehouse ETL process.

  • FootyRef (3/25/2014)


    I guess I am not looking at the performance but looking at the results. Why do we need to do the join when either query produces the same results?

    SELECT .[ID]

    , .[Item]

    , .[Price]

    , .[OrderDate]

    , .[Units]

    , .[ShipmentDate]

    FROM [#Source]

    INNER JOIN [#Target] [T]

    ON .[ID] = [T].[ID]

    EXCEPT

    SELECT [T].[ID]

    , [T].[Item]

    , [T].[Price]

    , [T].[OrderDate]

    , [T].[Units]

    , [T].[ShipmentDate]

    FROM [#Target] [T]

    INNER JOIN [#Source]

    ON [T].[ID] = .[ID];

    -- OR --

    select *

    from #source

    except

    select *

    from #target

    IDItemPriceOrderDateUnitsShipmentDate

    2Coconut0.992003-03-031.3500NULL

    3Eggplant22.442005-05-05NULL2005-05-06

    6Lychee0.292012-12-121000.55002012-12-14

    If you are not concerned about performance, and the two queries produce the same results, then you have the luxury of choosing either one you want. Typically, in really world situations, we need queries to not only return the correct results, but also return the results in a reasonable amount of time.

  • RonKyle (3/25/2014)


    Yes. When comparing very large data sets, your example would perform poorly. See my earlier post for more details.

    I personally have found this method to be very fast. Admittedly I haven't compared them, though I didn't have tables to join. Are the temp tables the issue? I was comparing from a staging table to a dimension table as part of a warehouse ETL process.

    Create an execution plan with your current query and then try adding the joins and re-create the execution plan. I think you will find that the query with the joins will have a much lower cost.

  • Hi. Someone already asked this question earlier, but it was ignored. So, I'll ask again. How does this method compares to the use of a MERGE command ? The example in this article describes the method of finding the changes in the matching records. The next step would be to APPLY those changes to the target table. Then, we would have to run similar statements using EXCEPT to find new records in the Source table and then INSERT them into the Target table. Then another run with the EXCEPT to find DELETED records and apply that to a Target table. So, what one MERGE command does in one step, we would have to go through 6 different steps to achieve the same results. So, my question is, is it worth it ?

  • To make it clear why the JOINs are there, you should add to the end of the example:

    DELETE FROM #Source WHERE item = 'Apple'

    DELETE FROM #Target WHERE item = 'Kiwi'

    Then run both "SELECT EXCEPT" statements with and without joins:

    SELECT .[ID]

    , .[Item]

    , .[Price]

    , .[OrderDate]

    , .[Units]

    , .[ShipmentDate]

    FROM [#Source]

    EXCEPT

    SELECT [T].[ID]

    , [T].[Item]

    , [T].[Price]

    , [T].[OrderDate]

    , [T].[Units]

    , [T].[ShipmentDate]

    FROM [#Target] [T]

    GO

    SELECT .[ID]

    , .[Item]

    , .[Price]

    , .[OrderDate]

    , .[Units]

    , .[ShipmentDate]

    FROM [#Source]

    INNER JOIN [#Target] [T]

    ON .[ID] = [T].[ID]

    EXCEPT

    SELECT [T].[ID]

    , [T].[Item]

    , [T].[Price]

    , [T].[OrderDate]

    , [T].[Units]

    , [T].[ShipmentDate]

    FROM [#Target] [T]

    INNER JOIN [#Source]

    ON [T].[ID] = .[ID];

    Results:

    ID Item Price OrderDate Units ShipmentDate

    ----------- -------------------- --------------------- ---------- --------------------------------------- ------------

    2 Coconut 0.99 2003-03-03 1.3500 NULL

    3 Eggplant 22.44 2005-05-05 NULL 2005-05-06

    5 Kiwi 0.69 2011-11-11 NULL 2011-11-12

    6 Lychee 0.29 2012-12-12 1000.5500 2012-12-14

    (4 row(s) affected)

    ID Item Price OrderDate Units ShipmentDate

    ----------- -------------------- --------------------- ---------- --------------------------------------- ------------

    2 Coconut 0.99 2003-03-03 1.3500 NULL

    3 Eggplant 22.44 2005-05-05 NULL 2005-05-06

    6 Lychee 0.29 2012-12-12 1000.5500 2012-12-14

    (3 row(s) affected)

  • I came across the same technique here http://davidchuprogramming.blogspot.com.au/2008/07/comparing-two-tables-by-sql-stored.html

    and David went the extra mile, turned it into an SP, allowed the user to select Fieldnames, OrderBy and a where Clause.

    There really is nothing new under the sun - his post was from July 2008

  • If you have the query...

    SELECT .[ID]

    , .[Item]

    , .[Price]

    , .[OrderDate]

    , .[Units]

    , .[ShipmentDate]

    FROM [#Source]

    EXCEPT

    SELECT [T].[ID]

    , [T].[Item]

    , [T].[Price]

    , [T].[OrderDate]

    , [T].[Units]

    , [T].[ShipmentDate]

    FROM [#Target] [T];

    And are thinking about speeding up the query as suggested by darrenwhite by adding the INNER JOINS as shown below...

    SELECT .[ID]

    , .[Item]

    , .[Price]

    , .[OrderDate]

    , .[Units]

    , .[ShipmentDate]

    FROM [#Source]

    INNER JOIN [#Target] [T]

    ON .[ID] = [T].[ID]

    EXCEPT

    SELECT [T].[ID]

    , [T].[Item]

    , [T].[Price]

    , [T].[OrderDate]

    , [T].[Units]

    , [T].[ShipmentDate]

    FROM [#Target] [T]

    INNER JOIN [#Source]

    ON [T].[ID] = .[ID];

    Just be aware... These two queries are not the same.

    They can return different result sets depending on the data.

    Let's take the example given in the article.

    When the query above is ran without the inner joins we get the following result set.

    IDItemPriceOrderDateUnitsShipmentDate

    2Coconut0.992003-03-031.3500NULL

    3Eggplant22.442005-05-05NULL2005-05-06

    6Lychee0.292012-12-121000.55002012-12-14

    But make the following changes.

    INSERT INTO [#Source]

    ( [Item], [Price], [OrderDate], [Units], [ShipmentDate] )

    VALUES ( 'Prune', 3.49, '1/10/2001', 2, '1/12/2001' )

    DELETE FROM [#Target] WHERE [Item] = 'Fig'

    When the procedure without the joins is ran, the following results are returned.

    IDItemPriceOrderDateUnitsShipmentDate

    2Coconut0.992003-03-031.3500NULL

    3Eggplant22.442005-05-05NULL2005-05-06

    4Fig0.492006-06-06NULL2006-06-07

    6Lychee0.292012-12-121000.55002012-12-14

    7Prune3.492001-01-102.00002001-01-12

    When the procedure with the joins is ran, the results below are returned.

    IDItemPriceOrderDateUnitsShipmentDate

    2Coconut0.992003-03-031.3500NULL

    3Eggplant22.442005-05-05NULL2005-05-06

    6Lychee0.292012-12-121000.55002012-12-14

    This difference shown above is the purpose of the article.

    It was just is not demonstrated in the example.

  • Hi again. I hope my posts are not invisible. But everyone just keeps ignoring it. Are there no experts here that could answer this question ???

    Again, someone already asked this question earlier, but it was ignored. So, I'll ask again. How does this method compares to the use of a MERGE command ? The example in this article describes the method of finding the changes in the matching records. The next step would be to APPLY those changes to the target table. Then, we would have to run similar statements using EXCEPT to find new records in the Source table and then INSERT them into the Target table. Then another run with the EXCEPT to find DELETED records and apply that to a Target table. So, what one MERGE command does in one step, we would have to go through 6 different steps to achieve the same results. So, my question is, is it worth it ?

  • So, I'll ask again. How does this method compares to the use of a MERGE command

    I don't know that where I could use the MERGE I would use EXCEPT. I've used EXCEPT a couple of times on permanent queries. The first was to take a list of thousands of phone numbers and add only the new ones to a list containing millions of phone numbers. The insert performed so quickly that I saw no need to try to find a different way to do it. A merge would have worked, but it would have been more complicated to write. In the second case I simply wanted to know the row difference and export that difference to a flat file. A merge would have not suited this situation. I wouldn't use EXCEPT if I then wanted to imediately do row updates.

    Other than these uses, I've only used it to add transactional records in production to test to bring the test database up to date.

  • @victordba, the use of EXCEPT is a valid tool for investigative purposes. Running a MERGE is essentially an upsert, which you don't want if you are just comparing data.

    As an example--and a real-world test for medium-size data--I ran a query that looked like this:

    with my2009 as (

    select id,cty,name,amt

    from mytable

    where somefactor = 2009

    ),

    my2008 as (

    select id,cty,name,amt

    from mytable

    where somefactor = 2008

    )

    select

    m9.id,

    m9.amt

    from

    my2009 m9

    join my2008 m8 on m8.id = m9.id

    except

    select

    m8.id,

    m8.amt

    from

    my2008 m8

    join my2009 m9 on m9.id = m8.id

    Each CTE pulls from the same base table. The EXCEPT clause performs a comparison between them, telling me which people (id=person) earned a different amount between 2009 and 2008.

    The sets for 2009 and 2008 are approximately 25000 each.

    The common table holds approximately 160000 records.

    I used the server on our test site, which runs much faster than my local server. (Ipso facto, performance is most likely average.)

    With no indexes, this query took seven minutes (actually 7:08).

    Now, this seems like a very long way to get monetary information, which would probably be quicker with a UNION or just a plain diff on two text files.

    Our department's situation, however, demands a lot of analysis, so we are darn good with text diffs and pulling bits of information out of some large tables. One of those tables grows about 5 million records each month, so analysis often becomes tedious, if not nearly impossible. The query I included above is a VERY simple representation of what we actually run.

    ***

    I found this article very interesting, because I have run into several situations where I need to know what changed in some arbitrary field in a big table between months or years. When I pull down a million records--twice!--into text files, I am often overwhelmed by just the differences a text comparison creates. Using Excel sometimes works. Beyond Compare is a fine tool as well, but....

    My largest challenges come from the early stages in analysis where I have yet to nail down my focus. This article just gave me another tool that I will definitely use.

    Kudos! 🙂

  • Thank you, @james.jensen1350 for your reply.

    I've been using MERGE for many years to do the upsert. And my source and target tables are large, with millions of records. So, MERGE takes it's sweet time to compare large tables to find the differences between those tables and to do upsert. But MERGE is not perfect, and you have to be very careful with it and also not to forget to take case of NULLs in every field. So, I am always on the lookout for some interesting and more efficient ways to do my upserts. And this article got my attention. I did some quick test comparisons between MERGE and EXCEPT to find changes between 2 tables with about 10 million records in each, and EXCEPT took about 35 seconds, while MERGE about 7 minutes. But I just ran EXCEPT once, to find changes in matching records. I would have to do a full test to also find new and missing records, so I would have to run EXCEPT 2 more times, and also run INSERT, UPDATE and DELETE commands when I find those differences. So, more results to follow. I was just wondering if anyone else tried some other ways to do UPSERTs besides MERGE command on large tables.

  • David McKinney (3/25/2014)


    you can use UNION also to do something similar. In this case, it returns both rows when there's a difference.

    SELECT*

    FROM(SELECT*

    ,count(1) OVER (PARTITION BY ID) AS versions

    FROM(SELECT*

    FROM#Source AS ss

    UNION

    SELECT*

    FROM#Target AS ss

    ) x

    ) y

    WHEREversions <> 1

    Now that is very slick! Thank you.

Viewing 15 posts - 31 through 45 (of 72 total)

You must be logged in to reply to this topic. Login to reply