Delete Duplicate values

  • Comments posted to this topic are about the item Delete Duplicate values

  • Abhijit,

    just give a short explanation of your topic, so it will be more clear to all.

    Cheers!

    Sandy.

    --

  • I agree that you need to write something to go with your script.

    And I believe that your script is slightly flawed, it looks like it will only delete the first duplicate record of each duplicate set and will leave all the rest intact (your test data only had 2 duplicates in any set, as far as I noticed, so it worked fine for that data set). I would suggest making the following minor change, which would keep the first record of each duplicate set and delete all others (the first one in theory anyway, SQL gives no guarantee on order of duplicate records, and how would you know anyway? 😉 ).

    DELETE FROM cteDV WHERE RID <> 1

    You could also saved a keystroke and said '> 1', but both will do the same thing in this case.

    For a quick demo using asterisk in the select statement of your CTE works but I would suggest spelling out the columns so that the code doesn't break should the source table structure change.

    Over all a nice little script for getting rid of annoying duplicate records, thank you for sharing it with us.

  • Kenney,

    I appreciate your suggestion. thank you :).

    Abhijit - http://abhijitmore.wordpress.com

  • Just to give physical representation of Kenny's suggestions...

    --===== Create a temporary demonstration table and populate it with duplicated info

    -- This is not part of the solution... it's just creating a demo.

    -- This DROP is just to support reruns for test purposes

    IF OBJECT_ID('tempdb..#DuplicateVals') IS NOT NULL

    DROP TABLE #DuplicateVals

    CREATE TABLE #DuplicateVals

    (

    Col1VARCHAR(3),

    Col2VARCHAR(3),

    Col3VARCHAR(3)

    )

    INSERT INTO #DuplicateVals

    (Col1,Col2,Col3)

    SELECT 'aa1','aa1','aa1' UNION ALL --More than 1 duplicate from here...

    SELECT 'aa1','aa1','aa1' UNION ALL

    SELECT 'aa1','aa1','aa1' UNION ALL

    SELECT 'aa1','aa1','aa1' UNION ALL

    SELECT 'aa1','aa1','aa1' UNION ALL

    SELECT 'aa1','aa1','aa1' UNION ALL --... to here

    SELECT 'aa2','aa2','aa2' UNION ALL

    SELECT 'aa2','aa2','aa2' UNION ALL

    SELECT 'aa3','aa3','aa3' UNION ALL --No duplicates. Will it survive?

    SELECT 'aa4','aa4','aa4' UNION ALL

    SELECT 'aa4','aa4','aa4' UNION ALL

    SELECT 'aa5','aa5','aa5' UNION ALL --No duplicates. Will it survive?

    SELECT 'aa6','aa6','aa6' UNION ALL --More than 1 duplicate from here...

    SELECT 'aa6','aa6','aa6' UNION ALL

    SELECT 'aa6','aa6','aa6' --... to here

    --===== Sanity check: Display the contents of the test table BEFORE the deletion

    SELECT * FROM #DuplicateVals

    --===== Delete duplicates without any temporal information present.

    -- Note that there isn't actually any reason to return any columns from the table in the CTE...

    -- Just the RowNum will suffice.

    ;WITH cteDV AS

    (

    SELECTROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY Col1, Col2, Col3 ) AS RowNum

    FROM#DuplicateVals

    )

    DELETE FROM cteDV WHERE RowNum > 1

    --===== Sanity check: Display the contents of the test table AFTER the deletion

    SELECT * FROM #DuplicateVals

    ... the details are in the code but I wanted to point out that you don't really need to return any of the table's columns in the CTE for this method to work... just the RowNum.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Great comments everyone.

    This code introduced me to the WITH statement. perhaps I had seen it before, but I apparently put it out of my mind.

    From what I've read, this common table expression acts like a view inside a query.

    I've never been able to get this straight, but since it acts like a view, is that why deleting items from the cteDV will delete items from the base temp table, or is this something specific to the WITH statement? :unsure:

    I got my information from the following site: http://www.valentina-db.com/dokuwiki/doku.php?id=paradigma:public:en:releases:2.5:statement_with_option

    Keith Wiggans

  • Very clever! I never thought to delete from a CTE!

  • Vert Cool~~

    But I still cannot understand why deleting items from the cteDV will delete items from the base temporary table, or is this something specific to the CTE statemen.

    So amazing and so confusing.

    Can anyone give me the answer? Many thanks

  • Cade.Bu (10/10/2008)


    Vert Cool~~

    But I still cannot understand why deleting items from the cteDV will delete items from the base temporary table, or is this something specific to the CTE statement.

    So amazing and so confusing.

    Can anyone give me the answer? Many thanks

    Cade.Bu,

    It's like kwiggans said, a CTE is like a View:

    kwiggans (10/10/2008)


    Great comments everyone.

    This code introduced me to the WITH statement. perhaps I had seen it before, but I apparently put it out of my mind.

    From what I've read, this common table expression acts like a view inside a query.

    I've never been able to get this straight, but since it acts like a view, is that why deleting items from the cteDV will delete items from the base temp table, or is this something specific to the WITH statement? :unsure:

    Since I'm different from most (or is that contrary? :hehe: ), I like to think of CTEs as derived tables that I can alias then refer to as many times in my query as I want simply by using the alias. Actually, an argument can be made that a View is simply a derived table that you can access with out defining it for each query, you only define it the one time. But I digress, back to the issue.

    So, a CTE works pretty much like a local view, and follows the same rules as a view in order for it to be updateable. Which means, an updateable CTE can be used just like an updateable view, you can perform Select, Insert or Delete queries on them.

    Kwiggans,

    The With statement is just the mechanism used to define the CTE. It's the fact that the CTE acts like a View that allows you to delete records through it.

    I hope I helped more than I befuddled,

    Kenney

    Alternate Method:

    This example can be done without the use of a CTE as well. All you have to do is to replace the CTE code and the delete statement with this piece of code that uses a derived table in a Delete statement to achieve the same goal:

    DELETE NumberedDups

    FROM (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY Col1, Col2, Col3 ) AS RowNum

    FROM #DuplicateVals) AS NumberedDups

    WHERE RowNum > 1

    Performance Note:

    An article I've read stated and showed that a CTE is faster than using a temp table, and I've found this to be true up to a point. The temp table is slower mostly due to the hard drive access need to create it in the first place, but this is just a onetime hit. The problem with the CTE is that SQL runs it each time you reference it in your query, so if your CTE takes 3 seconds to run and you use it 5 times in your query then you're looking at up to 15 seconds (or less due to caching) just for the CTE to be ran all 5 times and then you still have to add the time it takes for the rest of your query to run. But if you used a temp table then after the first hit of creating the temp table you start gaining time because it takes a lot less time to retrieve the records from a temp table than it does to reproduce the result set each time VIA a CTE (or any other query mechanism).

    What does this mean? It means that if your CTEs are slow you will want to think twice about using them if you need to reference them multiple times in the same query batch.

    Extra Study Idea:

    You might want to look into a cool feature of CTEs. They can refer to themselves from within their own With statement (this is a recursive CTE). Books on Line has a little piece on this functionality where they show you how to get a list of someone(s) supervisor and their supervisor and their supervisor, etc. I think it's well worth a look since it opens up a path to do complicated things in a less complicated way.

  • Hi Kenney,

    [font="Verdana"]I see..Thank you very much for your explanation.:)[/font]

    Cade

  • Outstanding explanation, Kenney.

    On the recursive thing... do keep in mind that recurssion is nothing more than a loop and has the same performance impedements as a loop. In other words, it IS a form of RBAR. There are many ways to avoid the use of any type of RBAR on many things. Even when it comes to Hierarchies, there are some really good ways to precalculate the hierarchy (nested sets) and then do the lookups using set based technologies instead of RBAR.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thank you Jeff.

    And thanks for pointing out the RBAR aspect of recursive queries. I'm still not very good at spotting hidden RBAR, and you are the master of that subject. 🙂 I remember reading an article by you about triangular joins being RBAR, it was very enlightening.

    Do you happen to have a posted example of a non-RBAR hierarchal query? I'd really like to see how that is done. I've been very impressed by your postings that I have read.

    Thanks again,

    Kenney

  • Kenny,

    Thanks for pointing out the performance implications of using a CTE versus a temp table multiple times in a query.

  • Kenney Hill (10/11/2008)


    Thank you Jeff.

    And thanks for pointing out the RBAR aspect of recursive queries. I'm still not very good at spotting hidden RBAR, and you are the master of that subject. 🙂 I remember reading an article by you about triangular joins being RBAR, it was very enlightening.

    Do you happen to have a posted example of a non-RBAR hierarchal query? I'd really like to see how that is done. I've been very impressed by your postings that I have read.

    Thanks again,

    Kenney

    You bet...

    http://www.ibase.ru/devinfo/DBMSTrees/sqltrees.html

    http://www.codeproject.com/KB/database/nestedsets.aspx

    What I end up doing is preserving both models on hierarchies that don't change much (most don't). People have an easier time maintaining and thinking about the adjacency model and the code runs better on the nested set model.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Abhijit,

    Great Article.

    Moreover, nicely explained and modified by Jeff.

    Really appriciate.

    🙂

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply