Removing Duplicates

  • curious_sqldba (10/10/2013)


    Very nice article and good explanation. Can you please add your thoughts on performance? Is this the recommended way to do it?

    I find the performance to be quite good; I have this code, or similar code, in production in a number of places and I've never had issues with speed. I haven't had the opportunity to compare its performance to a CTE (there shouldn't be much difference at all) or a cursor (where there could be a large difference with a large number of records). I don't know if it is THE recommended way to do it, but it is the way I would recommend doing it.
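    For anyone who wants to run that comparison themselves, here is a minimal sketch of a timing harness. The dedupe statements are placeholders; fill them in with whichever variants you're testing, ideally against a restored copy of the same data for each run:

    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;

    -- run candidate 1 here (e.g. the article's DELETE-based approach)
    -- run candidate 2 here (e.g. a ROW_NUMBER() CTE, or a cursor)

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;
    -- compare the elapsed-time and logical-reads figures in the Messages tab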

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Nice introductory article, Stefan. Good job!

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • I tried this solution on a table with over 7 million rows (around 100,000 duplicates of each). The query ran for around 57 minutes, then my server crashed and the database I was running it against went into recovery mode... (I needed to reboot the server to make it available again :-)).

    So the function is great for smaller tables, but it wasn't a real solution in my case for a table with 7+ million rows. (In the end I had a total of 1,654 rows left, so yeah, a LOT of duplicates.) In my case I was lucky that I could run:

    SELECT DISTINCT
        CreateStatement,
        dbo.split(dbo.split(CreateStatement, '[', 2), ']') AS Tablename
    FROM CB_I_Drop_Create_Indexes;

    The above query took around 4 minutes to complete the same job. 🙂 (The table name for each index was embedded in the CreateStatement column.) Hope someone else finds it useful.

  • prennings (10/15/2013)


    I tried this solution on a table with over 7 million rows (around 100,000 duplicates of each). The query ran for around 57 minutes, then my server crashed and the database I was running it against went into recovery mode... (I needed to reboot the server to make it available again :-)).

    So the function is great for smaller tables, but it wasn't a real solution in my case for a table with 7+ million rows. (In the end I had a total of 1,654 rows left, so yeah, a LOT of duplicates.) In my case I was lucky that I could run:

    SELECT DISTINCT
        CreateStatement,
        dbo.split(dbo.split(CreateStatement, '[', 2), ']') AS Tablename
    FROM CB_I_Drop_Create_Indexes;

    The above query took around 4 minutes to complete the same job. 🙂 (The table name for each index was embedded in the CreateStatement column.) Hope someone else finds it useful.

    I think in that case you might even be better off just copying the good rows to a new table, dropping the old one and renaming it. How did you end up with a table that's that dirty?

    I wouldn't expect this query alone to crash your system or make the database unavailable. For that to happen, I'd think there would have to be some configuration problems with the server, and it would have to be a rather small server.
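    For anyone who wants to try that copy-and-rename approach on prennings's table, here's a minimal sketch (an outline rather than tested code; the _clean suffix is just a placeholder name):

    SELECT DISTINCT CreateStatement
    INTO dbo.CB_I_Drop_Create_Indexes_clean   -- new table receives only the deduped rows
    FROM dbo.CB_I_Drop_Create_Indexes;

    DROP TABLE dbo.CB_I_Drop_Create_Indexes;  -- drop the dirty original

    EXEC sp_rename 'dbo.CB_I_Drop_Create_Indexes_clean',
                   'CB_I_Drop_Create_Indexes'; -- swap the clean copy into place

    Remember that SELECT ... INTO doesn't carry over indexes, constraints, or permissions, so script those out beforehand.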

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • prennings (10/15/2013)


    I tried this solution on a table with over 7 million rows (around 100,000 duplicates of each). The query ran for around 57 minutes, then my server crashed and the database I was running it against went into recovery mode... (I needed to reboot the server to make it available again :-)).

    So the function is great for smaller tables, but it wasn't a real solution in my case for a table with 7+ million rows. (In the end I had a total of 1,654 rows left, so yeah, a LOT of duplicates.) In my case I was lucky that I could run:

    SELECT DISTINCT
        CreateStatement,
        dbo.split(dbo.split(CreateStatement, '[', 2), ']') AS Tablename
    FROM CB_I_Drop_Create_Indexes;

    The above query took around 4 minutes to complete the same job. 🙂 (The table name for each index was embedded in the CreateStatement column.) Hope someone else finds it useful.

    I would love to see the split function. Any chance of you posting it, please? Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
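
    (The split function was never posted in the thread. As a purely hypothetical sketch — not prennings's actual code — a scalar function matching the calls in the query above might return the Nth delimiter-separated token, something like this:)

    CREATE FUNCTION dbo.split
    (
        @input      NVARCHAR(MAX),
        @delimiter  NVARCHAR(10),
        @occurrence INT = 1          -- which token to return, 1-based
    )
    RETURNS NVARCHAR(MAX)
    AS
    BEGIN
        DECLARE @start INT = 1, @pos INT, @i INT = 1;

        -- skip past the first (@occurrence - 1) delimiters
        WHILE @i < @occurrence
        BEGIN
            SET @pos = CHARINDEX(@delimiter, @input, @start);
            IF @pos = 0 RETURN NULL;               -- fewer tokens than requested
            SET @start = @pos + LEN(@delimiter);
            SET @i += 1;
        END;

        -- return everything up to the next delimiter (or the end of the string)
        SET @pos = CHARINDEX(@delimiter, @input, @start);
        IF @pos = 0 SET @pos = LEN(@input) + 1;
        RETURN SUBSTRING(@input, @start, @pos - @start);
    END;

    (Note that T-SQL scalar functions can't omit arguments, so with this version the two-argument call in the posted query would have to be written as dbo.split(..., ']', DEFAULT); the real function may well have been different, e.g. a CLR one.)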

  • Great article, thanks! Will definitely use it.

  • http://deleteduplicaterecords.blogspot.in/

    You get all answers here

  • Here is another way to delete dups.

    WITH dup AS
    (
        -- ORDER BY (SELECT 0): no preference for which duplicate survives
        SELECT Policy,
               ROW_NUMBER() OVER (PARTITION BY agentnum, policy ORDER BY (SELECT 0)) AS RN
        FROM dbo.YourTable
    )
    DELETE FROM dup
    WHERE RN > 1;

  • jeknight44 (1/23/2015)


    Here is another way to delete dups.

    WITH dup AS
    (
        -- ORDER BY (SELECT 0): no preference for which duplicate survives
        SELECT Policy,
               ROW_NUMBER() OVER (PARTITION BY agentnum, policy ORDER BY (SELECT 0)) AS RN
        FROM dbo.YourTable
    )
    DELETE FROM dup
    WHERE RN > 1;

    This thread is about a question of the day (some sort of quiz) from 1.5 years ago.

    I think most people here thus already know how to remove duplicates.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Sometimes the easiest way is not the most elegant. When I have just one set of duplicates, I select the data, stash it in Notepad or Excel, delete all the duplicate records, and re-insert it with a paste using the 'Edit Top 200 Rows' feature in SSMS.

    I find this thread very helpful for when you have multiple duplicate records in a table that was poorly designed, without an ID or key constraint preventing duplicate records.

  • nicolas_johnson 52979 (1/23/2015)


    Sometimes the easiest way is not the most elegant. When I have just one set of duplicates, I select the data, stash it in Notepad or Excel, delete all the duplicate records, and re-insert it with a paste using the 'Edit Top 200 Rows' feature in SSMS.

    I find this thread very helpful for when you have multiple duplicate records in a table that was poorly designed, without an ID or key constraint preventing duplicate records.

    Sometimes the problem is that the data you receive doesn't have a distinct ID or you get the data twice. Your table can be perfect, but if the data coming in isn't, this will keep your system from crashing. And it is something you can automate, so you don't have to manually process any duplicates you have. When I set up my imports to dedupe this way, I rarely have to pay attention to duplicates after that.
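    For anyone wanting to automate the same thing, here's a rough sketch of a dedupe-on-import pattern (all of the table and column names below are made up for illustration; adapt them to your own schema):

    -- load the raw feed into a staging table first, then move over only
    -- one copy of each logical row that isn't already in the target
    INSERT INTO dbo.PolicyTarget (AgentNum, Policy, LoadDate)
    SELECT s.AgentNum, s.Policy, s.LoadDate
    FROM
    (
        SELECT AgentNum, Policy, LoadDate,
               ROW_NUMBER() OVER (PARTITION BY AgentNum, Policy
                                  ORDER BY LoadDate DESC) AS RN
        FROM dbo.PolicyStaging
    ) AS s
    WHERE s.RN = 1                         -- keep one copy per duplicate set
      AND NOT EXISTS                       -- skip rows already imported
          (SELECT 1
           FROM dbo.PolicyTarget AS t
           WHERE t.AgentNum = s.AgentNum
             AND t.Policy   = s.Policy);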

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Shankar Walvekar (1/23/2015)


    http://deleteduplicaterecords.blogspot.in/

    You get all answers here

    Why? Sioban's article has all the answers along with reasons why it works.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (1/23/2015)


    Shankar Walvekar (1/23/2015)


    http://deleteduplicaterecords.blogspot.in/

    You get all answers here

    Why? Sioban's article has all the answers along with reasons why it works.

    Thanks Jeff.

    I feel like I always see someone self-promoting like this in the comments. That's why I didn't bother to respond. Or click through. :-)

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Hi,

    I sometimes use a CTE to delete duplicates. It works fine.

    Paul

  • pmusiej (1/23/2015)


    Hi,

    I sometimes use a CTE to delete duplicates. It works fine.

    Paul

    Sure! That's one way you can do it. Want to post some code to share with people? Knowing multiple ways of doing things is nice, and having alternatives can help if one way runs into trouble.

    For example, sometimes CTEs can run long on large data sets.
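    When a single big DELETE like that runs long, one common workaround is to chunk it. Here's a rough sketch reusing jeknight44's CTE from above (the table and column names are still his placeholders, and the batch size is arbitrary):

    WHILE 1 = 1
    BEGIN
        WITH dup AS
        (
            SELECT ROW_NUMBER() OVER (PARTITION BY agentnum, policy
                                      ORDER BY (SELECT 0)) AS RN
            FROM dbo.YourTable
        )
        DELETE TOP (50000) FROM dup   -- small batches keep the log and locks manageable
        WHERE RN > 1;

        IF @@ROWCOUNT = 0 BREAK;      -- no duplicates left
    END;

    The trade-off is that the window function is recomputed on every pass, so this is simpler rather than faster; what it buys you is a small transaction per batch instead of one huge one.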

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams
