Eliminating Duplicate Rows Using the PARTITION BY Clause

  • We have a fairly standard real-life example where this would be useful and not an indication of poor table design. We get daily data feeds from third-party vendors that conduct customer service surveys for us. Our business rule is that we only contact customers once every 90 days. Due to the distributed nature of the dialing/survey process, timing delays on the data transfer to/from our vendors, and plain old user error, sometimes a customer will be contacted more than once, either on the same day or within the 90-day window, and we need to remove the "duplicate" survey from our system and notify the call center that they violated this rule so they can remove the record and reset their quotas.

    We load the data into a staging table and evaluate it for duplicate records based on customer ID - which is different from our primary key (survey ID). Customer ID cannot be the primary key because customers can have multiple surveys, just not within 90 days of each other. We don't use the exact code posted in the article since we can identify unique rows via the primary key. When a "duplicate" is found in the staging data, the offending row is copied to an audit table and removed from the staging table (roughly along the lines of the sketch at the end of this post). Our loading process then sends out the necessary communications to the vendor and continues on with loading the cleansed data.

    In a perfect world our vendors would always follow our business rules - but you have to code for reality.
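
    A minimal sketch of that staging-table step, assuming made-up names (Survey_Staging, Survey_Audit, survey_id, customer_id, survey_date) since the actual schema isn't shown: rank each customer's surveys in the batch with ROW_NUMBER(), then move everything after the first survey to the audit table and delete it from staging in one statement.

    -- Hypothetical staging/audit tables; not the poster's real schema.
    ;WITH ranked AS
    (
        SELECT survey_id, customer_id, survey_date,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY survey_date, survey_id) AS rn
        FROM dbo.Survey_Staging
    )
    DELETE FROM ranked
    OUTPUT deleted.survey_id, deleted.customer_id, deleted.survey_date
        INTO dbo.Survey_Audit (survey_id, customer_id, survey_date)
    WHERE rn > 1;  -- every survey after a customer's first in this batch

    This only catches repeats within a single feed; checking a new survey against ones already loaded in the previous 90 days would be a separate comparison against the production table.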

  • thisisfutile (9/22/2010)


    @trubolotta Not trying to start an argument here, but hindsight is 20/20 for everyone. Sometimes we find ourselves in a situation like the OP is describing (or something similar) and we need a solution.

    No mal-intentions assumed, but the idea that this "code" is efficient is ludicrous because it suggests it must be run periodically. So what happens to all the bad results in between? Don't worry about those erroneous reports, we are going to run our very efficient "clean-up" code shortly? Do you run this with each data entry?

    You are correct, it is a problem to deal with, but it is not a fix, nor is it viable for a production scenario. My own experience indicates that data cleansing is a highly variable problem and requires ingenuity and coding expertise, but it is also a one-shot deal.

  • trubolotta (9/22/2010)


    thisisfutile (9/22/2010)


    @trubolotta Not trying to start an argument here, but hindsight is 20/20 for everyone. Sometimes we find ourselves in a situation like the OP is describing (or something similar) and we need a solution.

    No mal-intentions assumed, but the idea that this "code" is efficient is ludicrous because it suggests it must be run periodically. So what happens to all the bad results in between? Don't worry about those erroneous reports, we are going to run our very efficient "clean-up" code shortly? Do you run this with each data entry?

    You are correct, it is a problem to deal with, but it is not a fix, nor is it viable for a production scenario. My own experience indicates that data cleansing is a highly variable problem and requires ingenuity and coding expertise, but it is also a one-shot deal.

    Not everyone who frequents SSC is a DBA or has the ability to change table structures. Some of us just need to be able to quickly remove dupes from result sets for reporting, and DISTINCT cannot always be applied across all of the columns involved. This is an easy solution that does the job well - see the sketch below.


  • trubolotta (9/22/2010)


    This appears to be one of those articles that creates a problem based on poor design and purports to correct it using some bloated functionality of SQL Server. If the table were properly designed with uniqueness constraints, the problem would not exist. Allowing duplicate data into the table in the first place is the problem, not fixing it after the fact. The more likely scenario and the one I have seen most often comes from importing data from poorly designed databases or poorly trained users.

    It seems to me we have entered a phase where SQL Server functions are in search of problems to justify their cost and footprint, no matter how contrived those problems may be.

    This is not a solution looking for a problem. Almost every OLTP database I see has lousy performance, dirty data, duplicate rows, poor indexing, inefficient queries, cursors, WITH NOLOCK hints on everything, thin-to-no security, no integrity checks, and an untested ability to restore the system.

    Often they were created by developers who "know a little T-SQL", or by a DBA constantly apologising that this was his first project when he joined the company and he's learnt a lot since then - or they were really short of time and he just didn't get around to fixing it.

    Also, I've never done a data warehouse or BI project where the cube didn't highlight a drastic need to fix up the data they thought was perfect.

    So ... it is possible that the only companies that ask me to help with their database are those who know they have a problem and don't have a team of skilled DBAs. So the sample I see is likely to be skewed.

    Or it is possible that you've only worked on well-funded projects that benefit from your high level of DBA skills.

    Either way, trust me. There is a huge line of people with no idea how to fix their data that will benefit from this thread.

  • trubolotta (9/22/2010)

    My own experience indicates that data cleansing is a highly variable problem and requires ingenuity and coding expertise, but it is also a one-shot deal.

    Roger that. In my case, just yesterday I created a process to upload new pricing information from one of our vendors. For whatever reason, they usually have duplicated part numbers on the spreadsheet (fortunately with the same pricing information for each) and I've included a step to delete these dupes. I haven't had to deal with dupes in so long that I found myself looking for "new and improved" ways to handle the problem. Thought it was very coincidental that in my inbox this morning I got a SQLServerCentral mail discussing the very topic.

    While my solution is a "one-time" fix, it must be done every time I work with this vendor's spreadsheets (about once per month) - something like the sketch below.
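
    A rough sketch of that clean-up step using the article's ROW_NUMBER()/PARTITION BY approach; the staging table and column names (Vendor_Pricing_Staging, part_number, price) are placeholders, not the actual import.

    -- Hypothetical staging table loaded from the vendor spreadsheet.
    -- The duplicated part numbers carry identical pricing, so any one
    -- row per part number can be kept.
    ;WITH ranked AS
    (
        SELECT part_number, price,
               ROW_NUMBER() OVER (PARTITION BY part_number
                                  ORDER BY part_number) AS rn
        FROM dbo.Vendor_Pricing_Staging
    )
    DELETE FROM ranked
    WHERE rn > 1;  -- drop every copy after the first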

  • This is a solution that fits some problem domains - ones that might have poor database design or other issues, but that is still a very large set of systems.

    Disregarding this because you think there is a more fundamental problem is ignoring the reality in many situations. Yes, I'd love to have someone redesign things and make them better, but you can't do that.

    There are things in the db design here at SSC that I'd like changed, or I think are poor design. However I just don't have the resources to deal with that. As a result, I need to find workarounds until such time as I can redesign them.

  • Nice content. Thanks. I agree, this method is not a fix-all or a substitute for poor design. However, given certain scenarios, it is very useful and a concise way to eliminate duplicates. We don't all have the luxury of inheriting well-architected database structures or being involved in the design process.

    Well done!

  • From some of the replies to my post, I'm seeing that old schism of "code guru" and "design guru" is alive and well. I have never denied duplicates are a problem and require a solution, but that schism gets in the way of offering clients the best solution. If a "code guru" can't see the design problem or doesn't point it out to the designer, or the "design guru" can't see the code burden or inefficiency caused by his design, neither is doing his client or employer a service. What can I sell the client appears to drive the answer, and that is where I have a problem.

    The article is a coding band-aid that indeed may be useful for data cleansing or conditioning "before" the data is placed into production, as one poster illustrated with a real-life "staging tables" example. Not once, however, does the article suggest the table design is fundamentally flawed or present a realistic scenario that may require such a solution. Duplicates were being removed during the cleansing process quite handily long before the featured SQL Server functions came into existence.

    As to the projects I work on, they are not always well funded but they certainly are well investigated and well thought out before a single table is created or a single line of SQL is written. And no, I do not have "code gurus" or "design gurus" at my disposal. What I do have are people motivated to do the best job for our clients, even if it means not getting a project because a better solution lies elsewhere. Clients don't forget that.

  • Thanks for the article.

    I've used this method a lot since SQL 2005 came out and have found it particularly useful when a) the data comes from a third party and is part of a larger data cleansing that I do - usually data dumps like the earlier poster mentioned about surveys; or b) third-party databases are poorly designed without proper constraints and disasters occur where duplicates somehow get inserted (this is much more rare).


  • This is a fantastic post, and as a newer DBA, it really clarified exactly how the ROW_NUMBER/PARTITION BY function works. The solution is a perfect fit for me on a project that I am currently working on, where we have data with duplicates coming in from a third party that needs to be entered into the database. I will be using this function to clean the data. Thanks!

  • Suresh,

    I'm a developer, not a DBA; still, I've been working with RDBMSs since 1984 (IBM DB/2 v1), and I can tell you that yours is not a 'solution looking for a problem', but one very much for the 'real world'. Dirty data will always be with us, and we often have little or no control over those providing it - so deal with it we must!

    It is, then, with much gratitude that fellows like me read and use articles by fellows like you. Well done, sir.

  • Joe,

    I didn't know that technique - here is your code with the correct columns demonstrating the efficiency.

    -- List all entries regardless of frequency
    SELECT *, 1 AS grp_size FROM Emp_Details
    INTERSECT
    SELECT emp_name, company, join_date, resigned_date,
           ROW_NUMBER()
               OVER (PARTITION BY emp_name, company, join_date, resigned_date
                     ORDER BY emp_name, company, join_date, resigned_date) AS grp_size
    FROM Emp_Details;

    -- List all with 2 or more entries
    SELECT *, 2 AS grp_size FROM Emp_Details
    INTERSECT
    SELECT emp_name, company, join_date, resigned_date,
           ROW_NUMBER()
               OVER (PARTITION BY emp_name, company, join_date, resigned_date
                     ORDER BY emp_name, company, join_date, resigned_date) AS grp_size
    FROM Emp_Details;

    -- List all with 3 or more entries
    SELECT *, 3 AS grp_size FROM Emp_Details
    INTERSECT
    SELECT emp_name, company, join_date, resigned_date,
           ROW_NUMBER()
               OVER (PARTITION BY emp_name, company, join_date, resigned_date
                     ORDER BY emp_name, company, join_date, resigned_date) AS grp_size
    FROM Emp_Details;

    Nice,

    Doug

  • If you work for a bank and you wait until after the import to delete duplicates, you are dead already.

    If you work for a car wash, it is OK.

    The ROW_NUMBER() function needs an ORDER BY.

    Never use "Not in" on large table. You are killing yourself or the server.

    http://usa.redirectme.net/repriser/data/WriteTSQLPerformace.html

    There is one more method people have not brought up in this long discussion: SSMS Edit Rows, right-click, Delete, one row at a time. (I am just kidding.) My point is that some people chew up words and miss the point of the article. Since there are only five rows to delete, you can make this joke.

    Can we all stop the endless design vs. implementation, developer vs. DBA debate now? The author put in a lot of work to write a nice article.
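
    On the NOT IN warning above, a quick sketch using the article's Emp_Details table plus a hypothetical Emp_Archive table - NOT EXISTS is usually the safer anti-join, and it isn't tripped up by NULLs in the subquery the way NOT IN is (actual performance still depends on indexes and the optimizer).

    -- Emp_Archive is made up for the example; Emp_Details is from the article.

    -- Risky form: on a large table this can be slow, and if Emp_Archive.Emp_Name
    -- ever contains a NULL the query returns no rows at all.
    SELECT d.Emp_Name, d.Company
    FROM Emp_Details AS d
    WHERE d.Emp_Name NOT IN (SELECT a.Emp_Name FROM Emp_Archive AS a);

    -- Safer form: an anti-join with NOT EXISTS.
    SELECT d.Emp_Name, d.Company
    FROM Emp_Details AS d
    WHERE NOT EXISTS (SELECT 1
                      FROM Emp_Archive AS a
                      WHERE a.Emp_Name = d.Emp_Name);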

  • Nice article :D

    What a coincidence - two days back I tried the same method.

    I was not aware this method of removing duplicates is accepted by a lot of others as well :D

  • The code in the article works, but is it enough to leave it at that? If you are stuck with a production database that allows duplicates in tables, you have a problem and should ask how frequently you need to purge duplicates. Also ask what prevents the data from being corrupted seconds after the purge and then used to make a critical business decision.

    This is the sample table created by the author, renamed only to illustrate a point:

    create table Emp_Details_Raw
    (Emp_Name varchar(10)
    ,Company varchar(15)
    ,Join_Date datetime
    ,Resigned_Date datetime
    )

    Of course, the table should look more like this to provide unique rows; again, the table name was chosen to illustrate a point:

    CREATE TABLE [dbo].[Emp_Details_Unique](
        [Emp_Name] [varchar](10) NOT NULL,
        [Company] [varchar](15) NOT NULL,
        [Join_Date] [datetime] NOT NULL,
        [Resigned_Date] [datetime] NOT NULL
    ) ON [PRIMARY]
    GO

    CREATE UNIQUE NONCLUSTERED INDEX [IX_Emp_Details_Unique] ON [dbo].[Emp_Details_Unique]
    (
        [Emp_Name] ASC,
        [Company] ASC,
        [Join_Date] ASC,
        [Resigned_Date] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    Using the author's code to populate the table Emp_Details_Raw pops the duplicates in without any problem because of the missing constraint. This may be similar to the problem you face with third party data in whatever form you receive it:

    insert into Emp_Details_Raw (Emp_Name, Company, Join_Date, Resigned_Date)
    values ('John', 'Software', '20060101', '20061231')
          ,('John', 'Software', '20060101', '20061231')
          ,('John', 'Software', '20060101', '20061231')
          ,('John', 'SuperSoft', '20070101', '20071231')
          ,('John', 'UltraSoft', '20070201', '20080131')
          ,('John', 'ImproSoft', '20080201', '20081231')
          ,('John', 'ImproSoft', '20080201', '20081231')
          ,('Mary', 'Software', '20060101', '20081231')
          ,('Mary', 'SuperSoft', '20090101', '20090531')
          ,('Mary', 'SuperSoft', '20090101', '20090531')
          ,('Mary', 'UltraSoft', '20090601', '20100531')
          ,('Mary', 'UltraSoft', '20090601', '20100531')

    I have two choices. First, I can delete duplicates from the Emp_Details_Raw table, but wouldn't it be a good idea to back that data up first? I would just for accountability. Once the duplicates are deleted, I can insert the data into my production table.

    My alternative is simply not to insert duplicate records in the production database in the first place:

    INSERT dbo.Emp_Details_Unique
        (Emp_Name, Company, Join_Date, Resigned_Date)
    SELECT DISTINCT
        Emp_Name, Company, Join_Date, Resigned_Date
    FROM dbo.Emp_Details_Raw

    Advantages include less coding, preservation of the original data without creating a duplicate data store, and assurance that the production table has unique rows at all times. But then again, my solution doesn't offer that delicious complexity some seem to relish, or use the whiz-bang new features of SQL Server. To each his own.
