Eliminating Duplicate Rows using The PARTITION BY clause
Posted Wednesday, September 22, 2010 4:06 AM
Forum Newbie


Group: General Forum Members
Last Login: Tuesday, October 19, 2010 3:12 AM
Points: 3, Visits: 17
A minor quibble on the ROW_NUMBER() examples: the ORDER BY clause is pointless when it simply repeats the PARTITION BY expression list. The ORDER BY list controls the order within each partition, not the whole result set, so there is no point including any expression from the PARTITION BY list in the ORDER BY list: all rows in a partition share the same values for those expressions, so they contribute nothing to the ordering.

If you don't want to just remove the ORDER BY clause, you could adapt the examples to ORDER BY something that is guaranteed to be unique within a partition. For example, if Emp_Details has a surrogate key column Emp_ID which is unique, you could change your first example to:

select Emp_Name
      ,Company
      ,Join_Date
      ,Resigned_Date
      ,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date, Resigned_Date
                          order by Emp_ID) RowNumber
from Emp_Details
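
For completeness, a minimal sketch of how that RowNumber could then drive the delete, assuming Emp_ID really is unique (every row except the one with the lowest Emp_ID in each group is removed):

-- sketch only: keep the row with the lowest Emp_ID in each group, delete the rest
with Numbered as
(
    select Emp_ID
          ,ROW_NUMBER() over (partition by Emp_Name, Company, Join_Date, Resigned_Date
                              order by Emp_ID) as RowNumber
    from Emp_Details
)
delete from Numbered
where RowNumber > 1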
Post #990991
Posted Wednesday, September 22, 2010 5:02 AM
SSC Veteran


Group: General Forum Members
Last Login: Friday, September 7, 2012 5:25 AM
Points: 231, Visits: 61
The article's query can be re-written like this:
WITH CTE AS
(
    SELECT Emp_Name, Company, Join_Date, Resigned_Date,
           ROW_NUMBER() OVER (ORDER BY Emp_Name, Company, Join_Date, Resigned_Date) AS ROWNUM
    FROM #Emp_Details
)
DELETE FROM CTE
WHERE ROWNUM NOT IN (SELECT MAX(ROWNUM)
                     FROM CTE
                     GROUP BY Emp_Name, Company, Join_Date, Resigned_Date)
Post #991022
Posted Wednesday, September 22, 2010 6:53 AM
SSC Journeyman


Group: General Forum Members
Last Login: Monday, May 5, 2014 8:55 AM
Points: 86, Visits: 478
Good article.

Not sure that on larger datasets and/or a busier server one would want to DELETE/TRUNCATE records. The risk with DELETE/TRUNCATE is that you change your original content, so you are unable to recover easily should you discover a problem with the query (that never happens, right?). Also, DELETE will generate transaction log activity, which can be quite substantial on larger datasets.

Why not roll forward with a SELECT INTO statement instead? It preserves the original dataset in the process. Once the results are satisfactory, the old table can be swapped out without the users ever experiencing an outage.
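
Something along these lines, roughly (the _Clean/_Old names and the sp_rename swap are just my assumptions here):

-- rough sketch: build a de-duplicated copy and leave the original untouched
SELECT Emp_Name, Company, Join_Date, Resigned_Date
INTO Emp_Details_Clean
FROM Emp_Details
GROUP BY Emp_Name, Company, Join_Date, Resigned_Date;

-- once the results check out, swap the tables
-- (indexes, constraints and permissions still need to be re-created on the new table)
EXEC sp_rename 'Emp_Details', 'Emp_Details_Old';
EXEC sp_rename 'Emp_Details_Clean', 'Emp_Details';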

my 2 cents
Post #991099
Posted Wednesday, September 22, 2010 7:07 AM
Grasshopper


Group: General Forum Members
Last Login: Tuesday, April 24, 2012 5:33 AM
Points: 15, Visits: 120
This appears to be one of those articles that creates a problem based on poor design and purports to correct it using some bloated functionality of SQL Server. If the table were properly designed with uniqueness constraints, the problem would not exist. Allowing duplicate data into the table in the first place is the problem, not fixing it after the fact. The more likely scenario and the one I have seen most often comes from importing data from poorly designed databases or poorly trained users.
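
Something as simple as a uniqueness constraint would have stopped the duplicates at the door (constraint name is made up; columns taken from the article's example):

-- sketch: enforce uniqueness up front instead of cleaning up afterwards
ALTER TABLE Emp_Details
ADD CONSTRAINT UQ_Emp_Details UNIQUE (Emp_Name, Company, Join_Date, Resigned_Date);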

One problem I recall in particular involved importing about 7,000 addresses from a client where their data entry people had cleverly but clumsily forced data into the table to get around its uniqueness constraints. Apparently, upon receiving some message about "duplicate keys" that was totally cryptic to them, the data entry people simply changed a small piece of information to force the data into the table. For example, one address appeared multiple times by simply changing "Street" to "St.", "Str.", "St", "Str", "ST" and even "S t r e e t". That is the kind of duplication that is a problem, not the oversimplified example of this article.

It seems to me we have entered a phase of SQL Server functions in search of problems to justify their cost and footprint, no matter how contrived those problems may be.

Post #991117
Posted Wednesday, September 22, 2010 7:41 AM
SSC Journeyman


Group: General Forum Members
Last Login: Yesterday @ 8:16 AM
Points: 83, Visits: 743
@trubolotta Not trying to start an argument here, but hindsight is 20/20 for everyone. Sometimes we find ourselves in a situation like the OP is describing (or something similar) and we need a solution.
Post #991158
Posted Wednesday, September 22, 2010 7:51 AM
Old Hand


Group: General Forum Members
Last Login: Yesterday @ 6:54 AM
Points: 394, Visits: 1,605
We have a fairly standard real-life example where this would be useful and not an indication of poor table design. We get daily data feeds from third-party vendors that conduct customer service surveys for us. Our business rule is that we only contact customers once every 90 days. Due to the distributed nature of the dialing/survey process, timing delays on the data transfer to/from our vendors, and plain old user error, sometimes a customer will be contacted more than once, either on the same day or within the 90-day window, and we need to remove the "duplicate" survey from our system and notify the call center that they violated this rule so they can remove the record and reset their quotas.

We load the data into a staging table and evaluate it for duplicate records based on customer ID, which is different from our primary key (survey ID). Customer ID cannot be the primary key because customers can have multiple surveys, just not within 90 days of each other. We don't use the exact code posted in the article since we can identify unique rows via the primary key. When a "duplicate" is found in the staging data, the offending row is copied to an audit table and removed from the staging table. Our loading process then sends out the necessary communications to the vendor and continues on with loading the cleansed data.
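
Not our actual code, but a rough sketch of that staging-table step (the table and column names here are made up, and the check against previously loaded surveys inside the 90-day window is left out):

-- sketch: flag repeat contacts within the batch, audit them, then remove them from staging
WITH Dupes AS
(
    SELECT Survey_ID, Customer_ID, Survey_Date,
           ROW_NUMBER() OVER (PARTITION BY Customer_ID
                              ORDER BY Survey_Date, Survey_ID) AS rn
    FROM Staging_Surveys
)
INSERT INTO Audit_Duplicate_Surveys (Survey_ID, Customer_ID, Survey_Date)
SELECT Survey_ID, Customer_ID, Survey_Date
FROM Dupes
WHERE rn > 1;

DELETE s
FROM Staging_Surveys AS s
WHERE EXISTS (SELECT 1 FROM Audit_Duplicate_Surveys a WHERE a.Survey_ID = s.Survey_ID);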

In a perfect world our vendors would always follow our business rules - but you have to code for reality.



Post #991180
Posted Wednesday, September 22, 2010 7:51 AM
Grasshopper


Group: General Forum Members
Last Login: Tuesday, April 24, 2012 5:33 AM
Points: 15, Visits: 120
thisisfutile (9/22/2010)
@trubolotta Not trying to start an argument here, but hindsight is 20/20 for everyone. Sometimes we find ourselves in a situation like the OP is describing (or something similar) and we need a solution.


No mal-intentions assumed, but the idea that this "code" is efficient is ludicrous because it suggests it must be run periodically. So what happens to all the bad results in between? Don't worry about those erroneous reports, we are going to run our very efficient "clean-up" code shortly? Do you run this with each data entry?

You are correct, it is a problem to deal with, but it is not a fix, nor is it viable for a production scenario. My own experience indicates that data cleansing is a highly variable problem and requires ingenuity and coding expertise, but it is also a one-shot deal.
Post #991181
Posted Wednesday, September 22, 2010 8:00 AM


Ten Centuries


Group: General Forum Members
Last Login: Monday, October 20, 2014 11:55 AM
Points: 1,330, Visits: 19,306
trubolotta (9/22/2010)
thisisfutile (9/22/2010)
@trubolotta Not trying to start an argument here, but hindsight is 20/20 for everyone. Sometimes we find ourselves in a situation like the OP is describing (or something similar) and we need a solution.


No mal-intentions assumed, but the idea that this "code" is efficient is ludicrous because it suggests it must be run periodically. So what happens to all the bad results in between? Don't worry about those erroneous reports, we are going to run our very efficient "clean-up" code shortly? Do you run this with each data entry?

You are correct, it is a problem to deal with, but it is not a fix, nor is it viable for a production scenario. My own experience indicates that data cleansing is a highly variable problem and requires ingenuity and coding expertise, but it is also a one-shot deal.
Not everyone who frequents SSC is a DBA or has the ability to change table structures. Some of us just need to be able to quickly remove dupes from result sets for reporting, and DISTINCT can't always be applied across all of the columns involved. This is an easy solution that does the job well.
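
For example, something like this keeps one row per key in a report without touching the underlying data (table and column names are just placeholders):

-- sketch: one row per Customer_ID for a report, no DELETE needed
WITH Ranked AS
(
    SELECT Customer_ID, Order_Date, Order_Total,
           ROW_NUMBER() OVER (PARTITION BY Customer_ID
                              ORDER BY Order_Date DESC) AS rn
    FROM Orders
)
SELECT Customer_ID, Order_Date, Order_Total
FROM Ranked
WHERE rn = 1;  -- most recent row per customer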


---------------------------------------------------------
How best to post your question
How to post performance problems
Tally Table:What it is and how it replaces a loop

"stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."
Post #991188
Posted Wednesday, September 22, 2010 8:04 AM
SSC Rookie


Group: General Forum Members
Last Login: Tuesday, August 27, 2013 4:57 AM
Points: 28, Visits: 124
trubolotta (9/22/2010)
This appears to be one of those articles that creates a problem based on poor design and purports to correct it using some bloated functionality of SQL Server. If the table were properly designed with uniqueness constraints, the problem would not exist. Allowing duplicate data into the table in the first place is the problem, not fixing it after the fact. The more likely scenario and the one I have seen most often comes from importing data from poorly designed databases or poorly trained users.

It seems to me we have entered a phase of SQL Server functions in search of problems to justify their cost and footprint, no matter how contrived those problems may be.


This is not a solution looking for a problem. Almost every OLTP database I see has lousy performance, dirty data, duplicate rows, poor indexing, inefficient queries, cursors, WITH NOLOCK hints on everything, thin to no security, no integrity checks, and an ability to restore the system that has never been tested.

They are often created by developers who "know a little T-SQL", or by a DBA constantly apologising that this was his first project when he joined the company and he's learnt a lot since then, or that they were really short of time and he just didn't get around to fixing it.

Also, I've never done a data warehouse or BI project where the cube didn't highlight a drastic need to fix up the data they thought was perfect.

So... it is possible that the only companies that ask me to help with their database are those who know they have a problem and don't have a team of skilled DBAs, so the sample I see is likely to be skewed.

Or it is possible that you've only worked on well-funded projects that benefit from your high level of DBA skills.
Either way, trust me: there is a huge line of people with no idea how to fix their data who will benefit from this thread.


Post #991193
Posted Wednesday, September 22, 2010 8:08 AM
SSC Journeyman


Group: General Forum Members
Last Login: Yesterday @ 8:16 AM
Points: 83, Visits: 743
trubolotta (9/22/2010): My own experience indicates that data cleansing is a highly variable problem and requires ingenuity and coding expertise, but it is also a one-shot deal.


Roger that. In my case, just yesterday I created a process to upload new pricing information from one of our vendors. For whatever reason, they usually have duplicated part numbers on the spreadsheet (fortunately with the same pricing information for each), and I've included a step to delete these dupes. I haven't had to deal with dupes in so long that I found myself looking for "new and improved" ways to handle the problem. Thought it was very coincidental that in my inbox this morning I got a SQLServerCentral mail discussing the very topic.

While my solution is a "one-time" fix, it must be done every time I work with this vendor's spreadsheets (about once-per-month).
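
The dupe-removal step is nothing fancy; roughly this, against the staging table the spreadsheet lands in (the names are placeholders, and since the duplicated part numbers carry identical pricing it doesn't matter which copy survives):

-- sketch: drop duplicate part numbers from the pricing import, keeping one of each
WITH Numbered AS
(
    SELECT Part_Number,
           ROW_NUMBER() OVER (PARTITION BY Part_Number ORDER BY (SELECT NULL)) AS rn
    FROM Staging_VendorPricing
)
DELETE FROM Numbered
WHERE rn > 1;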
Post #991196