• We have a fairly standard real-life example where this would be useful and not an indication of poor table design. We get daily data feeds from third-party vendors that conduct customer service surveys for us. Our business rule is that we only contact customers once every 90 days. Due to the distributed nature of the dialing/survey process, timing delays on the data transfer to/from our vendors, and plain old user error, sometimes a customer will be contacted more than once, either on the same day or within the 90-day window, and we need to remove the "duplicate" survey from our system and notify the call center that they violated this rule so they can remove the record and reset their quotas.
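To make the 90-day rule concrete, here is a minimal sketch of how a violation might be detected in the staging data. The table and column names (dbo.SurveyStaging, dbo.Survey, SurveyID, CustomerID, SurveyDate) are assumptions for illustration, not our actual schema, and this is not the exact query we run:

    -- Flag staging rows that violate the 90-day rule, either against a survey
    -- already on file or against an earlier row in the same feed (same-day
    -- ties are broken arbitrarily on the survey ID so only one row survives).
    SELECT s.SurveyID, s.CustomerID, s.SurveyDate
    FROM dbo.SurveyStaging AS s
    WHERE EXISTS (SELECT 1
                  FROM dbo.Survey AS p
                  WHERE p.CustomerID = s.CustomerID
                    AND p.SurveyDate >  DATEADD(DAY, -90, s.SurveyDate)
                    AND p.SurveyDate <= s.SurveyDate)
       OR EXISTS (SELECT 1
                  FROM dbo.SurveyStaging AS e
                  WHERE e.CustomerID = s.CustomerID
                    AND e.SurveyDate >  DATEADD(DAY, -90, s.SurveyDate)
                    AND (e.SurveyDate < s.SurveyDate
                         OR (e.SurveyDate = s.SurveyDate AND e.SurveyID < s.SurveyID)));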

    We load the data into a staging table and evaluate it for duplicate records based on customer ID - which is different from our primary key (survey ID). Customer ID cannot be the primary key because customers can have multiple surveys, just not within 90 days of each other. We don't use the exact code posted in the article since we can identify unique rows via the primary key. When a "duplicate" is found in the staging data, the offending row is copied to an audit table and removed from the staging table. Our loading process then sends out the necessary communications to the vendor and continues on with loading the cleansed data.
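    In T-SQL, the audit copy and the removal can be done in a single statement. Again, a minimal sketch under the same assumed names, with dbo.SurveyAudit standing in for the audit table; our actual process differs, but the idea is the same:

        -- OUTPUT ... DELETED copies each offending row into the audit table as
        -- part of the same DELETE, so no row is lost between the two steps.
        DELETE s
        OUTPUT DELETED.SurveyID, DELETED.CustomerID, DELETED.SurveyDate
          INTO dbo.SurveyAudit (SurveyID, CustomerID, SurveyDate)
        FROM dbo.SurveyStaging AS s
        WHERE EXISTS (SELECT 1
                      FROM dbo.Survey AS p
                      WHERE p.CustomerID = s.CustomerID
                        AND p.SurveyDate >  DATEADD(DAY, -90, s.SurveyDate)
                        AND p.SurveyDate <= s.SurveyDate);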

    In a perfect world our vendors would always follow our business rules - but you have to code for reality.