Home Forums SQL Server 2008 T-SQL (SS2K8) De-duplicating Customer Detail and assigning parent child structure to multiple customer records RE: De-duplicating Customer Detail and assigning parent child structure to multiple customer records

  • Hi Matt

    There's a lot of work involved in getting it just right and how you do it depends on what it's for - if you're doing a one-off process then it would be different to a process running every half hour or overnight.

    A really good start is to inner join the table to itself on postcode, address line 1, surname and the first character of the forename, keeping the ID on the LHS. The capture rate is very good and the false positives are low. Shove the lot into a temp table and reprocess there because it's usually quite expensive. You could, for instance, add more dupes to the temp table based on matches using other criteria.

    At some point you mark sets of potentially-duplicated rows as "dupesets".

    After this you can fine-tune and eliminate poor/ambiguous matches.

    Finally, mark the principal of each dupeset.

    Which part of this overall process are you having difficulty with?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden