﻿<?xml version='1.0' encoding='UTF-8'?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/"><channel><title>SQLServerCentral / SQL Server 2008 / T-SQL (SS2K8)  / De-duplicating Customer Detail and assigning parent child structure to multiple customer records / Latest Posts</title><generator>InstantForum.NET v2.9.0</generator><description>SQLServerCentral</description><link>http://www.sqlservercentral.com/Forums/</link><webMaster>notifications@sqlservercentral.com</webMaster><lastBuildDate>Tue, 21 May 2013 19:14:03 GMT</lastBuildDate><ttl>20</ttl><item><title>RE: De-duplicating Customer Detail and assigning parent child structure to multiple customer records</title><link>http://www.sqlservercentral.com/Forums/Topic955399-392-1.aspx</link><description>Have you looked at Melissa Data , SSA and other companies that specifies in cleaning up mailing lists and such like?</description><pubDate>Wed, 21 Jul 2010 10:43:11 GMT</pubDate><dc:creator>CELKO</dc:creator></item><item><title>RE: De-duplicating Customer Detail and assigning parent child structure to multiple customer records</title><link>http://www.sqlservercentral.com/Forums/Topic955399-392-1.aspx</link><description>Thanks very much for this. I will work through this to see if it does what i need. I have a feeling that this will only deal with one to one relationship rather than the one to many that is needed. It is difficult to post too much sample data though...I massively appreciate your time. We are working on a BIG update statement wrapped in a big while loop which may work...if i come up with a solution i will let you know!!ThanksMatt</description><pubDate>Tue, 20 Jul 2010 10:31:36 GMT</pubDate><dc:creator>Doylie</dc:creator></item><item><title>RE: De-duplicating Customer Detail and assigning parent child structure to multiple customer records</title><link>http://www.sqlservercentral.com/Forums/Topic955399-392-1.aspx</link><description>Have a play with this, Matt. Hopefully it will give you some ides.[code="sql"]IF OBJECT_ID( 'tempdb..#rawdata') IS NOT NULL DROP TABLE #rawdataIF OBJECT_ID( 'tempdb..#WorkingSet') IS NOT NULL DROP TABLE #WorkingSetCREATE TABLE #rawdata (dateid DATETIME, CustID INT, CCNumber CHAR(9), First_Name VARCHAR(10), Last_Name VARCHAR(10)) INSERT INTO #rawdata (dateid, CustID, CCNumber, First_Name, Last_Name) SELECT '20090503',  7320, '370090510', 'Jane', 'Smith' union SELECT '20090420', 25861, '370090510', 'Jane', 'Smith' union SELECT '20100404',  7320, '370080987', 'Jane', 'Smith' union SELECT '20090703', 25861, '370080987', 'Jane', 'Smith' union SELECT '20100603',  7320, '594220191', 'Jane', 'Smith' union SELECT '20100705', 25861, '594220191', 'Jane', 'Smith' union SELECT '20100719',  7320, '594220654', 'Jane', 'Smith' union SELECT '20100720',  7321, '594220654', 'Jane', 'Smith' SELECT 	DupesetID = CAST(NULL AS BIGINT),	r1.*, 	r2_CustID = r2.CustIDINTO #WorkingSet  FROM #rawdata r1INNER JOIN #rawdata r2 ON r2.CCNumber = r1.CCNumber AND r2.CustID &amp;lt;&amp;gt; r1.CustID ORDER BY r1.CustID-- assign dupesetIDUPDATE d SET DupesetID = s.DupesetIDFROM #WorkingSet d INNER JOIN (	SELECT DupesetID = ROW_NUMBER() OVER (ORDER BY Last_Name, CCNumber), 		Last_Name, CCNumber  	FROM #WorkingSet 	GROUP BY Last_Name, CCNumber) s ON s.Last_Name = d.Last_Name 	AND s.CCNumber = d.CCNumber-- merge sets with same customerID to earliest DupesetIDUPDATE d SET DupesetID = s.DupesetIDFROM #WorkingSet dINNER JOIN (	SELECT CustID, DupesetID = MIN(DupesetID), OLDDupesetID = MAX(DupesetID)	FROM #WorkingSet	GROUP BY CustID	HAVING COUNT(*) &amp;gt; 1 AND MIN(DupesetID) &amp;lt;&amp;gt; MAX(DupesetID)) s ON s.OLDDupesetID = d.DupesetID-- merge again if necessary-- assign principal, and number of rows in dupesetSELECT MasterCustomer = ROW_NUMBER() OVER (PARTITION BY d.DupesetID ORDER BY dateid DESC),	DupesetSize = COUNT(*) over (PARTITION BY d.DupesetID),	d.DupesetID, d.dateid, d.CustID, d.CCNumber, d.First_Name, d.Last_Name, d.r2_CustID  FROM #WorkingSet d ORDER BY DupesetID, MasterCustomer[/code]</description><pubDate>Tue, 20 Jul 2010 08:36:02 GMT</pubDate><dc:creator>ChrisM@Work</dc:creator></item><item><title>RE: De-duplicating Customer Detail and assigning parent child structure to multiple customer records</title><link>http://www.sqlservercentral.com/Forums/Topic955399-392-1.aspx</link><description>This will be potentially a nightly run that will update the customer details. This table will then have the various passes ran against it to identify changes in the relationships between the master and its children. This could mean customerID's being added, them being removed or changes in which is the primary. It is not the identification of the duplicates, or even the identification of the most recent it is when you try and identify all the children below the primary. If you run the below piece of code then you can see a simplified example (using a horrible correlated sub query) that you are going to be left with two pairs of interchanged ID's. this will leave us once a distinct is applied with this situation :Master   Child7320      258617321      7320how would you go about linking these customers and allowing for all different variations of this problem that will arise. Some customers have used over 19  different credit cards... at the moment we are looking at updates in a big loop but this is going to have to loop through EVERY customer for EVERY credit card which is going to be a massive hit everyday. I was interested to see if anyone had any better suggestions or solutions to a similar problem. ThanksMatt[code="sql"]with  rawdata as(select      '20090503' as dateid            ,7320 as CustID            ,'370090510' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Nameunion select      '20090420' as dateid            ,25861 as CustID            ,'370090510' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Name            union select      '20100404' as dateid            ,7320 as CustID            ,'370080987' as CCNumber	        ,'Jane' as First_Name            ,'Smith' as Last_Nameunion select      '20090703' as dateid            ,25861 as CustID            ,'370080987' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Nameunion select      '20100603' as dateid            ,7320 as CustID            ,'594220191' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Nameunion select      '20100705' as dateid            ,25861 as CustID            ,'594220191' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Nameunion select      '20100719' as dateid            ,7320 as CustID            ,'594220654' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Nameunion select      '20100720' as dateid            ,7321 as CustID            ,'594220654' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Nameorder by custid),rawdata2 as(select      custid,MAX(dateid)  as maxdateidfrom  rawdatagroup by custid)SELECT distinct      CASE WHEN CustDate1 &amp;gt; CustDate2 THEN CustID1 ELSE CustID2 END  AS [GrandMasterCustomer]      ,CASE WHEN CustDate1 &amp;gt; CustDate2 THEN CustDate1 ELSE CustDate2 END  AS [GrandMasterDate]      ,CASE WHEN CustDate1 &amp;gt; CustDate2 THEN CustID2 ELSE CustID1 END  AS [SubordinateCustomer]FROM(select      r1.CustID as CustID1      , (SELECT MAX(d1.dateid) FROM rawdata AS d1 WHERE d1.CustID = r1.CustID) AS CustDate1      ,r2.CustID as CustID2      , (SELECT MAX(d2.dateid) FROM rawdata AS d2 WHERE d2.CustID = r2.CustID) AS CustDate2            from  rawdata r1            join  rawdata r2            on          r1.CCNumber = r2.CCNumber            where r1.CustID&amp;lt;&amp;gt;r2.CustID) AS a[/code]</description><pubDate>Tue, 20 Jul 2010 07:40:04 GMT</pubDate><dc:creator>Doylie</dc:creator></item><item><title>RE: De-duplicating Customer Detail and assigning parent child structure to multiple customer records</title><link>http://www.sqlservercentral.com/Forums/Topic955399-392-1.aspx</link><description>Hi MattThere's a lot of work involved in getting it just right and how you do it depends on what it's for - if you're doing a one-off process then it would be different to a process running every half hour or overnight.A really good start is to inner join the table to itself on postcode, address line 1, surname and the first character of the forename, keeping the ID on the LHS. The capture rate is very good and the false positives are low. Shove the lot into a temp table and reprocess there because it's usually quite expensive. You could, for instance, add more dupes to the temp table based on matches using other criteria.At some point you mark sets of potentially-duplicated rows as "dupesets". After this you can fine-tune and eliminate poor/ambiguous matches.Finally, mark the principal of each dupeset.Which part of this overall process are you having difficulty with?</description><pubDate>Tue, 20 Jul 2010 05:19:03 GMT</pubDate><dc:creator>ChrisM@Work</dc:creator></item><item><title>RE: De-duplicating Customer Detail and assigning parent child structure to multiple customer records</title><link>http://www.sqlservercentral.com/Forums/Topic955399-392-1.aspx</link><description>Hi Chris, Thanks for taking the time to respond. I am indeed trying to dedupe a customer list so that i can have a master customer that is linked to all the other historical instances of that customer on our system. i know, seems like such a regular and simple concept...very difficult!! :crazy:The customer detail can vary through all the usual detail that is held. email, home address detail, first name and last name but you cannot guarantee that any of this information on its own is going to guarantee a match. i.e. just because the first name last name matches another record with a different customer id, doesn't mean its the same person. Therefore we are splitting it down into passes with each pass having a degree of confidence assigned to it. The sample data that I have included is credit card detail (not real detail obviously!!!) that has been concatenated together to make a unique identity field. This will be the same process used when comparing other data i.e. telephone number will be concatenated with first_name last_name to create a unique identity field that will then be used to match against all other records. Assume that this data all has the same name attached to it as this level of matching is the easy bit. (code below)on this pass I am looking initially at a section of around 1million customers and around the same number of transactions to search through to identify who has used which CC detail. We would then potentially be looking to open this out to the entire customer base of which i haven't got a number. If we can find a structure that work then we can worry about runtime, processing and optimization at a later date. Again, many thanks for your help. Let me know if there is anything else that you need.ThanksMatt[code="sql"]select      '20090503' as dateid            ,7320 as CustID            ,'370090510' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Nameunion select      '20090420' as dateid            ,25861 as CustID            ,'370090510' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Name            union select      '20100404' as dateid            ,7320 as CustID            ,'370080987' as CCNumber	        ,'Jane' as First_Name            ,'Smith' as Last_Nameunion select      '20090703' as dateid            ,25861 as CustID            ,'370080987' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Nameunion select      '20100603' as dateid            ,7320 as CustID            ,'594220191' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Nameunion select      '20100705' as dateid            ,25861 as CustID            ,'594220191' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Nameunion select      '20100719' as dateid            ,7320 as CustID            ,'594220654' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Nameunion select      '20100720' as dateid            ,7321 as CustID            ,'594220654' as CCNumber            ,'Jane' as First_Name            ,'Smith' as Last_Name[/code]</description><pubDate>Tue, 20 Jul 2010 04:50:15 GMT</pubDate><dc:creator>Doylie</dc:creator></item><item><title>RE: De-duplicating Customer Detail and assigning parent child structure to multiple customer records</title><link>http://www.sqlservercentral.com/Forums/Topic955399-392-1.aspx</link><description>Hi MattYou're trying to dedupe a customer list, right? So you end up with a principal (or master) with an associated set of dupes?Deduping companies is extraordinarily difficult and complicated because of the number of different ways that the company type can be entered; deduping people is much easier - which type of customer are you referring to?What criteria do you think would be good for matching?Do you have separate forename &amp; surname (or company name), address1...addressn, postcode?If you want to use any of these for matching, can you include sample data please?How many rows are you dealing with?</description><pubDate>Tue, 20 Jul 2010 04:12:44 GMT</pubDate><dc:creator>ChrisM@Work</dc:creator></item><item><title>De-duplicating Customer Detail and assigning parent child structure to multiple customer records</title><link>http://www.sqlservercentral.com/Forums/Topic955399-392-1.aspx</link><description>Hi Guys, I have been tasked with de-duplicating data in the company by using a number of different passes that combine and compare customer detail to create a unique identifier. This identifier can be used to match customers on common detail.i.e.telephone number, postcode, Credit card number (and a combination of all). I am then attempting to highlight a master using a date of order field and link this master to all of the child records that this customer has (identified by the search criteria)OK, I hope that makes sense. Here is the problem. I can identify the master customer but linking this Customer back to all of the other records in the parent child structure is proving very difficult as you are in essence trying to link customers have different detail. By the eye you can tell that they are the same customer, but how would you go about coding that!? I have tried for about a week to get this right, but with no joy so I am opening it  out to you super coding smart people...here is some code for sample data to get you started, any questions let me know and i will try and explain as much as possible..  MASSIVELY appreciate any help. ThanksMatt[code="sql"]with  rawdata as(select      '20090503' as dateid            ,7320 as CustID            ,'370090510' as CCNumberunion select      '20090420' as dateid            ,25861 as CustID            ,'370090510' as CCNumberunion select      '20100404' as dateid            ,7320 as CustID            ,'370080987' as CCNumberunion select      '20090703' as dateid            ,25861 as CustID            ,'370080987' as CCNumberunion select      '20100603' as dateid            ,7320 as CustID            ,'594220191' as CCNumberunion select      '20100705' as dateid            ,25861 as CustID            ,'594220191' as CCNumberunion select      '20100719' as dateid            ,7320 as CustID            ,'594220654' as CCNumberunion select      '20100720' as dateid            ,7321 as CustID            ,'594220654' as CCNumber)[/code]  </description><pubDate>Tue, 20 Jul 2010 03:48:54 GMT</pubDate><dc:creator>Doylie</dc:creator></item></channel></rss>