De-duplicating Customer Detail and assigning parent child structure to multiple customer records

  • Hi Guys,

    I have been tasked with de-duplicating data in the company by using a number of different passes that combine and compare customer detail to create a unique identifier. This identifier can be used to match customers on common detail.i.e.telephone number, postcode, Credit card number (and a combination of all).

    I am then attempting to highlight a master using a date of order field and link this master to all of the child records that this customer has (identified by the search criteria)

    OK, I hope that makes sense. Here is the problem.

    I can identify the master customer but linking this Customer back to all of the other records in the parent child structure is proving very difficult as you are in essence trying to link customers have different detail. By the eye you can tell that they are the same customer, but how would you go about coding that!?

    I have tried for about a week to get this right, but with no joy so I am opening it out to you super coding smart people...

    here is some code for sample data to get you started, any questions let me know and i will try and explain as much as possible..

    MASSIVELY appreciate any help.

    Thanks

    Matt

    with rawdata as

    (

    select '20090503' as dateid

    ,7320 as CustID

    ,'370090510' as CCNumber

    union

    select '20090420' as dateid

    ,25861 as CustID

    ,'370090510' as CCNumber

    union

    select '20100404' as dateid

    ,7320 as CustID

    ,'370080987' as CCNumber

    union

    select '20090703' as dateid

    ,25861 as CustID

    ,'370080987' as CCNumber

    union

    select '20100603' as dateid

    ,7320 as CustID

    ,'594220191' as CCNumber

    union

    select '20100705' as dateid

    ,25861 as CustID

    ,'594220191' as CCNumber

    union

    select '20100719' as dateid

    ,7320 as CustID

    ,'594220654' as CCNumber

    union

    select '20100720' as dateid

    ,7321 as CustID

    ,'594220654' as CCNumber

    )

  • Hi Matt

    You're trying to dedupe a customer list, right? So you end up with a principal (or master) with an associated set of dupes?

    Deduping companies is extraordinarily difficult and complicated because of the number of different ways that the company type can be entered; deduping people is much easier - which type of customer are you referring to?

    What criteria do you think would be good for matching?

    Do you have separate forename & surname (or company name), address1...addressn, postcode?

    If you want to use any of these for matching, can you include sample data please?

    How many rows are you dealing with?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Hi Chris,

    Thanks for taking the time to respond.

    I am indeed trying to dedupe a customer list so that i can have a master customer that is linked to all the other historical instances of that customer on our system. i know, seems like such a regular and simple concept...very difficult!! :crazy:

    The customer detail can vary through all the usual detail that is held. email, home address detail, first name and last name but you cannot guarantee that any of this information on its own is going to guarantee a match. i.e. just because the first name last name matches another record with a different customer id, doesn't mean its the same person.

    Therefore we are splitting it down into passes with each pass having a degree of confidence assigned to it. The sample data that I have included is credit card detail (not real detail obviously!!!) that has been concatenated together to make a unique identity field. This will be the same process used when comparing other data i.e. telephone number will be concatenated with first_name last_name to create a unique identity field that will then be used to match against all other records.

    Assume that this data all has the same name attached to it as this level of matching is the easy bit. (code below)

    on this pass I am looking initially at a section of around 1million customers and around the same number of transactions to search through to identify who has used which CC detail. We would then potentially be looking to open this out to the entire customer base of which i haven't got a number. If we can find a structure that work then we can worry about runtime, processing and optimization at a later date.

    Again, many thanks for your help. Let me know if there is anything else that you need.

    Thanks

    Matt

    select '20090503' as dateid

    ,7320 as CustID

    ,'370090510' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20090420' as dateid

    ,25861 as CustID

    ,'370090510' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20100404' as dateid

    ,7320 as CustID

    ,'370080987' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20090703' as dateid

    ,25861 as CustID

    ,'370080987' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20100603' as dateid

    ,7320 as CustID

    ,'594220191' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20100705' as dateid

    ,25861 as CustID

    ,'594220191' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20100719' as dateid

    ,7320 as CustID

    ,'594220654' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20100720' as dateid

    ,7321 as CustID

    ,'594220654' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

  • Hi Matt

    There's a lot of work involved in getting it just right and how you do it depends on what it's for - if you're doing a one-off process then it would be different to a process running every half hour or overnight.

    A really good start is to inner join the table to itself on postcode, address line 1, surname and the first character of the forename, keeping the ID on the LHS. The capture rate is very good and the false positives are low. Shove the lot into a temp table and reprocess there because it's usually quite expensive. You could, for instance, add more dupes to the temp table based on matches using other criteria.

    At some point you mark sets of potentially-duplicated rows as "dupesets".

    After this you can fine-tune and eliminate poor/ambiguous matches.

    Finally, mark the principal of each dupeset.

    Which part of this overall process are you having difficulty with?

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • This will be potentially a nightly run that will update the customer details. This table will then have the various passes ran against it to identify changes in the relationships between the master and its children. This could mean customerID's being added, them being removed or changes in which is the primary.

    It is not the identification of the duplicates, or even the identification of the most recent it is when you try and identify all the children below the primary.

    If you run the below piece of code then you can see a simplified example (using a horrible correlated sub query) that you are going to be left with two pairs of interchanged ID's.

    this will leave us once a distinct is applied with this situation :

    Master Child

    7320 25861

    7321 7320

    how would you go about linking these customers and allowing for all different variations of this problem that will arise. Some customers have used over 19 different credit cards... at the moment we are looking at updates in a big loop but this is going to have to loop through EVERY customer for EVERY credit card which is going to be a massive hit everyday. I was interested to see if anyone had any better suggestions or solutions to a similar problem.

    Thanks

    Matt

    with rawdata as

    (

    select '20090503' as dateid

    ,7320 as CustID

    ,'370090510' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20090420' as dateid

    ,25861 as CustID

    ,'370090510' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20100404' as dateid

    ,7320 as CustID

    ,'370080987' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20090703' as dateid

    ,25861 as CustID

    ,'370080987' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20100603' as dateid

    ,7320 as CustID

    ,'594220191' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20100705' as dateid

    ,25861 as CustID

    ,'594220191' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20100719' as dateid

    ,7320 as CustID

    ,'594220654' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    union

    select '20100720' as dateid

    ,7321 as CustID

    ,'594220654' as CCNumber

    ,'Jane' as First_Name

    ,'Smith' as Last_Name

    order by custid

    )

    ,

    rawdata2 as

    (

    select custid,MAX(dateid) as maxdateid

    from rawdata

    group by custid

    )

    SELECT distinct

    CASE WHEN CustDate1 > CustDate2 THEN CustID1 ELSE CustID2 END AS [GrandMasterCustomer]

    ,CASE WHEN CustDate1 > CustDate2 THEN CustDate1 ELSE CustDate2 END AS [GrandMasterDate]

    ,CASE WHEN CustDate1 > CustDate2 THEN CustID2 ELSE CustID1 END AS [SubordinateCustomer]

    FROM

    (

    select

    r1.CustID as CustID1

    , (SELECT MAX(d1.dateid) FROM rawdata AS d1 WHERE d1.CustID = r1.CustID) AS CustDate1

    ,r2.CustID as CustID2

    , (SELECT MAX(d2.dateid) FROM rawdata AS d2 WHERE d2.CustID = r2.CustID) AS CustDate2

    from rawdata r1

    join rawdata r2

    on r1.CCNumber = r2.CCNumber

    where r1.CustID<>r2.CustID

    ) AS a

  • Have a play with this, Matt. Hopefully it will give you some ides.

    IF OBJECT_ID( 'tempdb..#rawdata') IS NOT NULL DROP TABLE #rawdata

    IF OBJECT_ID( 'tempdb..#WorkingSet') IS NOT NULL DROP TABLE #WorkingSet

    CREATE TABLE #rawdata (dateid DATETIME, CustID INT, CCNumber CHAR(9), First_Name VARCHAR(10), Last_Name VARCHAR(10))

    INSERT INTO #rawdata (dateid, CustID, CCNumber, First_Name, Last_Name)

    SELECT '20090503', 7320, '370090510', 'Jane', 'Smith' union

    SELECT '20090420', 25861, '370090510', 'Jane', 'Smith' union

    SELECT '20100404', 7320, '370080987', 'Jane', 'Smith' union

    SELECT '20090703', 25861, '370080987', 'Jane', 'Smith' union

    SELECT '20100603', 7320, '594220191', 'Jane', 'Smith' union

    SELECT '20100705', 25861, '594220191', 'Jane', 'Smith' union

    SELECT '20100719', 7320, '594220654', 'Jane', 'Smith' union

    SELECT '20100720', 7321, '594220654', 'Jane', 'Smith'

    SELECT

    DupesetID = CAST(NULL AS BIGINT),

    r1.*,

    r2_CustID = r2.CustID

    INTO #WorkingSet

    FROM #rawdata r1

    INNER JOIN #rawdata r2 ON r2.CCNumber = r1.CCNumber AND r2.CustID <> r1.CustID

    ORDER BY r1.CustID

    -- assign dupesetID

    UPDATE d SET DupesetID = s.DupesetID

    FROM #WorkingSet d

    INNER JOIN (

    SELECT DupesetID = ROW_NUMBER() OVER (ORDER BY Last_Name, CCNumber),

    Last_Name, CCNumber

    FROM #WorkingSet

    GROUP BY Last_Name, CCNumber

    ) s ON s.Last_Name = d.Last_Name

    AND s.CCNumber = d.CCNumber

    -- merge sets with same customerID to earliest DupesetID

    UPDATE d SET DupesetID = s.DupesetID

    FROM #WorkingSet d

    INNER JOIN (

    SELECT CustID, DupesetID = MIN(DupesetID), OLDDupesetID = MAX(DupesetID)

    FROM #WorkingSet

    GROUP BY CustID

    HAVING COUNT(*) > 1 AND MIN(DupesetID) <> MAX(DupesetID)

    ) s ON s.OLDDupesetID = d.DupesetID

    -- merge again if necessary

    -- assign principal, and number of rows in dupeset

    SELECT MasterCustomer = ROW_NUMBER() OVER (PARTITION BY d.DupesetID ORDER BY dateid DESC),

    DupesetSize = COUNT(*) over (PARTITION BY d.DupesetID),

    d.DupesetID, d.dateid, d.CustID, d.CCNumber, d.First_Name, d.Last_Name, d.r2_CustID

    FROM #WorkingSet d

    ORDER BY DupesetID, MasterCustomer

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thanks very much for this. I will work through this to see if it does what i need. I have a feeling that this will only deal with one to one relationship rather than the one to many that is needed. It is difficult to post too much sample data though...

    I massively appreciate your time. We are working on a BIG update statement wrapped in a big while loop which may work...if i come up with a solution i will let you know!!

    Thanks

    Matt

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply