Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase

De-duplicating Customer Detail and assigning parent child structure to multiple customer records Expand / Collapse
Author
Message
Posted Tuesday, July 20, 2010 3:48 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Monday, February 28, 2011 9:44 AM
Points: 181, Visits: 171
Hi Guys,

I have been tasked with de-duplicating data in the company by using a number of different passes that combine and compare customer detail to create a unique identifier. This identifier can be used to match customers on common detail.i.e.telephone number, postcode, Credit card number (and a combination of all).

I am then attempting to highlight a master using a date of order field and link this master to all of the child records that this customer has (identified by the search criteria)

OK, I hope that makes sense. Here is the problem.

I can identify the master customer but linking this Customer back to all of the other records in the parent child structure is proving very difficult as you are in essence trying to link customers have different detail. By the eye you can tell that they are the same customer, but how would you go about coding that!?

I have tried for about a week to get this right, but with no joy so I am opening it out to you super coding smart people...

here is some code for sample data to get you started, any questions let me know and i will try and explain as much as possible..

MASSIVELY appreciate any help.

Thanks

Matt


with rawdata as
(
select '20090503' as dateid
,7320 as CustID
,'370090510' as CCNumber
union
select '20090420' as dateid
,25861 as CustID
,'370090510' as CCNumber
union
select '20100404' as dateid
,7320 as CustID
,'370080987' as CCNumber
union
select '20090703' as dateid
,25861 as CustID
,'370080987' as CCNumber
union
select '20100603' as dateid
,7320 as CustID
,'594220191' as CCNumber
union
select '20100705' as dateid
,25861 as CustID
,'594220191' as CCNumber
union
select '20100719' as dateid
,7320 as CustID
,'594220654' as CCNumber
union
select '20100720' as dateid
,7321 as CustID
,'594220654' as CCNumber
)







Post #955399
Posted Tuesday, July 20, 2010 4:12 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Thursday, December 18, 2014 1:43 AM
Points: 6,890, Visits: 14,254
Hi Matt

You're trying to dedupe a customer list, right? So you end up with a principal (or master) with an associated set of dupes?

Deduping companies is extraordinarily difficult and complicated because of the number of different ways that the company type can be entered; deduping people is much easier - which type of customer are you referring to?

What criteria do you think would be good for matching?
Do you have separate forename & surname (or company name), address1...addressn, postcode?
If you want to use any of these for matching, can you include sample data please?
How many rows are you dealing with?


“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #955409
Posted Tuesday, July 20, 2010 4:50 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Monday, February 28, 2011 9:44 AM
Points: 181, Visits: 171
Hi Chris,

Thanks for taking the time to respond.

I am indeed trying to dedupe a customer list so that i can have a master customer that is linked to all the other historical instances of that customer on our system. i know, seems like such a regular and simple concept...very difficult!!

The customer detail can vary through all the usual detail that is held. email, home address detail, first name and last name but you cannot guarantee that any of this information on its own is going to guarantee a match. i.e. just because the first name last name matches another record with a different customer id, doesn't mean its the same person.

Therefore we are splitting it down into passes with each pass having a degree of confidence assigned to it. The sample data that I have included is credit card detail (not real detail obviously!!!) that has been concatenated together to make a unique identity field. This will be the same process used when comparing other data i.e. telephone number will be concatenated with first_name last_name to create a unique identity field that will then be used to match against all other records.

Assume that this data all has the same name attached to it as this level of matching is the easy bit. (code below)

on this pass I am looking initially at a section of around 1million customers and around the same number of transactions to search through to identify who has used which CC detail. We would then potentially be looking to open this out to the entire customer base of which i haven't got a number. If we can find a structure that work then we can worry about runtime, processing and optimization at a later date.

Again, many thanks for your help. Let me know if there is anything else that you need.

Thanks

Matt

select      '20090503' as dateid
,7320 as CustID
,'370090510' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name
union
select '20090420' as dateid
,25861 as CustID
,'370090510' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name

union
select '20100404' as dateid
,7320 as CustID
,'370080987' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name
union
select '20090703' as dateid
,25861 as CustID
,'370080987' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name
union
select '20100603' as dateid
,7320 as CustID
,'594220191' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name
union
select '20100705' as dateid
,25861 as CustID
,'594220191' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name
union
select '20100719' as dateid
,7320 as CustID
,'594220654' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name
union
select '20100720' as dateid
,7321 as CustID
,'594220654' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name





Post #955423
Posted Tuesday, July 20, 2010 5:19 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Thursday, December 18, 2014 1:43 AM
Points: 6,890, Visits: 14,254
Hi Matt

There's a lot of work involved in getting it just right and how you do it depends on what it's for - if you're doing a one-off process then it would be different to a process running every half hour or overnight.
A really good start is to inner join the table to itself on postcode, address line 1, surname and the first character of the forename, keeping the ID on the LHS. The capture rate is very good and the false positives are low. Shove the lot into a temp table and reprocess there because it's usually quite expensive. You could, for instance, add more dupes to the temp table based on matches using other criteria.
At some point you mark sets of potentially-duplicated rows as "dupesets".
After this you can fine-tune and eliminate poor/ambiguous matches.
Finally, mark the principal of each dupeset.

Which part of this overall process are you having difficulty with?


“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #955445
Posted Tuesday, July 20, 2010 7:40 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Monday, February 28, 2011 9:44 AM
Points: 181, Visits: 171
This will be potentially a nightly run that will update the customer details. This table will then have the various passes ran against it to identify changes in the relationships between the master and its children. This could mean customerID's being added, them being removed or changes in which is the primary.

It is not the identification of the duplicates, or even the identification of the most recent it is when you try and identify all the children below the primary.

If you run the below piece of code then you can see a simplified example (using a horrible correlated sub query) that you are going to be left with two pairs of interchanged ID's.

this will leave us once a distinct is applied with this situation :

Master Child
7320 25861
7321 7320

how would you go about linking these customers and allowing for all different variations of this problem that will arise. Some customers have used over 19 different credit cards... at the moment we are looking at updates in a big loop but this is going to have to loop through EVERY customer for EVERY credit card which is going to be a massive hit everyday. I was interested to see if anyone had any better suggestions or solutions to a similar problem.

Thanks

Matt


with rawdata as
(
select '20090503' as dateid
,7320 as CustID
,'370090510' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name
union
select '20090420' as dateid
,25861 as CustID
,'370090510' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name

union
select '20100404' as dateid
,7320 as CustID
,'370080987' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name
union
select '20090703' as dateid
,25861 as CustID
,'370080987' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name
union
select '20100603' as dateid
,7320 as CustID
,'594220191' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name
union
select '20100705' as dateid
,25861 as CustID
,'594220191' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name
union
select '20100719' as dateid
,7320 as CustID
,'594220654' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name
union
select '20100720' as dateid
,7321 as CustID
,'594220654' as CCNumber
,'Jane' as First_Name
,'Smith' as Last_Name
order by custid
)
,
rawdata2 as
(
select custid,MAX(dateid) as maxdateid
from rawdata
group by custid
)

SELECT distinct
CASE WHEN CustDate1 > CustDate2 THEN CustID1 ELSE CustID2 END AS [GrandMasterCustomer]
,CASE WHEN CustDate1 > CustDate2 THEN CustDate1 ELSE CustDate2 END AS [GrandMasterDate]
,CASE WHEN CustDate1 > CustDate2 THEN CustID2 ELSE CustID1 END AS [SubordinateCustomer]
FROM
(
select
r1.CustID as CustID1
, (SELECT MAX(d1.dateid) FROM rawdata AS d1 WHERE d1.CustID = r1.CustID) AS CustDate1
,r2.CustID as CustID2
, (SELECT MAX(d2.dateid) FROM rawdata AS d2 WHERE d2.CustID = r2.CustID) AS CustDate2
from rawdata r1
join rawdata r2

on r1.CCNumber = r2.CCNumber
where r1.CustID<>r2.CustID
) AS a






Post #955558
Posted Tuesday, July 20, 2010 8:36 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Thursday, December 18, 2014 1:43 AM
Points: 6,890, Visits: 14,254
Have a play with this, Matt. Hopefully it will give you some ides.

IF OBJECT_ID( 'tempdb..#rawdata') IS NOT NULL DROP TABLE #rawdata
IF OBJECT_ID( 'tempdb..#WorkingSet') IS NOT NULL DROP TABLE #WorkingSet

CREATE TABLE #rawdata (dateid DATETIME, CustID INT, CCNumber CHAR(9), First_Name VARCHAR(10), Last_Name VARCHAR(10))
INSERT INTO #rawdata (dateid, CustID, CCNumber, First_Name, Last_Name)
SELECT '20090503', 7320, '370090510', 'Jane', 'Smith' union
SELECT '20090420', 25861, '370090510', 'Jane', 'Smith' union
SELECT '20100404', 7320, '370080987', 'Jane', 'Smith' union
SELECT '20090703', 25861, '370080987', 'Jane', 'Smith' union
SELECT '20100603', 7320, '594220191', 'Jane', 'Smith' union
SELECT '20100705', 25861, '594220191', 'Jane', 'Smith' union
SELECT '20100719', 7320, '594220654', 'Jane', 'Smith' union
SELECT '20100720', 7321, '594220654', 'Jane', 'Smith'

SELECT
DupesetID = CAST(NULL AS BIGINT),
r1.*,
r2_CustID = r2.CustID
INTO #WorkingSet
FROM #rawdata r1
INNER JOIN #rawdata r2 ON r2.CCNumber = r1.CCNumber AND r2.CustID <> r1.CustID
ORDER BY r1.CustID

-- assign dupesetID
UPDATE d SET DupesetID = s.DupesetID
FROM #WorkingSet d
INNER JOIN (
SELECT DupesetID = ROW_NUMBER() OVER (ORDER BY Last_Name, CCNumber),
Last_Name, CCNumber
FROM #WorkingSet
GROUP BY Last_Name, CCNumber
) s ON s.Last_Name = d.Last_Name
AND s.CCNumber = d.CCNumber

-- merge sets with same customerID to earliest DupesetID
UPDATE d SET DupesetID = s.DupesetID
FROM #WorkingSet d
INNER JOIN (
SELECT CustID, DupesetID = MIN(DupesetID), OLDDupesetID = MAX(DupesetID)
FROM #WorkingSet
GROUP BY CustID
HAVING COUNT(*) > 1 AND MIN(DupesetID) <> MAX(DupesetID)
) s ON s.OLDDupesetID = d.DupesetID

-- merge again if necessary


-- assign principal, and number of rows in dupeset
SELECT MasterCustomer = ROW_NUMBER() OVER (PARTITION BY d.DupesetID ORDER BY dateid DESC),
DupesetSize = COUNT(*) over (PARTITION BY d.DupesetID),
d.DupesetID, d.dateid, d.CustID, d.CCNumber, d.First_Name, d.Last_Name, d.r2_CustID
FROM #WorkingSet d
ORDER BY DupesetID, MasterCustomer



“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Post #955630
Posted Tuesday, July 20, 2010 10:31 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Monday, February 28, 2011 9:44 AM
Points: 181, Visits: 171
Thanks very much for this. I will work through this to see if it does what i need. I have a feeling that this will only deal with one to one relationship rather than the one to many that is needed. It is difficult to post too much sample data though...

I massively appreciate your time. We are working on a BIG update statement wrapped in a big while loop which may work...if i come up with a solution i will let you know!!

Thanks

Matt
Post #955742
Posted Wednesday, July 21, 2010 10:43 AM


SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Today @ 7:34 AM
Points: 1,945, Visits: 3,182
Have you looked at Melissa Data , SSA and other companies that specifies in cleaning up mailing lists and such like?

Books in Celko Series for Morgan-Kaufmann Publishing
Analytics and OLAP in SQL
Data and Databases: Concepts in Practice
Data, Measurements and Standards in SQL
SQL for Smarties
SQL Programming Style
SQL Puzzles and Answers
Thinking in Sets
Trees and Hierarchies in SQL
Post #956522
« Prev Topic | Next Topic »

Add to briefcase

Permissions Expand / Collapse