Entity Resolution

  • The problem we are trying to solve is how to distinguish one thing from another. It was never an issue while the data was processed in silos, but now that we are consolidating the sources, all hell is breaking loose.

    A brief background: an SSN used to stand for one and only one claimant. However, in looking at some reports after the consolidation, we found that the same SSN is sometimes reused, once for a member and again for a covered dependent, meaning an employee's SSN can be reused for their spouse and/or children. So we decided to bring first names and birthdays into the mix to create a unique business key composed of SSN, First, Last and DOB. A surrogate key would stamp each row unique, but it won't resolve the problem, because we still don't know whether two instances are the same entity.
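    As a sketch of the idea: grouping on the four-part key separates the member from the dependent even though the SSN collides. A minimal Python illustration (the records and field names here are made up, not from the actual files):

```python
from collections import defaultdict

# Hypothetical claimant rows; an employee and a dependent sharing an SSN,
# plus the same employee appearing again from a second source.
records = [
    {"ssn": "123-45-6789", "first": "Pat", "last": "Smith", "dob": "1980-01-02"},
    {"ssn": "123-45-6789", "first": "Lee", "last": "Smith", "dob": "2005-07-09"},  # dependent reusing the SSN
    {"ssn": "123-45-6789", "first": "Pat", "last": "Smith", "dob": "1980-01-02"},  # same person, another source
]

def business_key(rec):
    """Four-part business key: SSN + first + last + DOB, lightly normalized."""
    return (rec["ssn"], rec["first"].strip().upper(),
            rec["last"].strip().upper(), rec["dob"])

# Group rows by the four-part key: the SSN alone collides, the full key does not.
groups = defaultdict(list)
for rec in records:
    groups[business_key(rec)].append(rec)

# How many distinct business keys share each SSN?
ssns = defaultdict(set)
for key in groups:
    ssns[key[0]].add(key)
```

    Here the two "Pat Smith" rows fold into one key while the dependent gets her own, which is exactly the behavior the four-part key was meant to provide.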

    After experimenting with the four-part business key hypothesis, we still found cases that break its uniqueness assumption. Primarily, these are misspellings or truncations of a first name that really indicate the same person (e.g. Cass and Cassandra have the same SSN, last name and birthday but different first names), or genuinely different first names with the same birthday, indicating twins (Duane and David) using the employee's SSN. There are five different sources, a census, two medical files, a dental file and a prescription file, all of which can add their own spin to the identification.
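    One way to triage the first-name conflicts in code is a prefix-plus-similarity heuristic, sketched below in Python. This is an illustration, not the actual fix; the function name, the threshold, and the prefix rule are all assumptions, and a rule like this would still misfire on edge cases (e.g. a twin who goes by a short form of his brother's name), so flagged pairs would still want review:

```python
from difflib import SequenceMatcher

def likely_same_person(first_a, first_b, threshold=0.8):
    """Heuristic for two records that already share SSN, last name, and DOB:
    do the differing first names look like a truncation/misspelling of the
    same name, or like genuinely different names (e.g. twins)?"""
    a, b = first_a.strip().upper(), first_b.strip().upper()
    if a == b:
        return True
    # Truncation case: CASS is a prefix of CASSANDRA.
    if a.startswith(b) or b.startswith(a):
        return True
    # Misspelling case: high character-level similarity.
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

    With this rule, Cass/Cassandra collapses to one person while Duane/David stays as two, matching the examples above.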

    Trying to do it via script or SSIS fuzzy grouping still seems to demand manual intervention to resolve duplicates, which we would really like to avoid. Is this fixable without DQS and/or MDM, and if it is, what's the best resource to read?

    Thanks for bearing with me... I'm sure someone has fixed this before.

  • drew.georgopulos (1/26/2015)


    Ugh... problems like this are the worst. I can't think of an automated way to untangle this mess. When I approach problems like this, I ask myself how I would look at the data and know what it is supposed to be, and then how I would write code that could make the same decision I just made. In a case like this, I'd be afraid that there would be too many exceptions to my rules, and by the time you got done programming all the exceptions, you might as well have done all the cataloging manually. Or you can catalog the things that are straightforward and manually review the things that get flagged as potential issues.

    The other thing I can think of is that, if you still have your source data, you can see what it looked like before the merge. Maybe you can find something there that would let you map out the data as it is in the new system.

    I would think you want to create a surrogate key for your new entities if the keys you have chosen aren't cutting it. If you can create your surrogate key based on the original system and then figure out how to map that key to the new system, you might make some headway. So, if you have a row in the original system that's identified by, say, an SSN and a timestamp, you could mark it with your surrogate key and then find that row in the new consolidated system by timestamp and SSN and mark it with the new key.
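    That carry-forward might look like this minimal Python sketch (the `(ssn, ts)` match key and all field names are assumptions, standing in for whatever uniquely identifies a row in the original source):

```python
from itertools import count

# Hypothetical rows from one original source, keyed by (ssn, timestamp)...
original = [
    {"ssn": "123-45-6789", "ts": "2014-06-01T10:00:00"},
    {"ssn": "987-65-4321", "ts": "2014-06-01T10:05:00"},
]

# ...and the same rows after consolidation, arriving in a different order.
consolidated = [
    {"ssn": "987-65-4321", "ts": "2014-06-01T10:05:00", "source": "dental"},
    {"ssn": "123-45-6789", "ts": "2014-06-01T10:00:00", "source": "census"},
]

# Stamp each original row with a surrogate key...
seq = count(1)
key_map = {(r["ssn"], r["ts"]): next(seq) for r in original}

# ...then carry that key into the consolidated system by matching on (ssn, ts).
for row in consolidated:
    row["surrogate_id"] = key_map.get((row["ssn"], row["ts"]))
```

    Rows that come back with no `surrogate_id` are the ones that didn't match anything in the original system and need a closer look.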

    Sorry, that's the best I can come up with at the moment. Maybe other people can come up with better ideas.

  • Thanks a lot, that's kind of what I did to get here; in other words, I did a ROW_NUMBER() over a partition by my assertion of uniqueness. There were about eight hundred failures out of forty-six hundred claimant records, meaning that even where I got an rn of 1, the rows are still ambiguous. I'm leaning towards undertaking the fix in DQS instead of scripts, because even though there are a lot of exceptions and synonyms, at least I can book them and use its "show only new entries" view to get some traction over it, and then maybe SSIS to maintain it.
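    The "rn of 1 but still ambiguous" situation can be illustrated in Python: after deduplicating on the four-part key, the keys that still collide on SSN, last name and DOB alone are the ones needing review (the sample rows below are made up; the real check was a T-SQL ROW_NUMBER() over a partition):

```python
from collections import defaultdict

# Hypothetical claimant rows, already distinct on the four-part business key
# (so each would get rn = 1), yet the first two may still be the same person.
rows = [
    ("111-11-1111", "CASS", "JONES", "1990-03-04"),
    ("111-11-1111", "CASSANDRA", "JONES", "1990-03-04"),
    ("222-22-2222", "DREW", "SMITH", "1975-12-01"),
]

# Regroup on the three stable parts (SSN, last name, DOB), collecting the
# differing first names for each group.
by_partial = defaultdict(set)
for ssn, first, last, dob in rows:
    by_partial[(ssn, last, dob)].add(first)

# Groups with more than one first name are the ambiguous survivors.
ambiguous = {k: firsts for k, firsts in by_partial.items() if len(firsts) > 1}
```

    Those surviving groups are the candidates to book as synonyms or exceptions in DQS.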

    I appreciate your response because I dreaded showing myself as a poor kettle of fish, and misery loves company, so it was some comfort to achieve resonance with another practitioner trampling out the vineyard, so to speak.

    Yes, I will add a surrogate key going in; that may help as well. I think it's going to be heavily iterative, but I also think that's what the inspiration for creating the tool was.

    Thanks again.
