Merge similar records from excel

  • I am working on a project that will require me to get a flat data file (excel spreadsheet) with hundreds of thousands of records. Each record is an Owner, and specifically what they own. There will be a field for OwnerName that I want to figure out a way to pull the data into a database like;

    Table(Owners) - make sure owner is listed only once

    Table(Properties) - joined to owners showing all properties that person owns

    Now the tricky part, the owner names might not be exactly the same. Some records might have;

    Smith, John

    John Smith

    Smith, John T

    etc.

    To make matters worse, this will be a continuous process. I will receive updated excel spreadsheets from time to time and will need to import the new records, many times overwriting the old data. For the good news, there should be an OwnerID that will be unique within the excel data. So as I am merging similar records into the Owners table, I should have a list of OwnerID's that can forever be used to link to the owner.

    Any thoughts on the best process for something like this?

  • i think you need to provider more information.

    it is certainly possible that two people have the same name...how do you know John Smith is the same or a different, new John Smith?

    do you have better elements for uniqueness, like phone number or email address, for example?

    you said you will your source have a uniqueness value, like the OwnerID in your Source excel, so what's wrong with changing the name to the last value received?

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Sure, the flat data file has;

    OwnerID

    OwnerName

    OwnerAddress

    OwnerCity

    OwnerState

    OwnerZip

    PropertyDescription

    County

    etc.

    So John Smith might have OwnerID=1 in Tarrant County. And there might be Smith, John in Reno County that has OwnerID=205. I'm still wrapping my head around the best way to accomplish this, but I'm thinking I might want to find something like owner name is similar AND zip code is the same. I know it's not perfect but it should be fairly accurate in identifying the same person. I could also use owner name is similar AND address is similar. I would have to use a similar logic as I know the data will not be character for character the same.

    So, any ideas on how to perform this sort of data organization?

  • reamades (5/19/2015)


    Sure, the flat data file has;

    OwnerID

    OwnerName

    OwnerAddress

    OwnerCity

    OwnerState

    OwnerZip

    PropertyDescription

    County

    etc.

    So John Smith might have OwnerID=1 in Tarrant County. And there might be Smith, John in Reno County that has OwnerID=205. I'm still wrapping my head around the best way to accomplish this, but I'm thinking I might want to find something like owner name is similar AND zip code is the same. I know it's not perfect but it should be fairly accurate in identifying the same person. I could also use owner name is similar AND address is similar. I would have to use a similar logic as I know the data will not be character for character the same.

    So, any ideas on how to perform this sort of data organization?

    Is the "OwnerID" any good? The way your'e talking about doing it, could (and most likely will) cause you to group multiple distinct individuals.

    When I've had to do this in the past, I used a point system... If name was a match, N points, if address was a match, N points... At the end, add up the points. If the total meets or exceeds a threashold, they are deemed a match. If not, they are regarded as separate individuals.

    As far as how to distribute the data... Start by landing the into a single staging table and then parse it out to a normalized table structure from there.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply