Cleaning up duplicate data in one field??

  • Hello...

    Once upon a time, we had done a bulk insert into a table and somehow repeated the same data in the mailing address field without catching it. I am trying to clean these up and I am having a difficult time finding a clean way to look for all the records where this has happened.

    For example:

    MailingAddress varchar(40)

    1016 ROCKY CREEK DRIVE 1016 ROCKY CREEK

    1100 CASTLE ROCK VILLAS DRIVE 1100 CASTL

    In these examples, you can see where it was somehow inserted twice up to the point where the data was maxed in the field. I need to keep the correct instance and clean up the extra characters. Any ideas?

    Thank you!

  • This solution assumes that the first "word" (the left characters up to the first space in the address) do not repeat in the actual address. See if this helps.

    CREATE TABLE #temp (mailingAddress VARCHAR(40))

    INSERT INTO #temp

    SELECT '1016 ROCKY CREEK DRIVE 1016 ROCKY CREEK '

    UNION ALL

    SELECT '1100 CASTLE ROCK VILLAS DRIVE 1100 CASTL'

    SELECT RTRIM(LEFT(mailingAddress,CHARINDEX(LEFT(mailingAddress,CHARINDEX(' ',mailingAddress)),mailingAddress,LEN(LEFT(mailingAddress,CHARINDEX(' ',mailingAddress))))-1)), * FROM #temp T

    DROP TABLE #temp

  • That does seem to clean up my two examples, but what would be the best way to track down all the records in my db that are like this so I can insert them into the temp table for update?

  • I would try this on a copy of your table first as I haven't tested this fully with all variations of addresses but couldn't you just update your table in place?

    CREATE TABLE #temp (mailingAddress VARCHAR(40))

    INSERT INTO #temp

    SELECT '1016 ROCKY CREEK DRIVE 1016 ROCKY CREE'

    UNION ALL

    SELECT '1100 CASTLE ROCK VILLAS DRIVE 1100 CASTL'

    UNION ALL

    SELECT 'THIS ADDRESS SHOULD NOT BE CHANGED'

    UPDATE #temp

    SET mailingAddress = RTRIM(LEFT(mailingAddress,CHARINDEX(LEFT(mailingAddress,CHARINDEX(' ',mailingAddress)),mailingAddress,LEN(LEFT(mailingAddress,CHARINDEX(' ',mailingAddress))))-1))

    WHERE CHARINDEX(LEFT(mailingAddress,CHARINDEX(' ',mailingAddress)),mailingAddress,LEN(LEFT(mailingAddress,CHARINDEX(' ',mailingAddress)))) > 1

    SELECT * FROM #temp

    DROP TABLE #temp

    Where #temp is your table name?

  • Thanks Matt... This is going to work.

    I did find other records that were duplicated a bit different. I am assuming I will need to tweak your code to change those.

    example

    1105 1105 Brookshire Ln

    2450 2450 Middleset Trail

    In these.. I would need to delete one of the numbers.

  • ... for each row

    ....... what is the index of the last character

    ........... of the largest substring

    ............... repeating no more than once

    Is the above statement a basis for a solution to the problem?

    i.e. Having found the above for each row, take the left LEFT(<address>, <lastindex>) and use this in the UPDATE? (if it it a large table of course, it may be better to INSERT into a new one, check the results, and then replace the old).

    With the SUBSTRING function and an Auxiliary Table of Numbers, a scan can be created to calculate a substring combinations (starting from the begining of the address) for one, and then working backwards for the comparator (Would the article 'Optimising “Ends With” searches with REVERSE' published on this site help here?).

    The first substring length should be >= to the second (which may be zero-length).

    If previous replies did not solve the issue, would the above help as the basis for a solution (if I understood the problem correctly?)?.

  • Yeah, the update statement above won't work and would actually change the addresses you just gave to be 1105 and 2450 respectively. If you are sure those are the only 2 you could just ignore them in a where clause but if you need to code for this case then you would have to change the current where clause.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply