Remove redundant data from address column

  • Hi,

    Consider the folllwing two records:

    if object_id('test') > 0 drop table dbo.test

    create table dbo.test (

    master_idint,

    Titlevarchar(20),

    Forenamevarchar(50),

    Surnamevarchar(50),

    Address1varchar(200),

    Address2varchar(200),

    Address3varchar(200),

    Address4varchar(200),

    Townvarchar(100),

    Countyvarchar(100),

    Postcodevarchar(20),

    Hierarchyint

    )

    insert into dbo.test

    select 75984,'Mr','O','SQLady','6 Vivien Avenue Midsomer Norto','Midsomer Norton',NULL, NULL, 'RADSTOCK',NULL, 'BA6 2VG',2 union all

    select 76144,'Mrs','N','Replacemento','56 Vivien Avenue Midsomer Nort','Midsomer Norton',NULL, NULL, 'RADSTOCK',NULL, 'BA6 2VG',2

    select * from dbo.test

    As you can see, Address1 contains part of address 2 which is incorrect. I tried to use replace but this doesn't find the whole of address2 in adress1 so nothing changes. Any ideas how to remove the Midsomer Nort from address 1?

    Thanks in advance.

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • I've seperated the street number but if anyone can help me remove the redundant information from the address1 I'd be very grateful.

    update a

    set a.streetno = F.StreetNo

    from dbo.test as a

    cross apply (select charindex(' ', Address1) AS CommaPos) F0

    cross apply (select case when CommaPos = 0 then 'Bad Address' else SUBSTRING(Address1,1,CommaPos - 1) end as StreetNo) F

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • It's one of those things that sounds trivially easy until you start trying to do it.

    You could set it to remove subsets, but then you end up with Address2 = "No 6", and you end up deleting "No" out of "1234 North Main St" in Address1, so you end up with "1234 rth Main St".

    How important is this to the business you work for/with? In the US (that doesn't look like a US address, but I didn't check), you can buy software that will clean up addresses for you. It's called "CASSing", and the vendors have to be "PAVE Certified" by the USPS. Canada has a similar system available from various vendors. I believe so do most of the European countries. I haven't checked Latin American countries for that kind of thing (the address looks like it might be LATAM or Spain is why I mention those).

    If address validation and clean-up matters, I'd look into getting something like that. You can Bing/Google the companies that provide it. If you go that route (assuming it's available for this address' local), it'll clean those kinds of things up for you, and help you identify the ones that it can't auto-clean. Very efficient.

    Addresses, like human names, are a lot trickier to clean up than they look at first glance.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared, everything you say is correct.

    I'm in the UK and yes we do use software to clean addresses for some customer data but others don't actually want their addresses rebuilding so we're having to manually cleanse the address fields which is a nightmare to do.

    My main project is the data merge but without correctly formatted address fields this process is proving very difficult. Another member Chris@Work/Chris@Home has given some great hints in another thread but I feel like I’m' been going round in circles this week. I can't even get my household merge to work!

    The current scripts used, although not very efficient but they do the job. To get past this problem the previous developer basically did a comparison of the LEFT(Address1, 12).

    I was hoping to be able to standardise my addresses somehow and use the fuzzy mating ratio suggested by Chris to work out my dupes.

    It’s the end of my second week in this job and I’m already starting to panic! :unsure:

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Thinking about this more, I wonder if the following will work.

    Street names in the UK tend to end with words like Street, Avenue, Crescent etc, if I replace these with say an asterisk then I can remove anything right of the asterisk to get my street name.

    Will need to test on a large address data set to see how it performs.

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

  • Abu Dina (8/3/2012)


    Thinking about this more, I wonder if the following will work.

    Street names in the UK tend to end with words like Street, Avenue, Crescent etc, if I replace these with say an asterisk then I can remove anything right of the asterisk to get my street name.

    Will need to test on a large address data set to see how it performs.

    Be careful with that. Take into account that "Street" is often abbreviated "St", but so is "Saint". So, if you have "1234 St George Way", you'll end up losing part of the address that matters.

    You might also want to make sure appartment/unit/suite numbers, and directionals don't get lost through something like "1234 1st Ave S, #6", where you'd lose the "South" directional, and the unit number.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared (8/6/2012)


    Abu Dina (8/3/2012)


    Thinking about this more, I wonder if the following will work.

    Street names in the UK tend to end with words like Street, Avenue, Crescent etc, if I replace these with say an asterisk then I can remove anything right of the asterisk to get my street name.

    Will need to test on a large address data set to see how it performs.

    Be careful with that. Take into account that "Street" is often abbreviated "St", but so is "Saint". So, if you have "1234 St George Way", you'll end up losing part of the address that matters.

    You might also want to make sure appartment/unit/suite numbers, and directionals don't get lost through something like "1234 1st Ave S, #6", where you'd lose the "South" directional, and the unit number.

    You've been through this pain before, haven't you? πŸ™‚

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • How about "cleaning" all address into another table with a pointer back to the original "unclean" address.

    You can than compare the clean addresses to identify duplicates.

  • Phil Parkin (8/6/2012)


    GSquared (8/6/2012)


    Abu Dina (8/3/2012)


    Thinking about this more, I wonder if the following will work.

    Street names in the UK tend to end with words like Street, Avenue, Crescent etc, if I replace these with say an asterisk then I can remove anything right of the asterisk to get my street name.

    Will need to test on a large address data set to see how it performs.

    Be careful with that. Take into account that "Street" is often abbreviated "St", but so is "Saint". So, if you have "1234 St George Way", you'll end up losing part of the address that matters.

    You might also want to make sure appartment/unit/suite numbers, and directionals don't get lost through something like "1234 1st Ave S, #6", where you'd lose the "South" directional, and the unit number.

    You've been through this pain before, haven't you? πŸ™‚

    Was the DBA for a direct-mail marketing company for 7 years. If there's one type of data I'm overqualified to suffer from, it's name-and-address tables.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Below works! If I understood correctly πŸ™‚

    create table #test (

    master_idint,

    Titlevarchar(20),

    Forenamevarchar(50),

    Surnamevarchar(50),

    Address1varchar(200),

    Address2varchar(200),

    Address3varchar(200),

    Address4varchar(200),

    Townvarchar(100),

    Countyvarchar(100),

    Postcodevarchar(20),

    Hierarchyint

    )

    insert into #test

    select 75984,'Mr','O','SQLady','6 Vivien Avenue Midsomer Norto','Midsomer Norton',NULL, NULL, 'RADSTOCK',NULL, 'BA6 2VG',2 union all

    select 76144,'Mrs','N','Replacemento','56 Vivien Avenue Midsomer Nort','Midsomer Norton',NULL, NULL, 'RADSTOCK',NULL, 'BA6 2VG',2

    select * from #test

    DECLARE @index INT

    DECLARE @Add1 VARCHAR(MAX)

    DECLARE @Add2 VARCHAR(MAX)

    DECLARE Addres CURSOR FOR SELECT address1, address2 FROM #test

    OPEN addres

    FETCH NEXT FROM addres INTO @add1, @add2

    WHILE @@FETCH_STATUS = 0

    BEGIN

    SET @index = 1

    WHILE @index <= LEN(@add2)

    BEGIN

    IF @Add2 = RIGHT(@add1, LEN(@Add2))

    BEGIN

    UPDATE #test

    SET address1 = REPLACE(address1, @Add2, '')

    WHERE address1 = @add1

    END

    SET @add2 = LEFT(@add2, LEN(@Add2)-1)

    SET @index = @index +1

    END

    FETCH NEXT FROM addres INTO @add1, @add2

    END

    CLOSE addres

    DEALLOCATE addres

    SELECT * FROM #test

    DROP TABLE #test

  • Yes, that works for the sample data provided (although the cursor isn't necessary) but what happens when you have '11 High Street' and 'Chester-le-Street' as your address1 and address2?

    I think Michael's idea is the best, if this exercise is really necessary. It's going to involve a lot of manual work. I can't understand why you'd give your customers a choice about what format their address is stored in. The format approved by Royal Mail should be the one to use. There may be something I don't know about your business, so perhaps I'm wrong.

    As Gus mentions, beware of assuming street names follow a standard. Here's some examples of some well-known streets in London that don't: Strand, Kensington Gore, Eastcheap, New Change, Green Lanes.

    John

  • John - Didn't understood that how code will fail there? May you please elaborate a bit.

  • My apologies. I performed a visual inspection but didn't actually test it. Try it with '11 Village Green' and 'Greenford' instead.

    John

  • Well John I agree if such addresses exists in UK then obviously my code will fail. May be there is a lot I need to learn about UK. A newbie is newbie and an expert is an expert. πŸ™‚

  • John Mitchell-245523 (8/8/2012)

    I think Michael's idea is the best, if this exercise is really necessary. It's going to involve a lot of manual work. I can't understand why you'd give your customers a choice about what format their address is stored in. The format approved by Royal Mail should be the one to use. There may be something I don't know about your business, so perhaps I'm wrong.

    John

    Yes, MVJ's idea sounds good.

    I'm working on a project for the data services department. The source data comes from various customers so we don't have control on data quality I'm afraid!

    I will create a copy of the data which links back to the original source and rebuild the address fields using PAF software. How I get this bit automated is another problem I will need to work out!

    ---------------------------------------------------------

    It takes a minimal capacity for rational thought to see that the corporate 'free press' is a structurally irrational and biased, and extremely violent, system of elite propaganda.
    David Edwards - Media lens[/url]

    Society has varying and conflicting interests; what is called objectivity is the disguise of one of these interests - that of neutrality. But neutrality is a fiction in an unneutral world. There are victims, there are executioners, and there are bystanders... and the 'objectivity' of the bystander calls for inaction while other heads fall.
    Howard Zinn

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply