parsing example

  • Hi everyone,

    Could someone provide a(n) example(s) for parsing address fields? Unfortunately, in this database, the address field contains almost everything, that is street number, street name, PO Boxes... ONly the city and zip are in separate fields. What I am trying to do is match addresses of accounts just added (>= yesterday) with addresses of old accounts. If I could just parse out the street name (or number) and then do a "LIKE ...%" that would probably be good enough. The problem is I have data in all these different formats:

    PO BOX 1234

    6N040 SNAKE AVE

    1812 RED BUD RD

    645 N MARYWOOD AVE APT 15B

    228 E 4TH AVE

    Any ideas are greatly appreciated...

    Thanks!!

  • I'm afraid this will be a problem. You can't parse strings automatically with that much variability - and I suppose this was just an example. There can be various abbreviations (e.g. St., Str. for Street), reverse order (street, number vs. number, street) and last but not least typing errors. Part of this work will always have to be done by humans, even the best code can only handle certain percentage of such rows correctly... and as I look at your example, it will be quite a big part.

    In my opinion what you need is to change structure and separate street, house number, city etc. into own column each. This will mean lots of tedious handwork, but once you are finished, you can easily write code that will perform necessary comparisions and find possible duplicates (or whatever you need)... and of course make sure that all new data is immediately stored in separate columns.

    I have no idea how you could leave structure intact and write query, which could do anything properly based on such variable data.

  • That is what I was suspecting, too... I guess I will have to do this in VB, process one line at a time.

    Thanks you very much for your opinion!

    Marius

  • If you have 2005 Enterprise edition, look into the fuzzy match component of SSIS. It can handle this type of compairson nicely. It's also available for other editions, but it can't be run as an automated package on the server and must be run manually though the visual studio enviroment. We have both parsing & matching (our code) and the fuzzy match at the company I work for. We use the fuzzy match to find duplicates that our code couldn't find and it's suprizing the accuracy it has. It handles spelling errors and different orders of saying the address (124 helm st. east vs 124 heim street E) very gracefully.

  • marius.draganescu (9/25/2008)


    That is what I was suspecting, too... I guess I will have to do this in VB, process one line at a time.

    Thanks you very much for your opinion!

    Marius

    Actually - figuring out the algorithm is going to be the hard part.

    if you CAN figure out a reliable way to find the pattern then build it as a SQLCLR functions, so you can access it directly within SQL. Regex and string patterns are one spot those kinds of functions work very well.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • I know what you mean by string patterns, but what is regex?

  • marius.draganescu (9/25/2008)


    I know what you mean by string patterns, but what is regex?

    Regular Expresssions. It's the CLR object/language/syntax usually used to parse out ugly/complex patterns in .NET.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • marius.draganescu (9/25/2008)


    That is what I was suspecting, too... I guess I will have to do this in VB, process one line at a time.

    Thanks you very much for your opinion!

    Marius

    Heh... trust me... no matter how hard you try, you will not get it right without have a list of all streets in the U.S. Many have tried, all have failed.

    You need to buy a CASS certification program from the USPS or some 3rd party. Not only will it do the split for you, but it will also validate the addresses for you. It's amazing how many people don't actually know their correct address. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Interestin! I never heard of that before. My boss will hear about it Monday...

    Thanks!

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply