Address clean-up

  • Hi all

    I would like to get a few suggestions on how to do the following.

    I've got a table with 10 million addresses that need to be geocoded. The geocoding will be done in batches, and I first need to find the accounts that are closest to a major city or town. These will be geocoded first, and the smaller towns and cities in later batches.

    I also have an externally supplied database with all of my country's street, suburb and city details. I would like to match up my town and city columns with the external database to see if they should be in the first batch to be coded.

    The problem is that some of the addresses have, for instance, the suburb in address line 4, the city details in the suburb column, or any other funny combination you can think of.

    SOUNDEX and DIFFERENCE are not really what I'm looking for, but I do see that some people use Q-grams for this as well. Is the Q-gram maybe the way to go? Should I build up a table that checks the SOUNDEX, the DIFFERENCE and then the Q-gram as well between the two different databases/tables, and then decide which rows should be passed first for geocoding?
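
    Something like this is what I have in mind for the candidate table (just a rough sketch; the table and column names below are made up, and the DIFFERENCE threshold is only one possible cut-off):

      -- Hypothetical names: dbo.Accounts and dbo.RefTowns stand in for my table
      -- and the external database. DIFFERENCE (0-4) acts as a cheap first pass;
      -- a Q-gram score would then be computed only for the surviving pairs.
      SELECT a.AccountID,
             a.Town                       AS AccountTown,
             r.Town                       AS ReferenceTown,
             DIFFERENCE(a.Town, r.Town)   AS SoundexScore
      INTO   dbo.TownCandidates
      FROM   dbo.Accounts a
      JOIN   dbo.RefTowns r
        ON   SOUNDEX(a.Town) = SOUNDEX(r.Town)   -- coarse blocking, avoids a full cross join
      WHERE  DIFFERENCE(a.Town, r.Town) >= 3;    -- keep only reasonably close pairs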

    Or is there an easier way of doing things?

    Also, just to add: I see that some of my data might have two letters swapped around in a town name, for instance. When using the Q-gram, I also sort the town name's letters alphabetically first, i.e. asdf would become adfs, and then I do the Q-gram. Is this a good idea to check on?
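
    To show what I mean, a rough stand-alone sketch of both steps (the spellings are made up, and the CTE against sys.all_objects is just a quick way to get a list of numbers):

      DECLARE @a varchar(100) = 'PRETORIA',
              @b varchar(100) = 'PRETROIA';   -- hypothetical spelling with two letters swapped

      ;WITH Numbers AS (
          SELECT TOP (100) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
          FROM sys.all_objects
      ),
      GramsA AS (SELECT DISTINCT SUBSTRING(@a, n, 2) AS gram FROM Numbers WHERE n < LEN(@a)),
      GramsB AS (SELECT DISTINCT SUBSTRING(@b, n, 2) AS gram FROM Numbers WHERE n < LEN(@b))
      SELECT
          -- 2-gram Dice similarity on the raw spellings: the swap costs a few shared bigrams
          2.0 * (SELECT COUNT(*) FROM GramsA ga JOIN GramsB gb ON ga.gram = gb.gram)
              / ((SELECT COUNT(*) FROM GramsA) + (SELECT COUNT(*) FROM GramsB)) AS DiceOnRawNames,
          -- letters sorted alphabetically: both spellings collapse to AEIOPRRT
          (SELECT SUBSTRING(@a, n, 1) FROM Numbers WHERE n <= LEN(@a)
           ORDER BY SUBSTRING(@a, n, 1) FOR XML PATH('')) AS SortedA,
          (SELECT SUBSTRING(@b, n, 1) FROM Numbers WHERE n <= LEN(@b)
           ORDER BY SUBSTRING(@b, n, 1) FOR XML PATH('')) AS SortedB;

    The worry with the sort is that two completely different names that happen to be anagrams of each other would also look identical, so maybe it is only useful as a secondary check?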

    I don't want to go too far down this path only to discover that it is not the right approach.

    Many thanks

  • I've done a fuzzy lookup clean-up of addresses before using SSIS. The data source would be your raw data, with a fuzzy match against your clean data source.

    Do a search on fuzzy matching in SSIS and let me know if you need any more help.

  • I should probably mention it's an Enterprise/Developer-only component of SSIS.

  • OK, will check on the fuzzy lookup. Luckily I'm already using SSIS, so that shouldn't be a problem. (Didn't even think about the fuzzy lookup.) Thanks

  • I found it to be much more accurate than functions like SOUNDEX.

  • Yes, I've never used SOUNDEX before, but, for instance, the result for Wentworth and Wonderboom is the same. The one is an English name and the other an Afrikaans place name; they don't sound the same at all. That's why I'm looking for other solutions. Thanks
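
    You can see the collision directly:

      SELECT SOUNDEX('Wentworth')                  AS Wentworth,    -- W536
             SOUNDEX('Wonderboom')                 AS Wonderboom,   -- W536
             DIFFERENCE('Wentworth', 'Wonderboom') AS SoundexMatch; -- 4 = strongest possible match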

  • Well, I'll actually reply here, even if it is a few years later, as many people may need similar help.

    You will never be able to clean a database of 10 million records where the addresses are shifted between address tokens (mixed content).

    I would propose a couple of approaches to handle a situation like that.

    1. Concatenate all columns that contain address tokens into one column, then try to geocode a sample of, say, 100k rows and write the results back (see the sketch at the end of this post).

    2. Create your own recurrent neural network, train it, and run all 10 million records through it. You basically need to build your own street-address parser.

    Use a bidirectional RNN.

    Take a look at this presentation for more details on how to geocode a large amount of data.
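
    For approach 1, something like this (a minimal sketch; the table and column names are only placeholders for whatever the real schema looks like):

      -- Hypothetical schema: dbo.Addresses with AccountID, AddressLine1..4, Suburb, City.
      SELECT TOP (100000)
             AccountID,
             LTRIM(RTRIM(
                 ISNULL(AddressLine1, '') + ' ' +
                 ISNULL(AddressLine2, '') + ' ' +
                 ISNULL(AddressLine3, '') + ' ' +
                 ISNULL(AddressLine4, '') + ' ' +
                 ISNULL(Suburb, '')       + ' ' +
                 ISNULL(City, '')
             )) AS FullAddress
      INTO   dbo.GeocodeSample   -- feed this table to the geocoder, then write the results back
      FROM   dbo.Addresses
      ORDER BY NEWID();          -- random 100k sample; TABLESAMPLE is cheaper on 10M rows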
