Parsing address

  • Hi there,

    I have a table with address field of 4 million records. This field has Street address + city, province, zipcode in the address. I want to separate them as different columns.

    Here are different values of addresses

    First 5 digits are zipcode. kraj,kray & oblast mean region, anything after G and before comma is city name

    create table temp (address varchar(200) not null)

    insert into temp values('12345, Sverdlovskaya oblast, g Ussurijsk, per Leningradskij, d')

    insert into temp values('12345, Primorskij kraj, g Ussurijsk, ul Vorovskogo, d 91')

    insert into temp values('12345, g Ussurijsk, ul Vorovskogo, d 91')

    insert into temp values('12345, STAVROPOL KRAY, g Ussurijsk, ul Vorovskogo, d 91')

    I know how to parse zipcode

    select SUBSTRING(Address,0,6) as "zipcode" from temp

    but I am clueless how to go about writing a parser for city and region

    Thanks

  • use patindex to find the location of your delimiters and use those loactions for start and end in substring.

  • I'm not completely sure that this will work for you, as I'm not sure if I'm reading it right. You can use Jeff Moden's DelimitedSplit8K function to split out the different parts of your address and then query them by position. If you aren't familiar with DelimitedSplit8K, check the second link in my signature. It's well worth the time to read the article and I think you'll be happy with performance.

    select address,

    zip = MAX(case when s.itemnumber = 1 then s.item end),

    region = MAX(case when s.itemnumber = 2 then s.item end),

    city = MAX(case when s.itemnumber = 3 then s.item end),

    address = MAX(case when s.itemnumber = 4 then s.item end)

    from temp

    cross apply DelimitedSplit8K(address, ',') s

    group by address;

    This approach just requires that the order of the delimited parts is in the same sequence in the string. Please adjust the parts of the address if I got them wrong. HTH.

  • A variation on Ed's query to take into account the rules that you mentioned. It still uses the Delimited8kSplitter, but checks the items rather than the itemnumbers

    SELECT address,

    MAX(CASE WHEN itemnumber = 1 THEN item END) postcode,

    MAX(CASE WHEN item like '% oblast' or item like '% kra[yj]' THEN reverse(stuff(rev,1,charindex(' ',rev),'')) END) region,

    MAX(CASE WHEN item like 'g %' THEN stuff(item,1,2,'') END) city,

    MAX(CASE WHEN item like 'd %' THEN stuff(item,1,2,'') END) num

    FROM temp t

    CROSS APPLY (

    SELECT itemnumber,

    ltrim(item) item,

    reverse(ltrim(item)) rev

    FROM DelimitedSplit8K(address,','))x

    GROUP BY address;

    I hope I was right in guessing that you would want the identifier strings stripped out of the result.

    EDIT: Cleaned it up a little

  • Nice work, Micky. Thanks for picking up the part that wasn't sinking in for me.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply