Canadian Postal Code and US Zip

  • I have an import process where we receive US Zip and Canadian Postal Codes in the same field. We use the left most character to determine the region in which that record belongs.

    The problem is that we receive files that contain erroneous values because the databases under which they were entered have no validation (probably AS400 systems) and the address information is corrected and handled external from the organization to which sends us the file. In other words there is no way for us to correct this problem at the source, we have to live with receiving garbage data. We import many disparate files into one nvarchar/varchar table, which is where this function would work.

    I wrote a function to clean the postal code/zip please tell me if there is a better way to write it.

    I found details on how to format postal codes and zip codes from these two Wikipedia articles:

    http://en.wikipedia.org/wiki/Postal_codes_in_Canada

    http://en.wikipedia.org/wiki/ZIP_code

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    -- =============================================

    -- Author:W McCarty

    -- Create date: 1/23/2012

    -- Description:This function will accept a Canadian Postal Code or US Zip and strip unwanted characters

    -- and then check validity of the Code, and pass back a valid Code, or Null if none can be found

    -- =============================================

    Alter Function [dbo].[CleanPostalCode](@Temp VarChar(1000))

    Returns VarChar(1000)

    AS

    Begin

    -- Declare a temporary value that will be returned leaving the passed value intact

    declare @RetStmt VarChar(1000)

    -- Strip out non alpha-numeric characters

    While PatIndex('%[^a-z0-9]%', @Temp) > 0

    Set @Temp = Stuff(@Temp, PatIndex('%[^a-z0-9]%', @Temp), 1, '')

    -- If there is a Canadian Postal Code in the cleaned string, then set the return statement to that code

    If PatIndex('%[a-eghj-npr-tvxy][0-9][a-eghj-npr-tv-z][0-9][a-eghj-npr-tv-z][0-9]%', @Temp) >0

    Begin

    set @RetStmt = Substring(@Temp, PatIndex('%[a-eghj-npr-tvxy][0-9][a-eghj-npr-tv-z][0-9][a-eghj-npr-tv-z][0-9]%', @Temp),6)

    End

    -- Else look for a numeric string and try to pass back a valid US Zip Code

    Else

    Begin

    -- Make a string for comparison

    declare @numstr VarChar(1000)

    declare @numindex int

    set @numindex = 0

    Set @numstr = '[0-9]'

    -- Loop through the string to find a contiguous numeric string

    While PatIndex('%' + replicate(@numstr, @numindex + 3 )+ '%', @Temp) > 0

    Begin

    set @numindex = @numindex + 1

    End

    -- If there was a numeric string that was 3 digits or longer then

    if @numindex > 0

    Begin

    -- First set the @Temp variable so that it contains the result of the numeric search

    set @Temp = substring(@Temp, PatIndex('%' + replicate(@numstr, @numindex + 2 )+ '%', @Temp),@numindex +2)

    -- Then reset the @NumIndex so we can use it again (do this step after @temp reset)

    set @numindex = 1

    -- Only look at digits that taken as a whole are greater than 501 (digits trimmed of left zeros)

    if cast(@temp as int) >= 501

    Begin

    -- Enter a loop until we find a digit string that is greater or equal to 501 and the number cannot be three characters in length

    while substring(@temp,@numindex,5) < 501

    set @numindex = @numindex + 1

    -- Make the return statement the first position that the first five is greater than 501, and take nine digits from that location

    set @retstmt = substring(@temp,@numindex,9)

    -- Enter a loop if the return statement is less than 9 digits long and just add leading or trailing zeros

    if len(@retstmt) < 5

    begin

    while len(@retstmt) < 5

    Set @retstmt = '0' + @retstmt

    end

    if len(@retstmt) > 5 and len(@retstmt) < 9

    begin

    while len(@retstmt) < 9

    Set @retstmt = @retstmt + '0'

    end

    End

    Else

    Begin

    set @retstmt = Null

    End

    End

    else

    begin

    set @retstmt = Null

    end

    End

    Return @RetStmt

    -- End of Function

    End

    GO

  • Before even looking at this, from the sounds of it, you are running this against a table of values, not a single value. Is this correct? If so, the first thing I would do is to move this entire process into either a Table Valued Function or a Stored Procedure to handle it as a set based operation rather than 1 record at a time, which will be very very very slow.

    If I am incorrect, let me know and I will review the Scalar Function you presented.

    Fraggle

  • Hello, thank you for replying! Yes we currently look at records one by on, because we receive inconsistent data files from many different sources, and these data files change their 'alignment', file type, and format often. So we have to rip the files into a single varchar staging table and then process and modify the data. We are redesigning our process of importing files, but for now this is the way it sits.

  • If I follow your description correctly, you are pulling data into a staging table, then running this on the staging table in order to get the postal code out of that.

    Can you provide the structure (create script) for the table, and an insert statement that puts some sample data in it? A couple of dozen rows should suffice for testing purposes.

    I think the whole thing can be massively simplified pretty easily, but I need sample data to test on.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • We import the disparate files into this table, row by row (with column headings from text files, xls, csv, etc)

    [Code]

    CREATE TABLE [dbo].[raw](

    [tableID] [int] IDENTITY(1,1) NOT NULL,

    [col001] [varchar](500) NULL,

    [col002] [varchar](500) NULL,

    [col003] [varchar](500) NULL,

    [col004] [varchar](500) NULL,

    [col005] [varchar](500) NULL,

    [col006] [varchar](500) NULL,

    [col007] [varchar](500) NULL,

    [fileName] [varchar](500) NULL,

    [loadDate] [varchar](50) NULL CONSTRAINT [DF_rawstage_loadDate_2] DEFAULT (getdate()),

    [uniqueid] [uniqueidentifier] NULL CONSTRAINT [DF_rawstage_uniqueid] DEFAULT (newid())

    ) ON [PRIMARY]

    [/Code]

    I have attached sample data in the file raw.txt (CSV)

    The data would get aligned and put into a table like this

    CREATE TABLE [dbo].[stage](

    [postal] [varchar](250) NULL,

    [postalShipTo] [varchar](50) NULL,

    [FileName] [varchar](500) NULL,

    [loadDate] [smalldatetime] NULL,

    [tableID] [int] IDENTITY(1,1) NOT NULL,

    [uniqueid] [uniqueidentifier] NULL

    ) ON [PRIMARY]

    The stage table then has other processes performed on it and eventually the data is moved into a final table that reports are generated from.

  • Okay, it looks like what you need is to parse out any rows where Col01, Col02, etc., have a postal code in them. It can be any one of those columns, and it'll be by itself in the column. Is that correct?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Thank you for you reply. Yes we do have a process to parse out the locations of postal codes (which we also are refining) but anything that looks like a postal code is caught and then is 'cleaned'. If the thing that looks like a postal code is not a valid US or CDN code, we set it to NULL and move on. Where the code is missing or only one of bill and ship is there we leave it null or copy the existing to the null respectively.

    Edit, the way we get the files, sometimes there is other garbage in the field, and where there is more that one piece of data, like address and postal code in one field we set it to null. So in other words most of the time postal code will be in only one field sometimes with extraneous characters. I.E. Zip 8950 shown as 8950USA.

  • Okay.

    8950 isn't a valid Zip code, but I get the idea of what you're saying.

    Are there potentially fields with numeric data in them that might mimic a Zip code, but not be one? Like a street address that starts with a 5-digit number, e.g.,: "98112 E 45th Ave, Podunk, TX". 98112 isn't a Zip code in this case, it's a street number, but it also happens to be a valid Zip code in Seattle, WA.

    Is it possible for data to have that kind of thing in it?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Yes it is very possible for an address to be in the Zip field, but it does not happen all that often. The majority of the data is valid, around 99%. For those cases we just would want to take the number, as we have both the bill and ship to use.

    The more likely scenario is for the data to just contain something in it that the person entering felt like putting in that day like 78155IHAVEACAT, or 78144USA to say that it is in the USA. Sometimes the data does have phone numbers as well, so for that we take whatever looks like a code and press.

    At the moment we consider the US as a whole and therefore do not do much validation on if the actual code maps out to a correct address. When the left most character in the PCode we use is a number we consider it USA. I suppose we could then try to validate the postal against the address, city and state/province, we could have this running as a separate process after the postal is stripped out, perhaps as a process running on the stage table. In which we would parse out the Zip/PCode to its separate parts and compare it with a database of cities and states. But for now its not required

    The way I count a valid US ZIP is that it is 3-9 digits in length, and is >= 501. The Canadian PCode is much simpler, however it is possible to have a code that is in the valid format but does not exist.

    Normally if one of the bill or ship Pcodes is Canadian we would favor that one.

  • Okay.

    Here's one way that might make this more efficient:

    IF OBJECT_ID(N'tempdb..#raw') IS NOT NULL

    DROP TABLE #raw ;

    CREATE TABLE #raw

    ([tableID] [int] NOT NULL,

    [col001] [varchar](500) NULL,

    [col002] [varchar](500) NULL,

    [col003] [varchar](500) NULL,

    [col004] [varchar](500) NULL,

    [col005] [varchar](500) NULL,

    [col006] [varchar](500) NULL,

    [col007] [varchar](500) NULL,

    [fileName] [varchar](500) NULL,

    [loadDate] [varchar](50) NULL

    CONSTRAINT [DF_rawstage_loadDate_2] DEFAULT (GETDATE()),

    [uniqueid] [uniqueidentifier] NULL

    CONSTRAINT [DF_rawstage_uniqueid] DEFAULT (NEWID())) ;

    INSERT INTO #raw

    (tableID, col001, col002, col003, col004, col005, col006, col007, fileName, loadDate, uniqueid)

    VALUES (1, 'Bill To Customer Postal Code', 'Ship To Customer Name', 'Prod Name', NULL, NULL, NULL, NULL,

    '201112 File1.txt', 'Jan 25 2012 12:34PM', '61e9b0c9-4cb2-4048-a7af-c945fb1d107d'),

    (2, 'V5J 5K3', 'ABC Company', 'Monitor', NULL, NULL, NULL, NULL, '201112 File1.txt', 'Jan 25 2012 12:34PM',

    '1f2e8f4f-2c02-41e0-b4ef-8c710a084192'),

    (3, '88451', 'AB1 Company', 'Monitor', NULL, NULL, NULL, NULL, '201112 File1.txt', 'Jan 25 2012 12:34PM',

    'efa29c24-6799-4d40-a500-e5b0aac68a70'),

    (3591, '1233prodcode', 'manfcode12356', '299.68', 'ON', '8845USA', 'L4W 1E1', NULL, '20111201File2.txt',

    'Jan 25 2012 12:38PM', 'c5aaab9d-e7fd-4053-8ca5-f473caef3604'),

    (3592, '1233prodcode', 'manfcode12356', '599.36', 'ON', '78458', 'M4G 4B5', NULL, '20111201File2.txt',

    'Jan 25 2012 12:38PM', '8d03f757-161a-449c-aa20-4697a593f093'),

    (3593, '1233prodcode', 'manfcode12356', '299.68', 'ON', 'L4B-4W3', 'L6M 2R7 {End of line Skype Highlighting}',

    NULL, '20111201File2.txt', 'Jan 25 2012 12:38PM', '2a4543a0-b532-46c8-9748-705c486edbe5'),

    (1175, 'Qty.', 'Unit price', 'Postal code', NULL, NULL, NULL, NULL, '20111201File3.txt', 'Jan 25 2012 12:36PM',

    'f8682b0a-90ab-4cb2-8385-f1aac0b93f0b'),

    (1176, '1', '89.99', 'V2S 3T2', NULL, NULL, NULL, NULL, '20111201File3.txt', 'Jan 25 2012 12:36PM',

    '9314f670-badc-4a55-9144-7cb4bef6befa'),

    (1177, '1', '93.99', 'V5H 2B1', NULL, NULL, NULL, NULL, '20111201File3.txt', 'Jan 25 2012 12:36PM',

    '1a0cf2fa-1404-41ad-b38e-abda9c835133'),

    (4432, 'Sold-to Province', 'Sold-to Code Postal', 'Take Slm Div', 'Invoice #', 'Invoice Date', 'Category',

    'Ship-to Code Postal', '20111201File4.txt', 'Jan 25 2012 12:39PM', '6b1b0d6e-48e7-4f3e-8c1a-2c8c9e0fa7ad'),

    (4433, 'ON', 'L1W3K1', '61-3', '3302254', '51097', 'M', 'Z1W3K1', '20111201File4.txt', 'Jan 25 2012 12:39PM',

    '8389dccd-4607-48b3-b522-da98123ea8b8'),

    (4434, 'PQ', 'G5C1C8', '92-2', '3302323', '51097', 'M', 'G5C1C8', '20111201File4.txt', 'Jan 25 2012 12:39PM',

    '26bd6258-a525-4839-b675-d4eb0d5a6615'),

    (4435, 'TX', 'G5C1C8', '101-2', '3302670', '51097', 'M', 'M1R2T5', '20111201File4.txt', 'Jan 25 2012 12:39PM',

    '1817c7bf-0535-4038-aef4-c75fe2accb4b') ;

    SELECT uniqueid,

    PossibleZip

    FROM #raw

    CROSS APPLY (SELECT COALESCE(col001, '') + COALESCE(col002, '') + COALESCE(col003, '') + COALESCE(col004, '')

    + COALESCE(col005, '') + COALESCE(col006, '') + COALESCE(col007, '') AS Conc) AS Conc

    OUTER APPLY (SELECT SUBSTRING(Conc, Number, 5) AS PossibleZip

    FROM dbo.Numbers

    WHERE SUBSTRING(Conc, Number, 5) LIKE '[0-9][0-9][0-9][0-9][0-9]'

    UNION ALL

    SELECT SUBSTRING(Conc, Number, 10) AS PossibleZip

    FROM dbo.Numbers

    WHERE SUBSTRING(Conc, Number, 10) LIKE '[0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'

    UNION ALL

    SELECT SUBSTRING(Conc, Number, 7) AS PossibleZip

    FROM dbo.Numbers

    WHERE SUBSTRING(Conc, Number, 7) LIKE '[a-eghj-npr-tvxy][0-9][a-eghj-npr-tv-z][ [-]][0-9][a-eghj-npr-tv-z][0-9]'

    UNION ALL

    SELECT SUBSTRING(Conc, Number, 6) AS PossibleZip

    FROM dbo.Numbers

    WHERE SUBSTRING(Conc, Number, 6) LIKE '[a-eghj-npr-tvxy][0-9][a-eghj-npr-tv-z][0-9][a-eghj-npr-tv-z][0-9]')

    AS Zips ;

    The Zips sub-query just has a list of patterns that you would consider valid for a postal code. You can add "and not" statements to any of the Where clauses to exclude patterns that are a negative (false-positive) match.

    You can add more patterns to it easily. Just follow the pattern of the queries in there.

    Does that help?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • GSquared, thanks for the reply! I am testing your implementation, it is a much different concept than what I was using and I think it has solved a few other problems too!

    One other question the dbo.Numbers table, do I just need the 0-9 numbers for the purposes of this query?

  • Sorry about that. I have a Numbers table that includes integer values from 0-10,000. Comes in handy for dozens of uses.

    You probably only need a limited scope for this particular solution, but I recommend building one with a larger scope for future uses.

    Should have mentioned that in the reply.

    Edit: To work on this solution, you'll need numbers from 0 to the length of the longest row in your staging table.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply