Identify postcode from multiple fields

  • Hello,

    I have a table with five fields used to hold address data. Due to poor practices the users have been able to enter any data in any field. They have now asked me to find the post/zip code from each record for extracting to another application. Any tips on how I can identify the post/zip code and extract to a separate table?

    At the moment I have a VB6 application that uses regular expressions to perform this role, but I'm interested to see if I can now do this in T-SQL as I'd like to drop the VB6 app, or at a minimum replace it with something .NET based.

    I'm running an instance of SQL Server 2008 R2 to host the db.

  • mike.dinnis (6/12/2013)


    Hello,

    I have a table with five fields used to hold address data. Due to poor practices the users have been able to enter any data in any field. They have now asked me to find the post/zip code from each record for extracting to another application. Any tips on how I can identify the post/zip code and extract to a separate table?

    At the moment I have a VB6 application that uses regular expressions to perform this role, but I'm interested to see if I can now do this in T-SQL as I'd like to drop the VB6 app, or at a minimum replace it with something .NET based.

    I'm running an instance of SQL Server 2008 R2 to host the db.

    You could probably do this as a CLR proc. Just have to convert your VB6 to C#/VB.NET.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 โ€“ Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • SQL Server does support regular expressions: http://msdn.microsoft.com/en-us/library/ms174214.aspx.

    The syntax is pretty straightforward.

    You could also do it in .NET, create an assembly in SQL Server that points to the .NET DLL and then create wrapper functions for the exposed methods in your DLL.

  • There are plenty of articles out there on getting started doing it .NET. Here's one: http://msdn.microsoft.com/en-us/magazine/cc163473.aspx.

    Depending on how you write it, this approach can be very efficient.

    After you get your DLL created, you'll need to create an assembly, make sure the CLR is enabled and then create wrapper functions for your methods. Here's an example:

    CREATE ASSEMBLY RegExFunctions

    FROM 'D:\path_name\SQLRegExp.dll';

    GO

    sp_configure 'clr enabled', 1;

    RECONFIGURE;

    go

    CREATE FUNCTION RegExIsMatch(@strPattern NVarchar(2000),

    @strString NVarchar(4000),

    @Options Integer) RETURNS Bit

    AS EXTERNAL NAME

    RegExFunctions.[namespace_name.class_name].RegExpIsMatch;

    go

    Hope this helps.

  • mike.dinnis (6/12/2013)


    Hello,

    I have a table with five fields used to hold address data. Due to poor practices the users have been able to enter any data in any field. They have now asked me to find the post/zip code from each record for extracting to another application. Any tips on how I can identify the post/zip code and extract to a separate table?

    At the moment I have a VB6 application that uses regular expressions to perform this role, but I'm interested to see if I can now do this in T-SQL as I'd like to drop the VB6 app, or at a minimum replace it with something .NET based.

    I'm running an instance of SQL Server 2008 R2 to host the db.

    You have a couple options:

    Option 1

    You don't need Regular Expressions to extract this data, SQL is powerful enough to extract zip/postal codes. You can use Dwain Champ's Pattern Splitter[/url] for this. Let me know if you want me to post an example.

    Option 2

    A fast and painless CLR alternative would be to use one of the Regular Expression CLR functions that ships with Master Data Services. They utilize the .NET flavor of Regex. I have written a couple articles about these. They are fast and easy to setup. Using RegexIsMatch, for example, the code would look like this:

    DECLARE @zip varchar(20)='02115-4653', --sample data

    @valid_zip varchar(30)='^[0-9]{5,5}([- ]?[0-9]{4,4})?$'; --regex for a valid zip code

    SELECT @zip AS zip,

    mdq.RegexIsMatch('02115-4653','^[0-9]{5,5}([- ]?[0-9]{4,4})?$',0) AS is_valid;

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Some useful suggestions there, thank you.

    I had started to consider SQLCLR as probably the correct route to go, but at that time I hadn't appreciated that SQL supported regular expressions natively. Because of the apparent complexity (not demonstrated in the OP) I think that a straight forward SP using native regex functions is not the way to go. I also think I need to investigate Alan's ideas of pattern splitters and existing RegEx functions before making a decision.

  • Ed Wagner (6/12/2013)


    There are plenty of articles out there on getting started doing it .NET. Here's one: http://msdn.microsoft.com/en-us/magazine/cc163473.aspx.

    Depending on how you write it, this approach can be very efficient.

    After you get your DLL created, you'll need to create an assembly, make sure the CLR is enabled and then create wrapper functions for your methods. Here's an example:

    CREATE ASSEMBLY RegExFunctions

    FROM 'D:\path_name\SQLRegExp.dll';

    GO

    sp_configure 'clr enabled', 1;

    RECONFIGURE;

    go

    CREATE FUNCTION RegExIsMatch(@strPattern NVarchar(2000),

    @strString NVarchar(4000),

    @Options Integer) RETURNS Bit

    AS EXTERNAL NAME

    RegExFunctions.[namespace_name.class_name].RegExpIsMatch;

    go

    Hope this helps.

    Just some food for thought: If you use create the Microsoft.MasterDataServices.DataQuality Assembly you get a whole 8 Regex functions, string metric functions (Levenshtein, LCS, etc), xmltransform and other good stuff with just one assembly.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Here's a pure tsql way which performs reasonably well:

    SELECT *

    FROM (

    SELECT Postcode = 'BS16 1EJ' UNION ALL

    SELECT 'OX8 1TP' UNION ALL

    SELECT 'GL20 8LZ' UNION ALL

    SELECT 'JUNK'

    ) d

    CROSS APPLY ( -- pp

    SELECT IsPostcode = CASE

    WHEN PATINDEX('%[A-Z][0-9] [0-9][A-Z][A-Z]%',d.Postcode) > 0 THEN 1

    WHEN PATINDEX('%[A-Z][0-9][0-9] [0-9][A-Z][A-Z]%',d.Postcode) > 0 THEN 1

    WHEN PATINDEX('%[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]%',d.Postcode) > 0 THEN 1

    WHEN PATINDEX('%[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]%',d.Postcode) > 0 THEN 1

    WHEN PATINDEX('%[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]%',d.Postcode) > 0 THEN 1

    WHEN PATINDEX('%[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]%',d.Postcode) > 0 THEN 1

    ELSE 0 END

    ) pp

    โ€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.โ€ - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thank you, Alan. I hadn't looked into that and it is good to know.

    Truth be told, I would do it using SQL's regular expressions like Chris did. I've done it before, albeit not in the particular case where I had to decide what field to pick based on the result.

  • ChrisM@Work (6/12/2013)


    Here's a pure tsql way which performs reasonably well:

    It only performs reasonably well in the UK ๐Ÿ˜‰

    The SQL Guy @ blogspot[/url]

    @SeanPearceSQL

    About Me[/url]

  • I'm not sure how fashionable this method is around here, but you could create a look up table of zip codes, and match your columns to it.

    If you're expecting only US zips:

    http://federalgovernmentzipcodes.us/

    The primary location file should be fine.

  • erikd (6/13/2013)


    I'm not sure how fashionable this method is around here, but you could create a look up table of zip codes, and match your columns to it.

    If you're expecting only US zips:

    http://federalgovernmentzipcodes.us/

    The primary location file should be fine.

    It doesn't matter if it's fashionable or not, if it works for you, it works. It works for me with UK postcodes and is incredibly useful for address validation. Good point ๐Ÿ˜Ž

    โ€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.โ€ - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Thank you all for your suggestions and examples.

    I went for the compiled C# code in the end. Partially because I am comfortable with this, partially because it lent itself to additional steps a little more but mostly because I discovered that a colleague had already written it in a consumable format.

    Thanks,

    Mike

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply