Querying Postcodes gracefully

  • Recently I've been getting an influx of requests asking me if I can find clients on our system that have a postcode found in a provided list (normally an Excel Document, however, could be our own). Although this is a fairly "simple" question, considering the volume of data that is sometimes provided, or range of customer tables I have to check, this can sometimes be quite the slow process from a query end (it really doesn't run fast sometimes).

    The main problem, at least I believe, is down to the formatting of a UK postcode. They SHOULD be in one of the following formats:
    X0 0XX
    XX0 0XX
    X00 0XX
    XX00 0XX
    X0X 0XX
    XX0X 0XX

    The problem is, both in our customer tables and often the data I'm supplied, the space could be missing. So, say I'm provided the postcode AB12 3CD. I would need to check for AB12 3CD and AB123CD. However, because I don't know how reliable the source data is, the postcode could have been supplied in the format AB123CD (which would also need to match AB12 3CD and AB123CD).

    Now, I'm all too aware that doing the following is just bad practice and slow:
    REPLACE(Customer.Postcode,' ','') = REPLACE(Data.Postcode,' ','')
    At the moment, I've therefore gone with something along the lines of
    :
    Customer.Postcode IN (Data.Postcode,REPLACE(Data.Postcode,' ',''),STUFF(Data.Postcode, LEN(Data.Postcode) - 2,0, ' ')))
    This, on the current work I did, had a significant improvement (5% cost compared to 95% cost in the same batch), as I could use the Index on the Customer Table, however, I still know the data on the right side of the JOIN isn't sargable.
    For the data I worked on today, that wasn't a massive problem (I only had 300 odd postcodes to compare to our Customer database). On the other hand I know all too well I'm going to get a bigger list at some point again.

    Now, yes I could set up a staging table, I have considered this, but some of data could be held within our own customer data tables, so I'll simply be comparing two tables (rather than an external source which I can manipulate). In the above SQL, this would mean i could use the index on the postcode for the left side of the join, but not the index on the right side, so could be quite slow. Thus my question; is there, apart from putting the data in a staging table (and reformatting it),  a more graceful way? If I can't trust either side of the data (ours or external), that means that I may need to stage both tables, which depending on the volume of data, could be a very costly process unto itself. I'm not sure what gains that would have (if any?).

    I can also imagine that many of you may well ask me something along the lines of "Well, why not fix your data in the database first, and then ensure that formatting is preserved". Unfortunately our database is effective a read only environment. The data all comes from what is likely a Flat File Mainframe system that replicates the data into a SQL database. That process is one way, updates on the mainframe go to SQL, however, updates to SQL do not go back to the Mainframe. Thus amended the data wouldn't help, as all those changes would be lost when either the record is updated, or the entire database is refreshed (which happens at least once a week).

    If I haven't been clear, or you have any questions let me know. I'm really looking for insight, rather than someone to do the work for me, and any pointers you might have.

    Thanks all!

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • My view on this is to create a computed column on the table in question which removes all the spaces and then stick an index on top of that column.  Then you can stage your excel sheet with spaces removed and join on the two space less post code columns.  Saves worrying about sargability, the only issue comes with if you can modify the schema on the SQL side without the mainframe having a fit when it tries to update the data.

  • What about using an indexed computed column and simply query without spaces? Obviously that takes more storage space, but it might work.

    EDIT: Seems like Anthony had the same idea and posted before I did. 🙂

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • anthony.green - Wednesday, May 3, 2017 9:23 AM

    the only issue comes with if you can modify the schema on the SQL side without the mainframe having a fit when it tries to update the data.

    Heh, unfortunately I would be very surprised if the server had a fit.

    @luis:
    A definite consideration. Fortunately, the client table is unlikely to be changed, so provided we do this after the BULK INSERT, it'll remain there. Some tables, however, we're not so lucky on. If it wants to update a column, the back end is sometimes more than happy to drop any Views and tables it needs to and then recreate them and bulk insert the data back in. Yeah, really helpful when you did have custom indexes on that table and especially as it's a LIVE database...

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Would something like this work for you?

    WITH
        cte_PostalVariations AS (
            SELECT DISTINCT    -- only use DISTINCT if there is a possibility of dupes in the source request.
                PCV = LTRIM(STUFF(pc.PostalCode, sl.StuffLocal, 0, ' '))
            FROM
                ( VALUES
                    ('AB12 3CD'), ('AB123CD'), ('AB213CD'), ('1234567'), ('98 12345')    -- copied from source of request.
                ) pc (PostalCode)
                CROSS APPLY (VALUES (REPLACE(pc.PostalCode, ' ', '')) ) r (PostalCode)
                CROSS APPLY ( VALUES (1), (3), (4), (5)) sl (StuffLocal)
            )
    SELECT
        c.*
    FROM
        dbo.Customers c
        JOIN cte_PostalVariations pv
            ON c.PostalCode = pv.PCV;

    Just a thought...

  • Change your table to never store the space in the post code column (you can use a trigger to do this, at least until you change the app; you can add a tinyint column to indicate the original location of a space if you need to).

    Then you can simply remove all spaces from search values before doing the search.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • ScottPletcher - Wednesday, May 3, 2017 4:05 PM

    Change your table to never store the space in the post code column (you can use a trigger to do this, at least until you change the app; you can add a tinyint column to indicate the original location of a space if you need to).

    Then you can simply remove all spaces from search values before doing the search.

    This, unfortunately, would not work, as the tables are dropped when a BULK INSERT is performed (weekly). Indexes, triggers, etc, are all created after data insertion. Plus, not sure I like the idea of the database not reflecting the application front end/Mainframe back end.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Okay Thom, if the load process drops and rebuilds tables when you fire your BULK INSERT, how about doing an ALTER TABLE to add the persisted computed column after you load it?  I'm not sure if this will fit into your load process or not.

    Other than that, I'd say to go with a staging table.  I don't load data without first loading it to a staging table for cleaning and verification.  Then you can have your persisted computed column and index it to your heart's content.

  • First attempt at a post here, so please bear with me. I can't claim to have written this, but have adapted for use elsewhere, but one of our devs created a UD function to format an incoming postcode.

    The function is
    ALTER FUNCTION [dbo].[uf_FormatPostcode]
        
        
        (@Postcode VARCHAR(30))

        RETURNS VARCHAR(30)
    AS
    BEGIN

        DECLARE @Return VARCHAR(30)
        
        SELECT @Return =
      CASE WHEN LEN(LTRIM(RTRIM(@Postcode))) BETWEEN 5 AND 7
             THEN
                CASE WHEN
                    PATINDEX('[a-z][1-9][0-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1 OR
                    PATINDEX('[a-z][1-9][0-9][0-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1 OR
                    PATINDEX('[a-z][a-z][1-9][0-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1 OR
                    PATINDEX('[a-z][a-z][1-9][0-9][0-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1 OR
                    PATINDEX('[a-z][1-9][a-z][0-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1 OR
                    PATINDEX('[a-z][a-z][1-9][a-z][0-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1 OR
                    PATINDEX('[a-z][1-9][1-9][a-z][a-z]',LTRIM(RTRIM(@Postcode))) = 1
                THEN
                    LEFT(LTRIM(RTRIM(@Postcode)),LEN(LTRIM(RTRIM(@Postcode)))-3) + ' ' + RIGHT(LTRIM(RTRIM(@Postcode)),3)
                ELSE
                    @Postcode
                END
            ELSE
                @Postcode
      END

        
        RETURN @Return

    END

    I insert the initial table into a #table with the postcode correctly formatted and apply an index - and then use this as the matching table.

    Regards
    Mike

  • Thom,

    I'd suggest to follow the advice from Scott.

    Except, may be, the insistence on removing spaces.

    You actually do not need to that, but you must make sure you bring all post codes coming from all possible sources to *consistent* formatting.

    Whatever string is entered into the "search by post code" field you need to format it first according to the rule applied on the mainframe, and only after that use it in a query.

    _____________
    Code for TallyGenerator

  • To aid searching, I would look at left justifying the area and right justifying the district.
    You will probably need to take into account partial postcodes and mix ups between 0(zero)/capital o and 1(one)/capital i.
    Something like the following should get you started:

    CREATE FUNCTION dbo.SearchPostcode
    (
     @Postcode varchar(8)
    )
    RETURNS TABLE
    AS
    RETURN
    (
      WITH Patterns
      AS
      (
       SELECT *
       FROM
       (
        VALUES ('[A-Z][0-9][0-9][A-Z][A-Z]', 1, 2, 1, 3)
          ,('[A-Z][A-Z][0-9][0-9][A-Z][A-Z]', 2, 3, 1, 3)
          ,('[A-Z][0-9][0-9][0-9][A-Z][A-Z]', 1, 2, 2, 3)
          ,('[A-Z][A-Z][0-9][0-9][0-9][A-Z][A-Z]', 2, 3, 2, 3)
          ,('[A-Z][0-9][A-Z][0-9][A-Z][A-Z]', 1, 2, 1, 4)
          ,('[A-Z][A-Z][0-9][A-Z][0-9][A-Z][A-Z]', 2, 3, 1, 4)
       ) P (Pattern, AreaLength, DistrictStart, DistrictLength, Rest)
      )
      SELECT LEFT(LEFT(X.PostCode, P.AreaLength) + '_', 2)
       + RIGHT('_' + SUBSTRING(X.PostCode, P.DistrictStart, P.DistrictLength), 2)
       + RIGHT('_' + RIGHT(X.PostCode, P.Rest), 4) AS SearchCode
      FROM Patterns P
       CROSS JOIN (SELECT REPLACE(@Postcode, ' ', '')) X (PostCode)
      WHERE X.Postcode LIKE P.Pattern
    );

    You should then be able to check the results with something like:

    SELECT *
    FROM YourTable Y
        CROSS APPLY (SELECT SearchCode FROM dbo.SearchPostcode(Y.Postcode)) X;

  • Thom A - Wednesday, May 3, 2017 4:39 PM

    ScottPletcher - Wednesday, May 3, 2017 4:05 PM

    Change your table to never store the space in the post code column (you can use a trigger to do this, at least until you change the app; you can add a tinyint column to indicate the original location of a space if you need to).

    Then you can simply remove all spaces from search values before doing the search.

    This, unfortunately, would not work, as the tables are dropped when a BULK INSERT is performed (weekly). Indexes, triggers, etc, are all created after data insertion. Plus, not sure I like the idea of the database not reflecting the application front end/Mainframe back end.

    Formatting characters are generally left out of database data, as they just waste space.  For example, as  you know, date/datetime does not internally store any of the formatting chars.  U.S. SSNs should be stored as 000000000 not 000-00-0000.

    Equally importantly, the code then becomes a very simple column = REPLACE(input_value, ' ', '') comparison, easy to write and fully sargable.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Thanks guys for all the replies. I'm experimenting with the indexed computed column, as this is probably the most viable option for me. What I'm trying to overcome now is getting that to appear in the views we have (which are much more end friendly, for reasons I'm not going to dive into). That, however, is definitely a "feature" of the 3rd party software we use, but I'm sure I'll (eventually), find a solution.

    I have, on a side note, also built a function that splits the postcode into the possible permutations, using my own and some ideas others have given me above, which works fairly well as well. It's nice to now have a couple of options and use the other when one seems to perform slower than I'd like. 🙂

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • ScottPletcher - Thursday, May 4, 2017 7:51 AM

    Formatting characters are generally left out of database data, as they just waste space.  For example, as  you know, date/datetime does not internally store any of the formatting chars.  U.S. SSNs should be stored as 000000000 not 000-00-0000.

    Equally importantly, the code then becomes a very simple column = REPLACE(input_value, ' ', '') comparison, easy to write and fully sargable.

    Although I agree, this would be easier to store the postcodes without the sapce, and put it as a formatting process on the front end, this isn't something I have control over. Third party application, third party database.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

Viewing 14 posts - 1 through 13 (of 13 total)

You must be logged in to reply to this topic. Login to reply