Clean international Phone Numbers

  • I have about 6300 phone numbers from around the world that I need to clean up before loading into SQL Server. I have gotten to the point where I have removed non-numeric characters and other junk data, but now I am somewhat stuck. Some of the numbers that were entered in the US have the country code (+xx or just xx). Some have the international prefix for dialing an international number from that country or the US (011). We would like to store just the in-country number, for example in the US that would be 10 digits, xxxxxxxxxx. We are planning on using masking, stored in our countries table, to display the number appropriately on the UI.

    Here is what I think would work, but I can't get the implementation down. In a script, load the international prefix and if the first n characters match it, remove them, and the same thing for the country code.

    Anyone know how to load a datatable in a script component set as a transform?

  • Is it required to store the country codes as well? If not, you could store the right most 10 digits, that will give the phone number, right? By the way, this is under the assumption that all countries have 10 digit phone numbers 🙂

    --------------------------------------------------------------------------------------
    Save our world, its all we have! A must watch video Pale Blue Dot[/url]

  • Not all countries have 10 character phone numbers. Some are more, some are less.

  • These clean up jobs are always tricky.

    I think you're doing the right thing already, but I'd go even further with it. The first part is trying to standardise the input as much as you can so you can set up rules downstream. e.g. In a derived column, use an expression to replace your '+' with 00, remove any brackets and spaces and perhaps replace any single leading zeros with 00 if the length of the string is greater than say 10. Now you should have just a string with a list of numbers. Hopefully you have trapped all the possible variations and standardised them in this step.

    I would then split the rows (conditional split) using this derived column into international numbers (00 prefix) and national numbers. Further work is going to be required on the international ones. What you could do is create a reference table for your international dialling codes and use a look up to return the codes to the pipeline based on say the first 3 characters of your string after the leading 00. If a record is found, then use an expression to cut those left most characters out of your string.

    Following that if you have any remaining, you should be able to identify a particular rule that applies to these so you can include them in the cleanup too.

    Finally when you insert these into the destination, I'd want to keep the countryid as a reference even if its not part of the telephone number column as you never know when you might want to use it.

    Hope this helps.

    Kindest Regards,

    Frank Bazan

  • Frank,

    Thanks for the input. I actually think I have it figured out. I am doing something similar to what you are suggesting. In my destination system I have a countries table where I am storing the international prefix and the countty code. I have added another Source and I am using a Merge Join to get it all into 1 dataset. Then in a script component I am removing all the non-numeric characters and stripping the extension when it exists. Then I just have the "clean" phone number. Then I do a compare on the country code like this:

    If strPhone <> String.Empty Then

    strCode = strPhone.Substring(0, CodeLength)

    If strCode = Code Then

    strPhone = Phone.Substring(CodeLength)

    End If

    End If

    And yes, I do need the check for empty string as there are some phone numbers that, once purged of non-numeric characters are empty.

    I'll post how the final results turn out.

    As you said the cleanup is tricky because the source system did absolutely no checking for phone number validity. I had names, email addresses, and other "data" in the phone number:hehe:

  • The solution in my last post appears to be working just like I want it to. Might not be the prettiest code in the world but it is getting me the data I think I want.

  • Oh... You really need to double check and make sure, Jack. As you probably know, Country Codes can be 1, 2, or 3 digits long... what you might not know (depending on what you're doing) is that International calls can also have a "city" code of 0 to 4 digits. Unless your lookup tables are spot on can you start looking for perfect matches starting at the 7th digit and working your way back, you can have a hell of a mess on your hands. And, don't forget that calls to Canada, Guam, Marianas, Puerto Rico, and other US Territories etc, are all under the North American Numbering Plan Association (NANPA) control and can all be reached with 1+NPA+NXX dialing.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for the input Jeff. Needless to say I learned a lot about international phone numbers recently. I am pretty confident I have the right codes. I know I have the right country codes, the problem with the international dialing prefix is that the majority of the data was entered in the US so they have the 011 US/North America international prefix. The spot check I did looks pretty good, but it will take the users to verify it.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply