Phone number question - detecting and replacing

  • Hello,

    I have a website/database app with a text column that stores a user's bio where users are not allowed to embed their phone number (e.g. similar to a dating website). I need to write a query that detects phone numbers and replaces them with XXXs.

    Thanks,

    Matt

    For example, I have written a query that detects records with phone numbers in (xxx) xxx-xxxx format, but the replacing is tough. Maybe RegExp?

    select userid, bio

    from users

    where bio like ('%([0-9][0-9][0-9])%')

  • In order to help we will need a few things:

    1. Sample DDL in the form of CREATE TABLE statements

    2. Sample data in the form of INSERT INTO statements

    3. Expected results based on the sample data

    Please take a few minutes and read the first article in my signature for best practices when posting questions.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • I respect your wish to have structured forum dialog, but it appears that, for this relatively simple question, helpful suggestions are possible without INSERT INTO and CREATE TABLE samples.

  • I'd recommend this be done in the client app, not in SQL. SQL's string handling is very weak compared to what the .net framework has.

    Also consider what happens when people use other formats. Where I live, the common format for a phone number (cell number mostly) has no brackets. It would be xyz xyz xyza or +ccyz xyz xyza. Then people leave the spaces out, or mistype one bracket or.....

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • matt6749 (9/23/2013)


    I respect your wish to have structured forum dialog, but it appears that, for this relatively simple question, helpful suggestions are possible without INSERT INTO and CREATE TABLE samples.

    I'd have to agree with Sean here. If you take the time to construct some sample data, you'll need to think through the kind of input (and mistakes) you're going to be up against.

    Then someone can help you with the clean and STUFF.

    On the other hand, if your data is all clean and good (BWAHAHAHAHA!) this can work:

    WITH PhoneNumbers (MyString) AS

    (

    SELECT 'phone numbers in (456) 555-2323 format'

    UNION ALL SELECT 'Hi! My name is Mimi. For a good time call (123) 555-1212'

    )

    SELECT a.MyString, NewString=STUFF(c.MyString, PATINDEX('%[0-9][0-9][0-9][0-9]%', c.MyString), 4, 'XXXX')

    FROM PhoneNumbers a

    CROSS APPLY (SELECT STUFF(a.MyString, PATINDEX('%[0-9][0-9][0-9]%', a.MyString), 3, 'XXX')) b (MyString)

    CROSS APPLY (SELECT STUFF(b.MyString, PATINDEX('%[0-9][0-9][0-9]%', b.MyString), 3, 'XXX')) c (MyString)


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Thanks guys! Especially Dwain - funny as well!

  • Hi. I would personally take a different tack.

    1. I'd write a string-split UDF something like this: http://stackoverflow.com/questions/5968839/sql-server-2008-t-sql-udf-split-tailoring. This is a table valued UDF that returns the string split on the delimiter, with ORDINAL and STRINGVALUE as the columns.

    2. I would use RBAR and a cursor because of the complexity. After all, all of these are phone numbers: 3125551212, 13125551212, 1-312-555-1212, +1(800) 798-2234, (900) 478-TSQL, 333/444-5555, and other innumerable factors. Also you want to catch things like 'My number is 1.3,1;2\2|3%5^2@4#1=3' -- once people figure out that you filter just one or a few variations, they will try whatever works to evade your algorithm.

    2a. I would read the row and split the text field into a table variable using the above function, delimited on ' '. I would replace all characters that are not alphanumeric or space with nothing. Then test each stringvalue in the table variable, a virtual "word", for a length of 10-12 and if it is numeric and also if it begins with 3 or more numeric characters.

    2b. If you find that, delete the row from the table variable or alternative update that row to read '[CENSORED]' of whatever your user community would call for.

    2c. Finally I would use COALESCE to reassemble the string, putting a space after each StringValue retrieved, and update the table row with the coalesced stringvalue.

    I cannot stress enough that this is a better task for a CLR Procedure (blecch) in a trigger (blecch again) or a .NET program.

    Thanks

    John.

  • You have a nice plan and good ideas, but I would really suggest a different approach on the splitter. Check the following link to improve the performance of your strings splits.

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (9/24/2013)


    You have a nice plan and good ideas, but I would really suggest a different approach on the splitter. Check the following link to improve the performance of your strings splits.

    http://www.sqlservercentral.com/articles/Tally+Table/72993/

    After a quick read, I agree that your suggestion would speed things up.

    Thanks!

    John

  • dwain.c (9/23/2013)


    matt6749 (9/23/2013)


    I respect your wish to have structured forum dialog, but it appears that, for this relatively simple question, helpful suggestions are possible without INSERT INTO and CREATE TABLE samples.

    I'd have to agree with Sean here. If you take the time to construct some sample data, you'll need to think through the kind of input (and mistakes) you're going to be up against.

    Then someone can help you with the clean and STUFF.

    On the other hand, if your data is all clean and good (BWAHAHAHAHA!) this can work:

    WITH PhoneNumbers (MyString) AS

    (

    SELECT 'phone numbers in (456) 555-2323 format'

    UNION ALL SELECT 'Hi! My name is Mimi. For a good time call (123) 555-1212'

    )

    SELECT a.MyString, NewString=STUFF(c.MyString, PATINDEX('%[0-9][0-9][0-9][0-9]%', c.MyString), 4, 'XXXX')

    FROM PhoneNumbers a

    CROSS APPLY (SELECT STUFF(a.MyString, PATINDEX('%[0-9][0-9][0-9]%', a.MyString), 3, 'XXX')) b (MyString)

    CROSS APPLY (SELECT STUFF(b.MyString, PATINDEX('%[0-9][0-9][0-9]%', b.MyString), 3, 'XXX')) c (MyString)

    This is such an excellent solution Dwain! Brilliant! I was trying to solve this awhile back with PatternSplitCM but got stuck getting it to work for various phone numbers formats simultaneously.

    I made one small change for handling records without phone numbers:

    WITH PhoneNumbers (MyString) AS

    (

    SELECT 'phone numbers in (456) 555-2323 format'

    UNION ALL

    SELECT 'Hi! I''m Mimi. For a good time call (123) 555-1212'

    UNION ALL

    SELECT 'I''m important information too ;)'

    )

    SELECTa.MyString,

    NewString=

    ISNULL(STUFF(c.MyString, PATINDEX('%[0-9][0-9][0-9][0-9]%', c.MyString), 4, 'XXXX'),a.MyString)

    FROM PhoneNumbers a

    CROSS APPLY (SELECT STUFF(a.MyString, PATINDEX('%[0-9][0-9][0-9]%', a.MyString), 3, 'XXX')) b (MyString)

    CROSS APPLY (SELECT STUFF(b.MyString, PATINDEX('%[0-9][0-9][0-9]%', b.MyString), 3, 'XXX')) c (MyString)

    edit: code="sql" tag in the wrong place

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Excellent point Alan - a case that I clearly overlooked and a simple solution to it as well.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • matt6749 (9/23/2013)


    Hello,

    I have a website/database app with a text column that stores a user's bio where users are not allowed to embed their phone number (e.g. similar to a dating website). I need to write a query that detects phone numbers and replaces them with XXXs.

    Thanks,

    Matt

    For example, I have written a query that detects records with phone numbers in (xxx) xxx-xxxx format, but the replacing is tough. Maybe RegExp?

    select userid, bio

    from users

    where bio like ('%([0-9][0-9][0-9])%')

    Utilizing a CLR that uses the .NET regex.split method you could do this:

    --The solution

    DECLARE @x nvarchar(1000)=

    N'This is my number: (312)555-0000 or (312) 555-1111... You can also '+

    'reach me at 312.555.2133 or 222 333 4444 or 555-333-1111 or 5551234567! '+

    ' in case you were wondering my userid is 555333-b000 and password is 123456';

    WITH

    pat AS (

    SELECTSequence as ItemNumber,

    Token as Item, IsValid AS [Matched]

    FROM mdq.RegexSplit(

    @x,

    '(.{0,1}\d{3}.{0,2}\d{3}.{0,1}\d{4})',

    '.{0,1}\d{3}.{0,2}\d{3}.{0,1}\d{4}', 0))

    SELECT

    CAST((

    SELECTCASE [matched]

    WHEN 0 THEN Item

    ELSE REPLACE(REPLACE(REPLACE(REPLACE

    (REPLACE(REPLACE(REPLACE(REPLACE

    (REPLACE(REPLACE(Item,1,'X'),2,'X')

    ,3,'X'),4,'X'),5,'X'),6,'X'),7,'X'),

    8,'X'),9,'X'),0,'X')

    END

    FROM pat

    ORDER BY ItemNumber

    FOR XML PATH('')) AS varchar(1000))

    GO

    Which produces this:

    This is my number: (XXX)XXX-XXXX or (XXX) XXX-XXXX... You can also reach me at XXX.XXX.XXXX or XXX XXX XXXX or XXX-XXX-XXXX or XXXXXXXXXX! in case you were wondering my userid is 555333-b000 and password is 123456

    Edit: added a couple more phone number formats

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Thanks Alan! You rock.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply