Is there a way to extract only credit card numbers from text

  • i have below sample data, i try to extract valid credit card data. But i was not able to do so ,since data is not in good format. Can any one help?

    below sample data have a transaction code which is also 16 numbers or more which should be considered as credit card number

    DECLARE @test-2 TABLE (memo varchar(400) );

    INSERT INTO @test-2

    SELECT 'abc this account used visa :1234567890123456' UNION ALL

    SELECT 'this account visa 1234567890123456 request for receipt' UNION ALL

    SELECT 'this account visa 123456789012345612/13 exp' UNION ALL

    SELECT 'invoice card used 4123 4567 8901 2345 some text' UNION ALL

    SELECT 'paid using 4123 4567 8901 2345 thank you' UNION ALL

    SELECT 'Transaction code 123456454636667445 copy invoice'UNION ALL

    SELECT 'mastercard cc#5987-6543-2109-8765. this order will be processed' UNION ALL

    SELECT 'mastercard 5987-6543-2109-8765, thank you' union all

    SELECT ' period 1999-10-18 - -2000-10-17 '; ;

    SELECT *

    FROM @test-2

    WHERE --master or visa 16

    patindex( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <>0

    or patindex( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9]%', memo) <>0

    or patindex( '%[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]%', memo) <>0

    --visa 13

    or patindex( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <>0

    -- amex 15

    or patindex( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <>0

    or patindex( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9][0-9][0-9] [0-9][0-9] [0-9][0-9][0-9]%', memo) <>0

    or patindex( '%[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9]%', memo) <>0

  • PATINDEX gives you the position of the start of the match. So just SUBSTRING them out. The sneaky part here is doing it in a single pass. Once again my favorite TSQL thingy comes into play - CASE.

    SELECT CASE WHEN PATINDEX( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <> 0 THEN

    SUBSTRING(memo, PATINDEX( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo), 16)

    WHEN patindex( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9]%', memo) <> 0 THEN

    SUBSTRING(memo, patindex( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9]%', memo), 19)

    --the rest are left to the inquisitive reader :)

    END AS CCNum

    FROM @test-2

    That code isn't quite complete but I will let you flesh it out. You will need to test the character after the last digit in the sequence (only) to see if it is a number in order to determine if it is a 16 VISA/MC or 15 AMEX. I don't know of a VISA that is just 13 but same thing applies there.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • This might give invalid card numbers, but it's worth a try.

    WITH ctePatterns AS(

    SELECT *

    FROM (VALUES

    ( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', 16)

    ,( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9]%', 19)

    ,( '%[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]%', 19)

    --visa 13

    ,( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', 13)

    -- amex 15

    ,( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', 15)

    ,( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9][0-9][0-9] [0-9][0-9] [0-9][0-9][0-9]%', 18)

    ,( '%[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9]%', 18) )p(Pattern, Length)

    )

    SELECT t.memo, SUBSTRING( t.memo, PATINDEX( p.Pattern, t.memo), p.Length)

    FROM @test-2 t

    CROSS APPLY (SELECT TOP 1 *

    FROM ctePatterns p

    WHERE t.memo LIKE p.Pattern

    ORDER BY Length DESC)p;

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Hey thanks for your reply, another issue i have is i dont want to transaction ID which is not a credit card number it has more than 16 numbers.

  • How can you differentiate it from the visa card that has the expiration date together?

    Here's another option that works with your sample data. The function is posted and explained in here: http://www.sqlservercentral.com/articles/String+Manipulation/94365/

    SELECT memo,

    CASE WHEN Item LIKE '%/__' THEN LEFT( Item, 16)

    ELSE Item END AS ccard

    FROM @test-2

    CROSS APPLY dbo.PatternSplitCM( REPLACE( REPLACE( memo, ' ', ''), '-', ''), '%[0-9/]%')x

    WHERE Matched = 1

    AND( LEN(Item) <= 16

    OR Item LIKE '%/__')

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • thanks luis for your reply i was testing your solution and found there are more scenarios in my bad data. the system i am working on SQL 2005 so change function as below using a tally table.

    DECLARE @StartTime DATETIME --Timer to measure total duration

    SET @StartTime = GETDATE() --Start the timer

    --=============================================================================

    -- Create and populate a Tally table

    --=============================================================================

    --===== Conditionally drop

    IF OBJECT_ID('dbo.Tally') IS NOT NULL

    DROP TABLE dbo.Tally

    --===== Create and populate the Tally table on the fly

    SELECT TOP 11000000 --equates to more than 30 years of dates

    IDENTITY(INT,1,1) AS N

    INTO dbo.Tally

    FROM Master.dbo.SysColumns sc1,

    Master.dbo.SysColumns sc2

    --===== Add a Primary Key to maximize performance

    ALTER TABLE dbo.Tally

    ADD CONSTRAINT PK_Tally_N

    PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

    --===== Let the public use it

    GRANT SELECT, REFERENCES ON dbo.Tally TO PUBLIC

    --===== Display the total duration

    SELECT STR(DATEDIFF(ms,@StartTime,GETDATE())) + ' Milliseconds duration'

    create FUNCTION [dbo].[PatternSplitCM]

    (

    @List VARCHAR(8000) = NULL

    ,@Pattern VARCHAR(50)

    ) RETURNS TABLE WITH SCHEMABINDING

    AS

    RETURN

    WITH numbers AS (

    SELECT TOP(ISNULL(DATALENGTH(@List), 0))

    n

    FROM dbo.Tally t

    WHERE N <= LEN(@List)+1)

    SELECT

    ItemNumber = ROW_NUMBER() OVER(ORDER BY MIN(n)),

    Item = SUBSTRING(@List,MIN(n),1+MAX(n)-MIN(n)),

    [Matched]

    FROM (

    SELECT n, y.[Matched], Grouper = n - ROW_NUMBER() OVER(ORDER BY y.[Matched],n)

    FROM numbers

    CROSS APPLY (

    SELECT [Matched] = CASE WHEN SUBSTRING(@List,n,1) LIKE @Pattern THEN 1 ELSE 0 END

    ) y

    ) d

    GROUP BY [Matched], Grouper

    As you were how do you know whether the number is a true credit card or not, i was reading one of the article in the forum where it is mentioned all visa cards will start with number sequence 4(http://www.sqlservercentral.com/Forums/Topic1437053-391-1.aspx). i was think adding this additional filter will retrieve correct results.

    If you look at the test data i added just before posting this thread where a dob is showing in the result because it matches with the pattern.

  • You could have also changed it like this and keep it as a function with zero reads.

    CREATE FUNCTION [dbo].[PatternSplitCM](@List [varchar](8000) = NULL, @Pattern [varchar](50))

    RETURNS TABLE WITH SCHEMABINDING

    AS

    RETURN

    WITH cterows AS(

    SELECT 0 n UNION ALL SELECT 0 n UNION ALL SELECT 0 n UNION ALL SELECT 0 n UNION ALL SELECT 0 n UNION ALL

    SELECT 0 n UNION ALL SELECT 0 n UNION ALL SELECT 0 n UNION ALL SELECT 0 n UNION ALL SELECT 0 n

    ),

    numbers AS (

    SELECT TOP(ISNULL(DATALENGTH(@List), 0))

    n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))

    FROM cterows d, cterows e, cterows f, cterows g)

    SELECT

    ItemNumber = ROW_NUMBER() OVER(ORDER BY MIN(n)),

    Item = SUBSTRING(@List,MIN(n),1+MAX(n)-MIN(n)),

    [Matched]

    FROM (

    SELECT n, y.[Matched], Grouper = n - ROW_NUMBER() OVER(ORDER BY y.[Matched],n)

    FROM numbers

    CROSS APPLY (

    SELECT [Matched] = CASE WHEN SUBSTRING(@List,n,1) LIKE @Pattern COLLATE Latin1_General_Bin THEN 1 ELSE 0 END

    ) y

    ) d

    GROUP BY [Matched], Grouper;

    It depends on what resource you want to use (RAM or disk)

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • mxy (12/1/2016)


    i have below sample data, i try to extract valid credit card data. But i was not able to do so ,since data is not in good format. Can any one help?

    below sample data have a transaction code which is also 16 numbers or more which should be considered as credit card number

    DECLARE @test-2 TABLE (memo varchar(400) );

    INSERT INTO @test-2

    SELECT 'abc this account used visa :1234567890123456' UNION ALL

    SELECT 'this account visa 1234567890123456 request for receipt' UNION ALL

    SELECT 'this account visa 123456789012345612/13 exp' UNION ALL

    SELECT 'invoice card used 4123 4567 8901 2345 some text' UNION ALL

    SELECT 'paid using 4123 4567 8901 2345 thank you' UNION ALL

    SELECT 'Transaction code 123456454636667445 copy invoice'UNION ALL

    SELECT 'mastercard cc#5987-6543-2109-8765. this order will be processed' UNION ALL

    SELECT 'mastercard 5987-6543-2109-8765, thank you' union all

    SELECT ' period 1999-10-18 - -2000-10-17 '; ;

    SELECT *

    FROM @test-2

    WHERE --master or visa 16

    patindex( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <>0

    or patindex( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9]%', memo) <>0

    or patindex( '%[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]%', memo) <>0

    --visa 13

    or patindex( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <>0

    -- amex 15

    or patindex( '%[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]%', memo) <>0

    or patindex( '%[0-9][0-9][0-9][0-9] [0-9][0-9][0-9][0-9][0-9][0-9] [0-9][0-9] [0-9][0-9][0-9]%', memo) <>0

    or patindex( '%[0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9]%', memo) <>0

    You might have a different problem all together... Last I knew, financial regulations ABSOLUTELY REQUIRE that credit card numbers are stored encrypted at ALL TIMES. Hopefully, the numbers posted are entirely made up samples, and the field in question is stored encrypted.

    As to the problem field, that's just a really incredibly bad idea, especially given current financial regulations. Having to try and parse that kind of data from some kind of memo field that might not get the appropriate protection is probably gross negligence, at the very least. At best, it's encrypted, but it's still a bad idea. Having to parse out that kind of information with no consistency to formatting is going to be prone to failure, because there just can't be an automated pattern that can't be fooled by bad data.

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • this is a legacy system i am working , trying to mask all the credit card data that is in plain text. But the challenge itself is the way data is stored i am not able to find a way to display only credit card data.

  • SQL might not be the best tool for this.

    You should look at regular expressions processing in another language, they are usually a lot more robust that what Patindex allows, and you can find a lot of samples online.

  • Manic Star (12/1/2016)


    SQL might not be the best tool for this.

    You should look at regular expressions processing in another language, they are usually a lot more robust that what Patindex allows, and you can find a lot of samples online.

    I'd be surprised if regex would accurately help much here.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • mxy (12/1/2016)


    this is a legacy system i am working , trying to mask all the credit card data that is in plain text. But the challenge itself is the way data is stored i am not able to find a way to display only credit card data.

    Pattern matching for 15 or 16 digits is just the start here (not to mention that it's also an incorrect assumption) and the pattern matching can be made simpler by changing all dashes and spaces to empty strings before you make the extraction attempt. Once you've extracted such digits from the fray of the rest of the text, you then have to test for several things to ensure that it's actually a credit card number. Those steps would be as follows...

    1. Extract the credit card type from the leading digits of the number you've extracted.

    2. Ensure that the credit card type is a currently active credit card type.

    3. Knowing the credit card type, validate the length of the number against all the possible length that credit card type can have.

    4. Perform a LUHN 10 Checksum validation and compare it against the check-digit of the credit card number, which is the last digit of the credit card type.

    Although unlikely, it will NOT guarantee that the transaction number will fail the credit card test above. You could use other hints in the text to do that. For example, if the credit card type digits indicate a Master Card and the word "Master" is in the text, that's a pretty strong indication that it's actually a credit card number that you've extracted. The appearance of "CC#" would be another strong indication of it being a credit card number.

    Yes, you could probably do all of this with RegEx but I suspect it will be a whole lot slower than what you might be able to do in T-SQL. Why do I say that? Because here's a list of the credit card type and length rules.

    https://en.wikipedia.org/wiki/Payment_card_number

    Here's an explanation of the LUHN 10 Check Sum.

    https://en.wikipedia.org/wiki/Luhn_algorithm

    A very important part of calculating LUHN 10 Checksums is the understanding that padding with leading zeros does NOT change the outcome. If you always left zero pad out to 20 digits, it will make calculating the LUHN 10 Check Digit a whole lot easier than doing a REVERSE and trying to calculate from the right. It will also guaranteed that all credit card numbers are handled exactly the same way regardless of length, which will help in areas such as parameter sniffing and single execution plan reuse.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • SQL CLR Regex would be the way I would go honestly. IIRC someone actually already built a regular expression system for SQL Server using that.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (12/1/2016)


    SQL CLR Regex would be the way I would go honestly. IIRC someone actually already built a regular expression system for SQL Server using that.

    A lot of folks say that but it's been my experience in many posts that measured performance of different methods that if it can be done using T-SQL, it will beat SQLCLR REGEX. As with any experience, there are always exceptions. I'll see if I can find a couple of the threads where such testing occurred. There was one really big one that I have in mind but don't have the URL for it handy.

    This is a DR/BCP weekend for me so I might not be able to get to this thread this weekend but I'll try.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • SQLsharp has a validation for credit cards. It will only validate a number and won't extract it from a string. If you want to try it, feel free to do it.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2

Viewing 15 posts - 1 through 15 (of 30 total)

You must be logged in to reply to this topic. Login to reply