Technical Article

Identify Bad Characters in a Table Column

,

This script uses a recursive CTE to identify non-allowed characters from a column in a table.  In this case, the PATINDEX is set up to consider alphanumerics as the only valid characters.

You may need to set the MAX RECURSION limit if you think there will be > 100 invalid characters in any particular string.

I have tested this script using a 100,000 row test harness on 100 character strings salted with about 20-25% non-alphanumeric characters and it ran in about 28-45 seconds.  1,000,000 rows ran in about 5-6 minutes.  Variability was probably caused by random invalid characters being assigned to each string.

[Note to editor] Sorry but I couldn't figure out how to get my script into the field for it on this page. 

CREATE TABLE #t (strings VARCHAR(100));

CREATE TABLE #b (bad CHAR(1));

DECLARE @allowed VARCHAR(30)

SET @allowed = '%[^a-zA-Z0-9]%' -- Pattern to allow only alphanumerics

DECLARE @notallowed CHAR(1)

SET @notallowed = '*' -- Any non-allowed character

INSERT INTO #t (strings)

SELECT 'AABA#24$2@'

UNION ALL SELECT 'AAcc/24)2('

UNION ALL SELECT 'ABC123'

UNION ALL SELECT 'AA%(&_#24$2@'

UNION ALL SELECT 'AA+JDLASD4)2('

UNION ALL SELECT 'ABC123'

UNION ALL SELECT ''

UNION ALL SELECT '&'

UNION ALL SELECT '****'

-- Identify the bad characters

;WITH BadStrings (strings) AS (

SELECT strings+@notallowed 

FROM #t

WHERE strings LIKE @allowed

)

,Parser AS (

SELECT SUBSTRING(strings, PATINDEX(@allowed,strings), 1) as bad

,SUBSTRING(strings, 1+PATINDEX(@allowed,strings),LEN(strings)) as rest

FROM BadStrings

UNION ALL

SELECT SUBSTRING(rest, PATINDEX(@allowed,rest), 1) 

,SUBSTRING(rest, 1+PATINDEX(@allowed,rest), LEN(rest))

FROM Parser WHERE LEN(rest) > 0 and 

SUBSTRING(rest, 1+PATINDEX(@allowed,rest), LEN(rest)) <> ''

)

SELECT DISTINCT bad 

FROM Parser b

DROP TABLE #t

DROP TABLE #b

Rate

3.75 (4)

Share

Share

Rate

3.75 (4)