Technical Article

Identify Bad Characters in a Table Column

,

This script uses a recursive CTE to identify non-allowed characters from a column in a table.  In this case, the PATINDEX is set up to consider alphanumerics as the only valid characters.

You may need to set the MAX RECURSION limit if you think there will be > 100 invalid characters in any particular string.

I have tested this script using a 100,000 row test harness on 100 character strings salted with about 20-25% non-alphanumeric characters and it ran in about 28-45 seconds.  1,000,000 rows ran in about 5-6 minutes.  Variability was probably caused by random invalid characters being assigned to each string.

[Note to editor] Sorry but I couldn't figure out how to get my script into the field for it on this page. 

CREATE TABLE #t (strings VARCHAR(100));

CREATE TABLE #b (bad CHAR(1));

DECLARE @allowed VARCHAR(30)

SET @allowed = '%[^a-zA-Z0-9]%' -- Pattern to allow only alphanumerics

DECLARE @notallowed CHAR(1)

SET @notallowed = '*' -- Any non-allowed character

INSERT INTO #t (strings)

SELECT 'AABA#24$2@'

UNION ALL SELECT 'AAcc/24)2('

UNION ALL SELECT 'ABC123'

UNION ALL SELECT 'AA%(&_#24$2@'

UNION ALL SELECT 'AA+JDLASD4)2('

UNION ALL SELECT 'ABC123'

UNION ALL SELECT ''

UNION ALL SELECT '&'

UNION ALL SELECT '****'

-- Identify the bad characters

;WITH BadStrings (strings) AS (

SELECT strings+@notallowed 

FROM #t

WHERE strings LIKE @allowed

)

,Parser AS (

SELECT SUBSTRING(strings, PATINDEX(@allowed,strings), 1) as bad

,SUBSTRING(strings, 1+PATINDEX(@allowed,strings),LEN(strings)) as rest

FROM BadStrings

UNION ALL

SELECT SUBSTRING(rest, PATINDEX(@allowed,rest), 1) 

,SUBSTRING(rest, 1+PATINDEX(@allowed,rest), LEN(rest))

FROM Parser WHERE LEN(rest) > 0 and 

SUBSTRING(rest, 1+PATINDEX(@allowed,rest), LEN(rest)) <> ''

)

SELECT DISTINCT bad 

FROM Parser b

DROP TABLE #t

DROP TABLE #b

Read 719 times
(2 in last 30 days)

Rate

3.75 (4)

You rated this post out of 5. Change rating

Share

Share

Rate

3.75 (4)

You rated this post out of 5. Change rating