Tally table ASCII validator - Anyone ever did this?

Question

Post reply

Tally table ASCII validator - Anyone ever did this?

J-F Bergeron

SSCrazy Eights

Points: 8593
More actions
February 18, 2010 at 7:55 am

#386335

Hi all,
I'm trying to validate a database data, as we found some unknown characters in the descriptions the vendors send us. I'm trying to use a tally table, to find the unknown characters (under ascii(32) or over ascii(151)).
This is what I came up with, but even in this small example, it takes a long time to execute, so I must be doing something silly... The table I will check afterwards with this code contains about 1 million description, in french and English.
Anyone care to take a look?
DECLARE @test-2 TABLE ( value VARCHAR(50) )
INSERT INTO @test-2
( value
)
SELECT 'This is a test, no errors'
UNION ALL
SELECT 'ÿ ascii(152) must be trapped'
UNION ALL
SELECT '▼Unknown 31 ascii character must be trapped'
UNION ALL
SELECT 'ö ascii(148) must not be trapped' ;
WITH cteTally
AS ( SELECT row_number() OVER ( ORDER BY o.ID ) AS N
FROM master.sys.sysobjects o
CROSS JOIN master.sys.sysobjects o2
),
ctedesc
AS ( SELECT ASCII(SUBSTRING(t.value, n, 1)) AS [ascii] ,
SUBSTRING(t.value, n, 1) AS Ch
FROM @test-2 t
INNER JOIN cteTally ON n <= LEN(t.value)
)
SELECT *
FROM ctedesc
WHERE [ascii] < 32 -- Anything under 32 must be trapped
OR [ascii] > 151 -- Anything over 151 must be trapped

If any of you already done something like this, please give me some pointers,
Thanks in advance,
Cheers,
J-F

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Lowell SSC Guru Points: 323504 More actions · Answer 1

J-F, to find non-compliant data with high ascii, i think you can just use this:

select * from @test-2 where value like '%[^''a-Z .,-]%'

i think you'd use a Tally table if you were stripping out the offending chars;

here's some test code and a Tally function i just slapped together:

select * from @test-2 where value like '%[^''a-Z .,-]%'

select * from @test-2 where value like '%[^' + CHAR(128) + '-' + CHAR(255) + ']%'

select dbo.StripHighAscii(value) from @test-2

with a tally table: this version strips out any non-expected characters also though, not just high ascii, but it's obvious what you can change:

ALTER FUNCTION StripHighAscii(@OriginalText VARCHAR(max))

RETURNS VARCHAR(8000)

BEGIN

DECLARE @CleanedText VARCHAR(max)

;WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows

E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows

E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows

E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows

E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need

Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E16)

SELECT @CleanedText = ISNULL(@CleanedText,'') +

CASE

WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) IN (9,10,13) THEN SUBSTRING(@OriginalText,Tally.N,1) --tab,lf,cr

WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 32 AND 127 THEN SUBSTRING(@OriginalText,Tally.N,1)

WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) <= 32 THEN ''

WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) >= 128 THEN ''

END

FROM Tally WHERE Tally.N <= LEN(@OriginalText)

RETURN @CleanedText

END

another saved StripHighAscii for reference:

create function [dbo].[StripHighAscii](@originaltext varchar(8000))

returns varchar(8000)

begin

declare @i int

set @i=127

while @i < 255

begin

SET @originaltext = replace(@originaltext,CHAR(@i),'')

SET @i=@i +1

end

return @originaltext

end

Lowell

--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

J-F Bergeron SSCrazy Eights Points: 8593 More actions · Answer 2

Nice Lowell, very interesting stripping function you got there, thanks for sharing.

I'm not sure about the "like" you have used, I use like a lot, but never used it to generate some kind of "normal expression" like you did. I will have to look more closely as what characters I need to use to ensure my data is correct.

The thing is, yes I want to eventually strip the unwanted characters, but I need to identify them first, and having the whole string alltogether sometimes makes it hard to see characters that are not even visible to me (i.e. char(160)). We had a weird behavior when sending the information, and we need to detect them.

I will look at the normal expression you can create with the like operator, it is probably going to help.

Thanks again,

Cheers,

J-F

J-F Bergeron SSCrazy Eights Points: 8593 More actions · Answer 3

I have a weird behavior when using your stripping function. Seems the weird triangle in the third example translates to a question mark... I'm not sure of that behavior, maybe the character translates to 2 bytes?

Cheers,

J-F

matzer SSCrazy Eights Points: 9727 More actions · Answer 4

This is what I came up with, but even in this small example, it takes a long time to execute, so I must be doing something silly...

The reason that it's so slow is because you are generating a very large tally table (~3 million rows) on the fly. You only need a tally table with numbers up to 50 (the length of the value column).

Use TOP 50 to limit the size of the tally table. No need for a CROSS JOIN also.

...

WITH cteTally

AS ( SELECT TOP 50 row_number() OVER ( ORDER BY o.ID ) AS N

FROM master.sys.sysobjects o

),

...

Performance goes from 13sec to 0sec on my machine.

J-F Bergeron SSCrazy Eights Points: 8593 More actions · Answer 5

Thanks Peter, it makes sense that limiting the number of rows in the tally table makes it faster. I'm just surprised the "Where len(value) <= tally.N" did not make it faster, but I guess this is ok.

By the way, thanks for the help all, I've been able to identify the weird characters in the table, and we'll choose which character we do not want anymore.

Thanks again,

Cheers,

J-F

Jeff Moden SSC Guru Points: 1003853 More actions · Answer 6

J-F Bergeron (2/18/2010)
Thanks Peter, it makes sense that limiting the number of rows in the tally table makes it faster. I'm just surprised the "Where len(value) <= tally.N" did not make it faster, but I guess this is ok.
By the way, thanks for the help all, I've been able to identify the weird characters in the table, and we'll choose which character we do not want anymore.
Thanks again,

Lowell's code does that correctly.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

matzer SSCrazy Eights Points: 9727 More actions · Answer 7

Jeff Moden (2/18/2010)
...
Lowell's code does that correctly.

Well, not exactly. The second version of StripHighAscii (the one with the loop) can give you very odd results depending on the collation being used. I use collation Latin1_general_CI_AS and

select dbo.StripHighAscii('a123²³') give me 'a' as a result, because the replace function does a case insensitive comparision. This can be fixed by changing the collation of the search character

...

SET @originaltext = replace(@originaltext,CHAR(@i) collate Latin1_General_BIN,'')

...

Check out the following script to see what happens when using a certain collation

with Tally as

(

select top 256

number - 1 N

from

master..spt_values

where

type = 'P'

)

select

t1.N, char(t1.N), t2.N, char(t2.N)

from

Tally t1

cross join

Tally t2

where

t1.N >= 128

and t2.N < 128

and char(t1.N) = char(t2.N)

Peter

matzer SSCrazy Eights Points: 9727 More actions · Answer 8

BTW, the same holds true for the LIKE method

declare @test-2 table(value varchar(50))

insert into @test-2(value) values('a123²³')

select * from @test-2 where value like '%[^''0-9a-Z .,-]%'

select * from @test-2 where value like '%[^''0-9a-Z .,-]%' collate Latin1_general_BIN

Peter

matzer SSCrazy Eights Points: 9727 More actions · Answer 9

matzer

SSCrazy Eights

Points: 9727

February 19, 2010 at 3:46 am

#1121690

REMOVED: false statement

Paul White SSC Guru Points: 150467 More actions · Answer 10

My take on Lowell's good contribution earlier.

Lowell, I hope you don't mind 😎

ALTER FUNCTION dbo.StripHighAscii(@OriginalText VARCHAR(MAX))

RETURNS TABLE

WITH SCHEMABINDING

AS

RETURN

(

WITH

L0 AS (SELECT 1 AS C UNION ALL SELECT 1),

L1 AS (SELECT 1 AS C FROM L0 AS A CROSS JOIN L0 AS B),

L2 AS (SELECT 1 AS C FROM L1 AS A CROSS JOIN L1 AS B),

L3 AS (SELECT 1 AS C FROM L2 AS A CROSS JOIN L2 AS B),

L4 AS (SELECT 1 AS C FROM L3 AS A CROSS JOIN L3 AS B),

L5 AS (SELECT 1 AS C FROM L4 AS A CROSS JOIN L4 AS B),

Numbers

AS (

-- Just the sequence numbers we need, in the right order

SELECT TOP (DATALENGTH(@OriginalText))

ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS N

FROM L5

ORDER BY

N

),

Cleaned(cleaned_text)

AS (

SELECT CASE

WHEN ASCII(SUBSTRING(@OriginalText COLLATE LATIN1_GENERAL_BIN, N, 1)) BETWEEN 32 AND 127 THEN SUBSTRING(@OriginalText COLLATE LATIN1_GENERAL_BIN, N, 1)

WHEN ASCII(SUBSTRING(@OriginalText COLLATE LATIN1_GENERAL_BIN, N, 1)) IN (9, 10, 13) THEN SUBSTRING(@OriginalText COLLATE LATIN1_GENERAL_BIN, N, 1) --TAB,LF,CR

ELSE ''

END

FROM Numbers

FOR XML PATH (''), TYPE

)

SELECT cleaned_text = cleaned_text.value('.[1]', 'VARCHAR(MAX)')

FROM Cleaned

)

GO

DECLARE @ToClean

TABLE (data VARCHAR(MAX) NOT NULL);

-- Test data

INSERT @ToClean (data) VALUES ('This is a test, no errors');

INSERT @ToClean (data) VALUES ('ÿ ascii(152) must be trapped');

INSERT @ToClean (data) VALUES (CHAR(31) + 'Unknown 31 ascii character must be trapped');

INSERT @ToClean (data) VALUES ('ö ascii(148) must not be trapped');

INSERT @ToClean (data) VALUES ('a123²³');

-- Cleaning

SELECT SHA.cleaned_text

FROM @ToClean

CROSS

APPLY dbo.StripHighAscii([@ToClean].data) SHA;

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi