Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 12»»

Tally table ASCII validator - Anyone ever did this? Expand / Collapse
Author
Message
Posted Thursday, February 18, 2010 7:55 AM
Right there with Babe

Right there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with Babe

Group: General Forum Members
Last Login: Thursday, September 19, 2013 6:00 AM
Points: 715, Visits: 2,706
Hi all,

I'm trying to validate a database data, as we found some unknown characters in the descriptions the vendors send us. I'm trying to use a tally table, to find the unknown characters (under ascii(32) or over ascii(151)).

This is what I came up with, but even in this small example, it takes a long time to execute, so I must be doing something silly... The table I will check afterwards with this code contains about 1 million description, in french and English.

Anyone care to take a look?

DECLARE @test TABLE ( value VARCHAR(50) )

INSERT INTO @test
( value
)
SELECT 'This is a test, no errors'
UNION ALL
SELECT 'ÿ ascii(152) must be trapped'
UNION ALL
SELECT '▼Unknown 31 ascii character must be trapped'
UNION ALL
SELECT 'ö ascii(148) must not be trapped' ;
WITH cteTally
AS ( SELECT row_number() OVER ( ORDER BY o.ID ) AS N
FROM master.sys.sysobjects o
CROSS JOIN master.sys.sysobjects o2
),
ctedesc
AS ( SELECT ASCII(SUBSTRING(t.value, n, 1)) AS [ascii] ,
SUBSTRING(t.value, n, 1) AS Ch
FROM @test t
INNER JOIN cteTally ON n <= LEN(t.value)
)
SELECT *
FROM ctedesc
WHERE [ascii] < 32 -- Anything under 32 must be trapped
OR [ascii] > 151 -- Anything over 151 must be trapped



If any of you already done something like this, please give me some pointers,

Thanks in advance,


Cheers,

J-F
Post #868056
Posted Thursday, February 18, 2010 8:24 AM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Today @ 6:53 AM
Points: 12,890, Visits: 31,851
J-F, to find non-compliant data with high ascii, i think you can just use this:
select * from @test where value like  '%[^''a-Z .,-]%'

i think you'd use a Tally table if you were stripping out the offending chars;

here's some test code and a Tally function i just slapped together:
select * from @test where value like  '%[^''a-Z .,-]%'
select * from @test where value like '%[^' + CHAR(128) + '-' + CHAR(255) + ']%'
select dbo.StripHighAscii(value) from @test

with a tally table: this version strips out any non-expected characters also though, not just high ascii, but it's obvious what you can change:
ALTER FUNCTION StripHighAscii(@OriginalText VARCHAR(max))
RETURNS VARCHAR(8000)
BEGIN
DECLARE @CleanedText VARCHAR(max)
;WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows
E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows
E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows
E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows
E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need
Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E16)
SELECT @CleanedText = ISNULL(@CleanedText,'') +
CASE
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) IN (9,10,13) THEN SUBSTRING(@OriginalText,Tally.N,1) --tab,lf,cr
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 32 AND 127 THEN SUBSTRING(@OriginalText,Tally.N,1)
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) <= 32 THEN ''
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) >= 128 THEN ''
END
FROM Tally WHERE Tally.N <= LEN(@OriginalText)

RETURN @CleanedText
END

another saved StripHighAscii for reference:
create function [dbo].[StripHighAscii](@originaltext varchar(8000))  
returns varchar(8000)
begin

declare @i int
set @i=127
while @i < 255
begin
SET @originaltext = replace(@originaltext,CHAR(@i),'')
SET @i=@i +1
end
return @originaltext
end



Lowell

--There is no spoon, and there's no default ORDER BY in sql server either.
Actually, Common Sense is so rare, it should be considered a Superpower. --my son
Post #868084
Posted Thursday, February 18, 2010 8:42 AM
Right there with Babe

Right there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with Babe

Group: General Forum Members
Last Login: Thursday, September 19, 2013 6:00 AM
Points: 715, Visits: 2,706
Nice Lowell, very interesting stripping function you got there, thanks for sharing.

I'm not sure about the "like" you have used, I use like a lot, but never used it to generate some kind of "normal expression" like you did. I will have to look more closely as what characters I need to use to ensure my data is correct.

The thing is, yes I want to eventually strip the unwanted characters, but I need to identify them first, and having the whole string alltogether sometimes makes it hard to see characters that are not even visible to me (i.e. char(160)). We had a weird behavior when sending the information, and we need to detect them.

I will look at the normal expression you can create with the like operator, it is probably going to help.

Thanks again,


Cheers,

J-F
Post #868094
Posted Thursday, February 18, 2010 8:47 AM
Right there with Babe

Right there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with Babe

Group: General Forum Members
Last Login: Thursday, September 19, 2013 6:00 AM
Points: 715, Visits: 2,706
I have a weird behavior when using your stripping function. Seems the weird triangle in the third example translates to a question mark... I'm not sure of that behavior, maybe the character translates to 2 bytes?

Cheers,

J-F
Post #868101
Posted Thursday, February 18, 2010 11:13 AM
SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Yesterday @ 5:24 PM
Points: 1,595, Visits: 6,615
This is what I came up with, but even in this small example, it takes a long time to execute, so I must be doing something silly...


The reason that it's so slow is because you are generating a very large tally table (~3 million rows) on the fly. You only need a tally table with numbers up to 50 (the length of the value column).

Use TOP 50 to limit the size of the tally table. No need for a CROSS JOIN also.
...
WITH cteTally
AS ( SELECT TOP 50 row_number() OVER ( ORDER BY o.ID ) AS N
FROM master.sys.sysobjects o
),
...

Performance goes from 13sec to 0sec on my machine.
Post #868314
Posted Thursday, February 18, 2010 2:01 PM
Right there with Babe

Right there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with BabeRight there with Babe

Group: General Forum Members
Last Login: Thursday, September 19, 2013 6:00 AM
Points: 715, Visits: 2,706
Thanks Peter, it makes sense that limiting the number of rows in the tally table makes it faster. I'm just surprised the "Where len(value) <= tally.N" did not make it faster, but I guess this is ok.

By the way, thanks for the help all, I've been able to identify the weird characters in the table, and we'll choose which character we do not want anymore.

Thanks again,


Cheers,

J-F
Post #868507
Posted Thursday, February 18, 2010 10:05 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 8:58 PM
Points: 36,794, Visits: 31,253
J-F Bergeron (2/18/2010)
Thanks Peter, it makes sense that limiting the number of rows in the tally table makes it faster. I'm just surprised the "Where len(value) <= tally.N" did not make it faster, but I guess this is ok.

By the way, thanks for the help all, I've been able to identify the weird characters in the table, and we'll choose which character we do not want anymore.

Thanks again,


Lowell's code does that correctly.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #868758
Posted Friday, February 19, 2010 3:26 AM
SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Yesterday @ 5:24 PM
Points: 1,595, Visits: 6,615
Jeff Moden (2/18/2010)
...

Lowell's code does that correctly.

Well, not exactly. The second version of StripHighAscii (the one with the loop) can give you very odd results depending on the collation being used. I use collation Latin1_general_CI_AS and
select dbo.StripHighAscii('a123²³')

give me 'a' as a result, because the replace function does a case insensitive comparision. This can be fixed by changing the collation of the search character
...
SET @originaltext = replace(@originaltext,CHAR(@i) collate Latin1_General_BIN,'')
...

Check out the following script to see what happens when using a certain collation
with Tally as
(
select top 256
number - 1 N
from
master..spt_values
where
type = 'P'
)
select
t1.N, char(t1.N), t2.N, char(t2.N)
from
Tally t1
cross join
Tally t2
where
t1.N >= 128
and t2.N < 128
and char(t1.N) = char(t2.N)

Peter
Post #868899
Posted Friday, February 19, 2010 3:35 AM
SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Yesterday @ 5:24 PM
Points: 1,595, Visits: 6,615
BTW, the same holds true for the LIKE method

declare @test table(value varchar(50))
insert into @test(value) values('a123²³')
select * from @test where value like '%[^''0-9a-Z .,-]%'
select * from @test where value like '%[^''0-9a-Z .,-]%' collate Latin1_general_BIN

Peter
Post #868905
Posted Friday, February 19, 2010 3:46 AM
SSCommitted

SSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommittedSSCommitted

Group: General Forum Members
Last Login: Yesterday @ 5:24 PM
Points: 1,595, Visits: 6,615
REMOVED: false statement
Post #868909
« Prev Topic | Next Topic »

Add to briefcase 12»»

Permissions Expand / Collapse