SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Tally table ASCII validator - Anyone ever did this?


Tally table ASCII validator - Anyone ever did this?

Author
Message
J-F Bergeron
J-F Bergeron
SSCrazy
SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)

Group: General Forum Members
Points: 2079 Visits: 2707
Hi all,

I'm trying to validate a database data, as we found some unknown characters in the descriptions the vendors send us. I'm trying to use a tally table, to find the unknown characters (under ascii(32) or over ascii(151)).

This is what I came up with, but even in this small example, it takes a long time to execute, so I must be doing something silly... The table I will check afterwards with this code contains about 1 million description, in french and English.

Anyone care to take a look?

DECLARE @test TABLE ( value VARCHAR(50) )

INSERT INTO @test
( value
)
SELECT 'This is a test, no errors'
UNION ALL
SELECT 'ÿ ascii(152) must be trapped'
UNION ALL
SELECT '▼Unknown 31 ascii character must be trapped'
UNION ALL
SELECT 'ö ascii(148) must not be trapped' ;
WITH cteTally
AS ( SELECT row_number() OVER ( ORDER BY o.ID ) AS N
FROM master.sys.sysobjects o
CROSS JOIN master.sys.sysobjects o2
),
ctedesc
AS ( SELECT ASCII(SUBSTRING(t.value, n, 1)) AS [ascii] ,
SUBSTRING(t.value, n, 1) AS Ch
FROM @test t
INNER JOIN cteTally ON n <= LEN(t.value)
)
SELECT *
FROM ctedesc
WHERE [ascii] < 32 -- Anything under 32 must be trapped
OR [ascii] > 151 -- Anything over 151 must be trapped





If any of you already done something like this, please give me some pointers,

Thanks in advance,

Cheers,

J-F
Lowell
Lowell
SSC Guru
SSC Guru (70K reputation)SSC Guru (70K reputation)SSC Guru (70K reputation)SSC Guru (70K reputation)SSC Guru (70K reputation)SSC Guru (70K reputation)SSC Guru (70K reputation)SSC Guru (70K reputation)

Group: General Forum Members
Points: 70068 Visits: 40923
J-F, to find non-compliant data with high ascii, i think you can just use this:

select * from @test where value like '%[^''a-Z .,-]%'



i think you'd use a Tally table if you were stripping out the offending chars;

here's some test code and a Tally function i just slapped together:

select * from @test where value like '%[^''a-Z .,-]%'
select * from @test where value like '%[^' + CHAR(128) + '-' + CHAR(255) + ']%'
select dbo.StripHighAscii(value) from @test


with a tally table: this version strips out any non-expected characters also though, not just high ascii, but it's obvious what you can change:

ALTER FUNCTION StripHighAscii(@OriginalText VARCHAR(max))
RETURNS VARCHAR(8000)
BEGIN
DECLARE @CleanedText VARCHAR(max)
;WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows
E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows
E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows
E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows
E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need
Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E16)
SELECT @CleanedText = ISNULL(@CleanedText,'') +
CASE
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) IN (9,10,13) THEN SUBSTRING(@OriginalText,Tally.N,1) --tab,lf,cr
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 32 AND 127 THEN SUBSTRING(@OriginalText,Tally.N,1)
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) <= 32 THEN ''
WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) >= 128 THEN ''
END
FROM Tally WHERE Tally.N <= LEN(@OriginalText)

RETURN @CleanedText
END



another saved StripHighAscii for reference:

create function [dbo].[StripHighAscii](@originaltext varchar(8000))
returns varchar(8000)
begin

declare @i int
set @i=127
while @i < 255
begin
SET @originaltext = replace(@originaltext,CHAR(@i),'')
SET @i=@i +1
end
return @originaltext
end



Lowell
--help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!
J-F Bergeron
J-F Bergeron
SSCrazy
SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)

Group: General Forum Members
Points: 2079 Visits: 2707
Nice Lowell, very interesting stripping function you got there, thanks for sharing.

I'm not sure about the "like" you have used, I use like a lot, but never used it to generate some kind of "normal expression" like you did. I will have to look more closely as what characters I need to use to ensure my data is correct.

The thing is, yes I want to eventually strip the unwanted characters, but I need to identify them first, and having the whole string alltogether sometimes makes it hard to see characters that are not even visible to me (i.e. char(160)). We had a weird behavior when sending the information, and we need to detect them.

I will look at the normal expression you can create with the like operator, it is probably going to help.

Thanks again,

Cheers,

J-F
J-F Bergeron
J-F Bergeron
SSCrazy
SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)

Group: General Forum Members
Points: 2079 Visits: 2707
I have a weird behavior when using your stripping function. Seems the weird triangle in the third example translates to a question mark... I'm not sure of that behavior, maybe the character translates to 2 bytes?

Cheers,

J-F
Peter Brinkhaus
Peter Brinkhaus
Hall of Fame
Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)

Group: General Forum Members
Points: 3069 Visits: 7404
This is what I came up with, but even in this small example, it takes a long time to execute, so I must be doing something silly...


The reason that it's so slow is because you are generating a very large tally table (~3 million rows) on the fly. You only need a tally table with numbers up to 50 (the length of the value column).

Use TOP 50 to limit the size of the tally table. No need for a CROSS JOIN also.

...
WITH cteTally
AS ( SELECT TOP 50 row_number() OVER ( ORDER BY o.ID ) AS N
FROM master.sys.sysobjects o
),
...



Performance goes from 13sec to 0sec on my machine.
J-F Bergeron
J-F Bergeron
SSCrazy
SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)SSCrazy (2.1K reputation)

Group: General Forum Members
Points: 2079 Visits: 2707
Thanks Peter, it makes sense that limiting the number of rows in the tally table makes it faster. I'm just surprised the "Where len(value) <= tally.N" did not make it faster, but I guess this is ok.

By the way, thanks for the help all, I've been able to identify the weird characters in the table, and we'll choose which character we do not want anymore.

Thanks again,

Cheers,

J-F
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (210K reputation)SSC Guru (210K reputation)SSC Guru (210K reputation)SSC Guru (210K reputation)SSC Guru (210K reputation)SSC Guru (210K reputation)SSC Guru (210K reputation)SSC Guru (210K reputation)

Group: General Forum Members
Points: 210489 Visits: 41973
J-F Bergeron (2/18/2010)
Thanks Peter, it makes sense that limiting the number of rows in the tally table makes it faster. I'm just surprised the "Where len(value) <= tally.N" did not make it faster, but I guess this is ok.

By the way, thanks for the help all, I've been able to identify the weird characters in the table, and we'll choose which character we do not want anymore.

Thanks again,


Lowell's code does that correctly.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Peter Brinkhaus
Peter Brinkhaus
Hall of Fame
Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)

Group: General Forum Members
Points: 3069 Visits: 7404
Jeff Moden (2/18/2010)
...

Lowell's code does that correctly.

Well, not exactly. The second version of StripHighAscii (the one with the loop) can give you very odd results depending on the collation being used. I use collation Latin1_general_CI_AS and
select dbo.StripHighAscii('a123²³')

give me 'a' as a result, because the replace function does a case insensitive comparision. This can be fixed by changing the collation of the search character
...
SET @originaltext = replace(@originaltext,CHAR(@i) collate Latin1_General_BIN,'')
...



Check out the following script to see what happens when using a certain collation

with Tally as
(
select top 256
number - 1 N
from
master..spt_values
where
type = 'P'
)
select
t1.N, char(t1.N), t2.N, char(t2.N)
from
Tally t1
cross join
Tally t2
where
t1.N >= 128
and t2.N < 128
and char(t1.N) = char(t2.N)



Peter
Peter Brinkhaus
Peter Brinkhaus
Hall of Fame
Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)

Group: General Forum Members
Points: 3069 Visits: 7404
BTW, the same holds true for the LIKE method


declare @test table(value varchar(50))
insert into @test(value) values('a123²³')
select * from @test where value like '%[^''0-9a-Z .,-]%'
select * from @test where value like '%[^''0-9a-Z .,-]%' collate Latin1_general_BIN


Peter
Peter Brinkhaus
Peter Brinkhaus
Hall of Fame
Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)Hall of Fame (3.1K reputation)

Group: General Forum Members
Points: 3069 Visits: 7404
REMOVED: false statement
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search