Tally table ASCII validator - Anyone ever did this?

  • Hi all,

    I'm trying to validate a database data, as we found some unknown characters in the descriptions the vendors send us. I'm trying to use a tally table, to find the unknown characters (under ascii(32) or over ascii(151)).

    This is what I came up with, but even in this small example, it takes a long time to execute, so I must be doing something silly... The table I will check afterwards with this code contains about 1 million description, in french and English.

    Anyone care to take a look?

    DECLARE @test-2 TABLE ( value VARCHAR(50) )

    INSERT INTO @test-2

    ( value

    )

    SELECT 'This is a test, no errors'

    UNION ALL

    SELECT 'ÿ ascii(152) must be trapped'

    UNION ALL

    SELECT '▼Unknown 31 ascii character must be trapped'

    UNION ALL

    SELECT 'ö ascii(148) must not be trapped' ;

    WITH cteTally

    AS ( SELECT row_number() OVER ( ORDER BY o.ID ) AS N

    FROM master.sys.sysobjects o

    CROSS JOIN master.sys.sysobjects o2

    ),

    ctedesc

    AS ( SELECT ASCII(SUBSTRING(t.value, n, 1)) AS [ascii] ,

    SUBSTRING(t.value, n, 1) AS Ch

    FROM @test-2 t

    INNER JOIN cteTally ON n <= LEN(t.value)

    )

    SELECT *

    FROM ctedesc

    WHERE [ascii] < 32 -- Anything under 32 must be trapped

    OR [ascii] > 151 -- Anything over 151 must be trapped

    If any of you already done something like this, please give me some pointers,

    Thanks in advance,

    Cheers,

    J-F

  • J-F, to find non-compliant data with high ascii, i think you can just use this:

    select * from @test-2 where value like '%[^''a-Z .,-]%'

    i think you'd use a Tally table if you were stripping out the offending chars;

    here's some test code and a Tally function i just slapped together:

    select * from @test-2 where value like '%[^''a-Z .,-]%'

    select * from @test-2 where value like '%[^' + CHAR(128) + '-' + CHAR(255) + ']%'

    select dbo.StripHighAscii(value) from @test-2

    with a tally table: this version strips out any non-expected characters also though, not just high ascii, but it's obvious what you can change:

    ALTER FUNCTION StripHighAscii(@OriginalText VARCHAR(max))

    RETURNS VARCHAR(8000)

    BEGIN

    DECLARE @CleanedText VARCHAR(max)

    ;WITH E01(N) AS (SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1 UNION ALL

    SELECT 1 UNION ALL SELECT 1), -- 10 or 10E01 rows

    E02(N) AS (SELECT 1 FROM E01 a, E01 b), -- 100 or 10E02 rows

    E04(N) AS (SELECT 1 FROM E02 a, E02 b), -- 10,000 or 10E04 rows

    E08(N) AS (SELECT 1 FROM E04 a, E04 b), --100,000,000 or 10E08 rows

    E16(N) AS (SELECT 1 FROM E08 a, E08 b), --10E16 or more rows than you'll EVER need

    Tally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E16)

    SELECT @CleanedText = ISNULL(@CleanedText,'') +

    CASE

    WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) IN (9,10,13) THEN SUBSTRING(@OriginalText,Tally.N,1) --tab,lf,cr

    WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 32 AND 127 THEN SUBSTRING(@OriginalText,Tally.N,1)

    WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) <= 32 THEN ''

    WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) >= 128 THEN ''

    END

    FROM Tally WHERE Tally.N <= LEN(@OriginalText)

    RETURN @CleanedText

    END

    another saved StripHighAscii for reference:

    create function [dbo].[StripHighAscii](@originaltext varchar(8000))

    returns varchar(8000)

    begin

    declare @i int

    set @i=127

    while @i < 255

    begin

    SET @originaltext = replace(@originaltext,CHAR(@i),'')

    SET @i=@i +1

    end

    return @originaltext

    end

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Nice Lowell, very interesting stripping function you got there, thanks for sharing.

    I'm not sure about the "like" you have used, I use like a lot, but never used it to generate some kind of "normal expression" like you did. I will have to look more closely as what characters I need to use to ensure my data is correct.

    The thing is, yes I want to eventually strip the unwanted characters, but I need to identify them first, and having the whole string alltogether sometimes makes it hard to see characters that are not even visible to me (i.e. char(160)). We had a weird behavior when sending the information, and we need to detect them.

    I will look at the normal expression you can create with the like operator, it is probably going to help.

    Thanks again,

    Cheers,

    J-F

  • I have a weird behavior when using your stripping function. Seems the weird triangle in the third example translates to a question mark... I'm not sure of that behavior, maybe the character translates to 2 bytes?

    Cheers,

    J-F

  • This is what I came up with, but even in this small example, it takes a long time to execute, so I must be doing something silly...

    The reason that it's so slow is because you are generating a very large tally table (~3 million rows) on the fly. You only need a tally table with numbers up to 50 (the length of the value column).

    Use TOP 50 to limit the size of the tally table. No need for a CROSS JOIN also.

    ...

    WITH cteTally

    AS ( SELECT TOP 50 row_number() OVER ( ORDER BY o.ID ) AS N

    FROM master.sys.sysobjects o

    ),

    ...

    Performance goes from 13sec to 0sec on my machine.

  • Thanks Peter, it makes sense that limiting the number of rows in the tally table makes it faster. I'm just surprised the "Where len(value) <= tally.N" did not make it faster, but I guess this is ok.

    By the way, thanks for the help all, I've been able to identify the weird characters in the table, and we'll choose which character we do not want anymore.

    Thanks again,

    Cheers,

    J-F

  • J-F Bergeron (2/18/2010)


    Thanks Peter, it makes sense that limiting the number of rows in the tally table makes it faster. I'm just surprised the "Where len(value) <= tally.N" did not make it faster, but I guess this is ok.

    By the way, thanks for the help all, I've been able to identify the weird characters in the table, and we'll choose which character we do not want anymore.

    Thanks again,

    Lowell's code does that correctly.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (2/18/2010)


    ...


    Lowell's code does that correctly.

    Well, not exactly. The second version of StripHighAscii (the one with the loop) can give you very odd results depending on the collation being used. I use collation Latin1_general_CI_AS and

    select dbo.StripHighAscii('a123²³') give me 'a' as a result, because the replace function does a case insensitive comparision. This can be fixed by changing the collation of the search character

    ...

    SET @originaltext = replace(@originaltext,CHAR(@i) collate Latin1_General_BIN,'')

    ...

    Check out the following script to see what happens when using a certain collation

    with Tally as

    (

    select top 256

    number - 1 N

    from

    master..spt_values

    where

    type = 'P'

    )

    select

    t1.N, char(t1.N), t2.N, char(t2.N)

    from

    Tally t1

    cross join

    Tally t2

    where

    t1.N >= 128

    and t2.N < 128

    and char(t1.N) = char(t2.N)

    Peter

  • BTW, the same holds true for the LIKE method

    declare @test-2 table(value varchar(50))

    insert into @test-2(value) values('a123²³')

    select * from @test-2 where value like '%[^''0-9a-Z .,-]%'

    select * from @test-2 where value like '%[^''0-9a-Z .,-]%' collate Latin1_general_BIN

    Peter

  • REMOVED: false statement

  • My take on Lowell's good contribution earlier.

    Lowell, I hope you don't mind 😎

    ALTER FUNCTION dbo.StripHighAscii(@OriginalText VARCHAR(MAX))

    RETURNS TABLE

    WITH SCHEMABINDING

    AS

    RETURN

    (

    WITH

    L0 AS (SELECT 1 AS C UNION ALL SELECT 1),

    L1 AS (SELECT 1 AS C FROM L0 AS A CROSS JOIN L0 AS B),

    L2 AS (SELECT 1 AS C FROM L1 AS A CROSS JOIN L1 AS B),

    L3 AS (SELECT 1 AS C FROM L2 AS A CROSS JOIN L2 AS B),

    L4 AS (SELECT 1 AS C FROM L3 AS A CROSS JOIN L3 AS B),

    L5 AS (SELECT 1 AS C FROM L4 AS A CROSS JOIN L4 AS B),

    Numbers

    AS (

    -- Just the sequence numbers we need, in the right order

    SELECT TOP (DATALENGTH(@OriginalText))

    ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS N

    FROM L5

    ORDER BY

    N

    ),

    Cleaned(cleaned_text)

    AS (

    SELECT CASE

    WHEN ASCII(SUBSTRING(@OriginalText COLLATE LATIN1_GENERAL_BIN, N, 1)) BETWEEN 32 AND 127 THEN SUBSTRING(@OriginalText COLLATE LATIN1_GENERAL_BIN, N, 1)

    WHEN ASCII(SUBSTRING(@OriginalText COLLATE LATIN1_GENERAL_BIN, N, 1)) IN (9, 10, 13) THEN SUBSTRING(@OriginalText COLLATE LATIN1_GENERAL_BIN, N, 1) --TAB,LF,CR

    ELSE ''

    END

    FROM Numbers

    FOR XML PATH (''), TYPE

    )

    SELECT cleaned_text = cleaned_text.value('.[1]', 'VARCHAR(MAX)')

    FROM Cleaned

    )

    GO

    DECLARE @ToClean

    TABLE (data VARCHAR(MAX) NOT NULL);

    -- Test data

    INSERT @ToClean (data) VALUES ('This is a test, no errors');

    INSERT @ToClean (data) VALUES ('ÿ ascii(152) must be trapped');

    INSERT @ToClean (data) VALUES (CHAR(31) + 'Unknown 31 ascii character must be trapped');

    INSERT @ToClean (data) VALUES ('ö ascii(148) must not be trapped');

    INSERT @ToClean (data) VALUES ('a123²³');

    -- Cleaning

    SELECT SHA.cleaned_text

    FROM @ToClean

    CROSS

    APPLY dbo.StripHighAscii([@ToClean].data) SHA;

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply