Replace Bad Characters in Table

  • Comments posted to this topic are about the item Replace Bad Characters in Table

  • I doubt, that this will be fast. It does not use an explicit cursor, but practically does the same (or something worser) with its while loop.

    If someone really needs this stuff, I'd suggest to wait for the SQL 2017 release and use the new TRANSLATE() function.

    God is real, unless declared integer.

  • I agree with franz, this will be slow. I´m not sure about the performance of the TRANSLATE function (MS has done terrible things with FORMAT), but here's a much faster version that will actually be set based instead of using RBAR. The code is also much shorter.

    WITH cteBook AS(
      SELECT *,
       STUFF(( SELECT SUBSTRING( TEXT, n, 1)
          FROM Tally
          WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
          AND n <= LEN(TEXT)
          FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)'), 1, 1, '') CleanText
      FROM #BOOK
    )
    UPDATE cteBook
    SET TEXT = CleanText;

    With 10,000 rows the method posted in the article takes several minutes. This code completes in less than one second. It can also be converted into an iTVF to ease its use.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Regardless of anything else, when doing string operations of this nature, there are two important best practices to follow:

    • ALWAYS use a binary Collation so that you match the characters you are expecting to match. Otherwise you could match unexpected characters, the set of which varies by Collation. For example, "a" will match other forms of "a" (with and without accents), when using an accent insensitive Collation (i.e. anything with _AI in the name). It is preferable to use a _BIN2 Collation rather than the deprecated _BIN Collations (as of SQL Server 2008), and it is preferable to use the most recent version of the Collation. For anyone using "Latin1_General_*" on SQL Server 2008 or newer, the preferred binary Collation is: Latin1_General_100_BIN2. Luis's code above correctly uses a binary Collation and so is more accurate than the code in the article.
    • It is best to use the NVARCHAR datatype as it is more inclusive and won't lead to data loss. This means also prefixing string literals with a capital "N".

    Take care, Solomon..

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • @Luis Cazares
    Your code is faulty. That STUFF is not needed and it causes at least the first letter omission from the text. This is how it should be:

    WITH cteBook AS(
    SELECT
        *,
      (
            SELECT SUBSTRING( TEXT, n, 1)
            FROM Tally
            WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
            AND n <= LEN(TEXT)
            FOR XML PATH(''), TYPE
        ).value('./text()[1]', 'varchar(max)') CleanText
    FROM #BOOK
    )
    UPDATE cteBook
    SET TEXT = CleanText;

    The WHERE clause in the CleanText subquery already filters out the characters that don't match the regular expression. However this only eliminates "bad" characters but does not replace them with something equivalent in the case of encoding mismatch. For instance, how do you solve a text like this:

    Sir George Étienne Cartier Parkway

    If you eliminate Ã‰ the result will be an incomplete street name.

    Don't just give the hungry man a fish, teach him how to catch it as well.

    the sqlist

  • t.franz - Wednesday, June 28, 2017 2:58 AM

    I doubt, that this will be fast. It does not use an explicit cursor, but practically does the same (or something worser) with its while loop.

    If someone really needs this stuff, I'd suggest to wait for the SQL 2017 release and use the new TRANSLATE() function.

    For people that don't have SQL Server 2017 you can use this Translate function for SQL Server. It's scalar, which I normally don't recommend, but it was faster than any inline table valued function that I could come up with. 

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • the sqlist - Wednesday, June 28, 2017 9:59 AM

    @Luis Cazares
    Your code is faulty. That STUFF is not needed and it causes at least the first letter omission from the text. This is how it should be:

    WITH cteBook AS(
    SELECT
        *,
      (
            SELECT SUBSTRING( TEXT, n, 1)
            FROM Tally
            WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
            AND n <= LEN(TEXT)
            FOR XML PATH(''), TYPE
        ).value('./text()[1]', 'varchar(max)') CleanText
    FROM #BOOK
    )
    UPDATE cteBook
    SET TEXT = CleanText;

    The WHERE clause in the CleanText subquery already filters out the characters that don't match the regular expression. However this only eliminates "bad" characters but does not replace them with something equivalent in the case of encoding mismatch. For instance, how do you solve a text like this:

    Sir George Étienne Cartier Parkway

    If you eliminate Ãƒâ€° the result will be an incomplete street name.

    You're right, I forgot to remove that when I included the template. The code becomes even simpler. If a replacement is needed, we would just need to use a CASE expression and remove the WHERE clause. I coded it to only remove characters because that's what the code from the article is doing. I'm usually against changing data like this because it's easy to get into trouble and change data that shouldn't be changed.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares - Wednesday, June 28, 2017 7:00 AM

    I agree with franz, this will be slow. I´m not sure about the performance of the TRANSLATE function (MS has done terrible things with FORMAT), but here's a much faster version that will actually be set based instead of using RBAR. The code is also much shorter.

    WITH cteBook AS(
      SELECT *,
       STUFF(( SELECT SUBSTRING( TEXT, n, 1)
          FROM Tally
          WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
          AND n <= LEN(TEXT)
          FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)'), 1, 1, '') CleanText
      FROM #BOOK
    )
    UPDATE cteBook
    SET TEXT = CleanText;

    With 10,000 rows the method posted in the article takes several minutes. This code completes in less than one second. It can also be converted into an iTVF to ease its use.

    It has been converted into a very well-tested, high-performing iSF: PatExclude8K. The solution would look like this:
    UPDATE #Book
    SET [Text] = NewString
    FROM #Book
    CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']');

    It's also worth noting that a permanent tally table (if that's what the tally table referenced in your code is) will perform poorly compared to an inline (I'm aware that you know this Luis - I mention for other people's benefit). 
    I put together a test to compare the SELECT statement performance of Luis' solution against patexclude8K with a serial and parallel plan using a perm tally table, then again using an inline tally table. Here's the tally DDL:
    CREATE TABLE dbo.tally
    (
    N bigint NOT NULL,
    CONSTRAINT pk_cl_dbo_tally PRIMARY KEY CLUSTERED (N ASC),
    CONSTRAINT uq_nc_dbo_tally UNIQUE NONCLUSTERED (N ASC)
    );
    INSERT dbo.tally
    SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT 1))
    FROM sys.all_columns a, sys.all_columns b;

    USE tempdb
    GO

    IF OBJECT_ID('tempdb..#Book') IS NOT NULL DROP TABLE #Book;
    CREATE TABLE #Book (iid int, TEXT varchar(8000));

    INSERT #Book
    SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), REPLACE(newid(),'9',' ')
    FROM sys.all_columns a, sys.all_columns b;

    Test #1 - PatExclude8K vs solution with perm tally

    PRINT 'Luis - serial'+char(10)+replicate('-', 50);
    GO
    DECLARE @st datetime = getdate(), @x varchar(8000);
    SELECT @x =
    ( SELECT SUBSTRING( TEXT, n, 1)
      FROM dbo.Tally
      WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
      AND n <= LEN(TEXT)
      FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)') --CleanText
    FROM #Book
    OPTION (MAXDOP 1);
    PRINT DATEDIFF(MS, @st, getdate());
    GO 5

    PRINT char(10)+'PatExclude8K - serial'+char(10)+replicate('-', 50);
    GO
    DECLARE @st datetime = getdate(), @x varchar(8000);
    SELECT @x = NewString
    FROM #Book
    CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']')
    OPTION (MAXDOP 1);
    PRINT DATEDIFF(MS, @st, getdate());
    GO 5

    PRINT 'Luis - parallel'+char(10)+replicate('-', 50);
    GO
    DECLARE @st datetime = getdate(), @x varchar(8000);
    SELECT @x =
    ( SELECT SUBSTRING( TEXT, n, 1)
      FROM dbo.Tally
      WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
      AND n <= LEN(TEXT)
      FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(8000)') --CleanText
    FROM #Book
    CROSS APPLY dbo.make_parallel();
    PRINT DATEDIFF(MS, @st, getdate());
    GO 5

    PRINT char(10)+'PatExclude8K - parallel'+char(10)+replicate('-', 50);
    GO
    DECLARE @st datetime = getdate(), @x varchar(8000);
    SELECT @x = NewString
    FROM #Book
    CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']')
    CROSS APPLY dbo.make_parallel();
    PRINT DATEDIFF(MS, @st, getdate());
    GO 5

    Results:

    Luis - serial
    --------------------------------------------------
    Beginning execution loop
    4387
    3830
    4030
    4250
    3917
    Batch execution completed 5 times.

    PatExclude8K - serial
    --------------------------------------------------
    Beginning execution loop
    2314
    2166
    2300
    2553
    2224
    Batch execution completed 5 times.

    Luis - parallel
    --------------------------------------------------
    Beginning execution loop
    1120
    1153
    1160
    1387
    1223
    Batch execution completed 5 times.

    PatExclude8K - parallel
    --------------------------------------------------
    Beginning execution loop
    677
    660
    700
    683
    716
    Batch execution completed 5 times.

    Test #2 - Same test but with an inline tally for Luis' solution

    PRINT 'Luis - serial(inline tally)'+char(10)+replicate('-', 50);
    GO
    DECLARE @st datetime = getdate(), @x varchar(8000);
    SELECT @x =
    ( SELECT SUBSTRING( TEXT, n, 1)
      FROM
      (
      SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1))
      FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) a(x),
        (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) b(x),
        (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) c(x),
        (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) d(x)
      ) iTally(N)
      WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
      AND n <= LEN(TEXT)
      FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)') --CleanText
    FROM #Book
    OPTION (MAXDOP 1);
    PRINT DATEDIFF(MS, @st, getdate());
    GO 5

    PRINT char(10)+'PatExclude8K - serial'+char(10)+replicate('-', 50);
    GO
    DECLARE @st datetime = getdate(), @x varchar(8000);
    SELECT @x = NewString
    FROM #Book
    CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']')
    OPTION (MAXDOP 1);
    PRINT DATEDIFF(MS, @st, getdate());
    GO 5

    PRINT 'Luis - parallel (inline tally)'+char(10)+replicate('-', 50);
    GO
    DECLARE @st datetime = getdate(), @x varchar(8000);
    SELECT @x =
    ( SELECT SUBSTRING( TEXT, n, 1)
      FROM
      (
      SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1))
      FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) a(x),
        (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) b(x),
        (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) c(x),
        (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) d(x)
      ) iTally(N)
      WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin
      AND n <= LEN(TEXT)
      FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)') --CleanText
    FROM #Book
    CROSS APPLY dbo.make_parallel();
    PRINT DATEDIFF(MS, @st, getdate());
    GO 5

    PRINT char(10)+'PatExclude8K - parallel'+char(10)+replicate('-', 50);
    GO
    DECLARE @st datetime = getdate(), @x varchar(8000);
    SELECT @x = NewString
    FROM #Book
    CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']')
    CROSS APPLY dbo.make_parallel();
    PRINT DATEDIFF(MS, @st, getdate());
    GO 5

    Results

    Luis - serial (inline tally)
    --------------------------------------------------
    Beginning execution loop
    2376
    2384
    2323
    2334
    2387
    Batch execution completed 5 times.

    PatExclude8K - serial
    --------------------------------------------------
    Beginning execution loop
    2217
    2273
    2307
    2170
    2370
    Batch execution completed 5 times.

    Luis - parallel (inline tally)
    --------------------------------------------------
    Beginning execution loop
    684
    663
    694
    716
    664
    Batch execution completed 5 times.

    PatExclude8K - parallel
    --------------------------------------------------
    Beginning execution loop
    677
    746
    827
    730
    700
    Batch execution completed 5 times.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • that's some awesome coding there!  Thanks for the improvements 🙂

     

    ST

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply