Replace Bad Characters in Table

Question

Replace Bad Characters in Table

souLTower

SSCarpal Tunnel

Points: 4026
More actions
June 27, 2017 at 11:16 pm

#402223

Comments posted to this topic are about the item Replace Bad Characters in Table

Viewing 9 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply

Thomas Franz Hall of Fame Points: 3992 More actions · Answer 1

I doubt, that this will be fast. It does not use an explicit cursor, but practically does the same (or something worser) with its while loop.

If someone really needs this stuff, I'd suggest to wait for the SQL 2017 release and use the new TRANSLATE() function.

God is real, unless declared integer.

Luis Cazares SSC Guru Points: 183706 More actions · Answer 2

I agree with franz, this will be slow. I´m not sure about the performance of the TRANSLATE function (MS has done terrible things with FORMAT), but here's a much faster version that will actually be set based instead of using RBAR. The code is also much shorter.
WITH cteBook AS( SELECT *, STUFF(( SELECT SUBSTRING( TEXT, n, 1) FROM Tally WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin AND n <= LEN(TEXT) FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)'), 1, 1, '') CleanText FROM #BOOK ) UPDATE cteBook SET TEXT = CleanText;
With 10,000 rows the method posted in the article takes several minutes. This code completes in less than one second. It can also be converted into an iTVF to ease its use.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Solomon Rutzky SSCoach Points: 16431 More actions · Answer 3

Regardless of anything else, when doing string operations of this nature, there are two important best practices to follow:

ALWAYS use a binary Collation so that you match the characters you are expecting to match. Otherwise you could match unexpected characters, the set of which varies by Collation. For example, "a" will match other forms of "a" (with and without accents), when using an accent insensitive Collation (i.e. anything with _AI in the name). It is preferable to use a _BIN2 Collation rather than the deprecated _BIN Collations (as of SQL Server 2008), and it is preferable to use the most recent version of the Collation. For anyone using "Latin1_General_*" on SQL Server 2008 or newer, the preferred binary Collation is: Latin1_General_100_BIN2. Luis's code above correctly uses a binary Collation and so is more accurate than the code in the article.
It is best to use the NVARCHAR datatype as it is more inclusive and won't lead to data loss. This means also prefixing string literals with a capital "N".

Take care, Solomon..

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR

the sqlist SSCrazy Points: 2398 More actions · Answer 4

@Luis Cazares
Your code is faulty. That STUFF is not needed and it causes at least the first letter omission from the text. This is how it should be:
WITH cteBook AS( SELECT *, ( SELECT SUBSTRING( TEXT, n, 1) FROM Tally WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin AND n <= LEN(TEXT) FOR XML PATH(''), TYPE ).value('./text()[1]', 'varchar(max)') CleanText FROM #BOOK ) UPDATE cteBook SET TEXT = CleanText;
The WHERE clause in the CleanText subquery already filters out the characters that don't match the regular expression. However this only eliminates "bad" characters but does not replace them with something equivalent in the case of encoding mismatch. For instance, how do you solve a text like this:

Sir George Ã‰tienne Cartier Parkway

If you eliminate Ã‰ the result will be an incomplete street name.

Don't just give the hungry man a fish, teach him how to catch it as well.

the sqlist

Alan Burstein SSC Guru Points: 61152 More actions · Answer 5

t.franz - Wednesday, June 28, 2017 2:58 AM
I doubt, that this will be fast. It does not use an explicit cursor, but practically does the same (or something worser) with its while loop.
If someone really needs this stuff, I'd suggest to wait for the SQL 2017 release and use the new TRANSLATE() function.

For people that don't have SQL Server 2017 you can use this Translate function for SQL Server. It's scalar, which I normally don't recommend, but it was faster than any inline table valued function that I could come up with.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

Luis Cazares SSC Guru Points: 183706 More actions · Answer 6

the sqlist - Wednesday, June 28, 2017 9:59 AM
@Luis Cazares
Your code is faulty. That STUFF is not needed and it causes at least the first letter omission from the text. This is how it should be:
WITH cteBook AS( SELECT *, ( SELECT SUBSTRING( TEXT, n, 1) FROM Tally WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin AND n <= LEN(TEXT) FOR XML PATH(''), TYPE ).value('./text()[1]', 'varchar(max)') CleanText FROM #BOOK ) UPDATE cteBook SET TEXT = CleanText;
The WHERE clause in the CleanText subquery already filters out the characters that don't match the regular expression. However this only eliminates "bad" characters but does not replace them with something equivalent in the case of encoding mismatch. For instance, how do you solve a text like this:
Sir George Ãƒâ€°tienne Cartier Parkway
If you eliminate Ãƒâ€° the result will be an incomplete street name.

You're right, I forgot to remove that when I included the template. The code becomes even simpler. If a replacement is needed, we would just need to use a CASE expression and remove the WHERE clause. I coded it to only remove characters because that's what the code from the article is doing. I'm usually against changing data like this because it's easy to get into trouble and change data that shouldn't be changed.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

Alan Burstein SSC Guru Points: 61152 More actions · Answer 7

Luis Cazares - Wednesday, June 28, 2017 7:00 AM
I agree with franz, this will be slow. IÂ´m not sure about the performance of the TRANSLATE function (MS has done terrible things with FORMAT), but here's a much faster version that will actually be set based instead of using RBAR. The code is also much shorter.
WITH cteBook AS( SELECT *, STUFF(( SELECT SUBSTRING( TEXT, n, 1) FROM Tally WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin AND n <= LEN(TEXT) FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)'), 1, 1, '') CleanText FROM #BOOK ) UPDATE cteBook SET TEXT = CleanText;
With 10,000 rows the method posted in the article takes several minutes. This code completes in less than one second. It can also be converted into an iTVF to ease its use.

It has been converted into a very well-tested, high-performing iSF: PatExclude8K. The solution would look like this:
UPDATE #Book SET [Text] = NewString FROM #Book CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']');

It's also worth noting that a permanent tally table (if that's what the tally table referenced in your code is) will perform poorly compared to an inline (I'm aware that you know this Luis - I mention for other people's benefit).
I put together a test to compare the SELECT statement performance of Luis' solution against patexclude8K with a serial and parallel plan using a perm tally table, then again using an inline tally table. Here's the tally DDL:
CREATE TABLE dbo.tally ( N bigint NOT NULL, CONSTRAINT pk_cl_dbo_tally PRIMARY KEY CLUSTERED (N ASC), CONSTRAINT uq_nc_dbo_tally UNIQUE NONCLUSTERED (N ASC) ); INSERT dbo.tally SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM sys.all_columns a, sys.all_columns b;

USE tempdb GO

IF OBJECT_ID('tempdb..#Book') IS NOT NULL DROP TABLE #Book; CREATE TABLE #Book (iid int, TEXT varchar(8000));

INSERT #Book SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), REPLACE(newid(),'9',' ') FROM sys.all_columns a, sys.all_columns b;

Test #1 - PatExclude8K vs solution with perm tally
PRINT 'Luis - serial'+char(10)+replicate('-', 50); GO DECLARE @st datetime = getdate(), @x varchar(8000); SELECT @x = ( SELECT SUBSTRING( TEXT, n, 1) FROM dbo.Tally WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin AND n <= LEN(TEXT) FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)') --CleanText FROM #Book OPTION (MAXDOP 1); PRINT DATEDIFF(MS, @st, getdate()); GO 5

PRINT char(10)+'PatExclude8K - serial'+char(10)+replicate('-', 50); GO DECLARE @st datetime = getdate(), @x varchar(8000); SELECT @x = NewString FROM #Book CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']') OPTION (MAXDOP 1); PRINT DATEDIFF(MS, @st, getdate()); GO 5

PRINT 'Luis - parallel'+char(10)+replicate('-', 50); GO DECLARE @st datetime = getdate(), @x varchar(8000); SELECT @x = ( SELECT SUBSTRING( TEXT, n, 1) FROM dbo.Tally WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin AND n <= LEN(TEXT) FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(8000)') --CleanText FROM #Book CROSS APPLY dbo.make_parallel(); PRINT DATEDIFF(MS, @st, getdate()); GO 5

PRINT char(10)+'PatExclude8K - parallel'+char(10)+replicate('-', 50); GO DECLARE @st datetime = getdate(), @x varchar(8000); SELECT @x = NewString FROM #Book CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']') CROSS APPLY dbo.make_parallel(); PRINT DATEDIFF(MS, @st, getdate()); GO 5

Results:
Luis - serial -------------------------------------------------- Beginning execution loop 4387 3830 4030 4250 3917 Batch execution completed 5 times.

PatExclude8K - serial -------------------------------------------------- Beginning execution loop 2314 2166 2300 2553 2224 Batch execution completed 5 times.

Luis - parallel -------------------------------------------------- Beginning execution loop 1120 1153 1160 1387 1223 Batch execution completed 5 times.

PatExclude8K - parallel -------------------------------------------------- Beginning execution loop 677 660 700 683 716 Batch execution completed 5 times.

Test #2 - Same test but with an inline tally for Luis' solution
PRINT 'Luis - serial(inline tally)'+char(10)+replicate('-', 50); GO DECLARE @st datetime = getdate(), @x varchar(8000); SELECT @x = ( SELECT SUBSTRING( TEXT, n, 1) FROM ( SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) a(x), (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) b(x), (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) c(x), (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) d(x) ) iTally(N) WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin AND n <= LEN(TEXT) FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)') --CleanText FROM #Book OPTION (MAXDOP 1); PRINT DATEDIFF(MS, @st, getdate()); GO 5

PRINT char(10)+'PatExclude8K - serial'+char(10)+replicate('-', 50); GO DECLARE @st datetime = getdate(), @x varchar(8000); SELECT @x = NewString FROM #Book CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']') OPTION (MAXDOP 1); PRINT DATEDIFF(MS, @st, getdate()); GO 5

PRINT 'Luis - parallel (inline tally)'+char(10)+replicate('-', 50); GO DECLARE @st datetime = getdate(), @x varchar(8000); SELECT @x = ( SELECT SUBSTRING( TEXT, n, 1) FROM ( SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) a(x), (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) b(x), (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) c(x), (VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) d(x) ) iTally(N) WHERE SUBSTRING( TEXT, n, 1) LIKE '[0-9a-zA-Z ,.:;'']' COLLATE Latin1_General_Bin AND n <= LEN(TEXT) FOR XML PATH(''), TYPE).value('./text()[1]', 'varchar(max)') --CleanText FROM #Book CROSS APPLY dbo.make_parallel(); PRINT DATEDIFF(MS, @st, getdate()); GO 5

PRINT char(10)+'PatExclude8K - parallel'+char(10)+replicate('-', 50); GO DECLARE @st datetime = getdate(), @x varchar(8000); SELECT @x = NewString FROM #Book CROSS APPLY dbo.patexclude8K([Text], '[^0-9a-zA-Z ,.:;'']') CROSS APPLY dbo.make_parallel(); PRINT DATEDIFF(MS, @st, getdate()); GO 5

Results
Luis - serial (inline tally) -------------------------------------------------- Beginning execution loop 2376 2384 2323 2334 2387 Batch execution completed 5 times.

PatExclude8K - serial -------------------------------------------------- Beginning execution loop 2217 2273 2307 2170 2370 Batch execution completed 5 times.

Luis - parallel (inline tally) -------------------------------------------------- Beginning execution loop 684 663 694 716 664 Batch execution completed 5 times.

PatExclude8K - parallel -------------------------------------------------- Beginning execution loop 677 746 827 730 700 Batch execution completed 5 times.

"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

-- Itzik Ben-Gan 2001

souLTower SSCarpal Tunnel Points: 4026 More actions · Answer 8

that's some awesome coding there! Thanks for the improvements 🙂

ST