REPLACE Multiple Spaces with One

Question

REPLACE Multiple Spaces with One

Viewing 15 posts - 166 through 180 (of 425 total)

You must be logged in to reply to this topic. Login to reply

Michael Meierruth SSChampion Points: 10051 More actions · Answer 1

Paul White (11/19/2009)
Just to draw breath and summarize a bit:
The fastest CLR solutions appear to be almost 20x faster than the fastest T-SQL implementation...

How did you make this calculation if the fastest T-SQL is 306 and the fastest CLR is 100?

C# Gnu SSCertifiable Points: 5973 More actions · Answer 2

Michael Meierruth (11/19/2009)
Paul White (11/19/2009)
Just to draw breath and summarize a bit:
The fastest CLR solutions appear to be almost 20x faster than the fastest T-SQL implementation...
How did you make this calculation if the fastest T-SQL is 306 and the fastest CLR is 100?

Michael is right of course ... but I want I want to compare fastest SQL spaces & Tabs !

C# Gnu
____________________________________________________

Excel Database Tasks : efficient business data processing
Version 3.3 released February 2015

Paul White SSC Guru Points: 150468 More actions · Answer 3

Michael Meierruth (11/19/2009)
How did you make this calculation if the fastest T-SQL is 306 and the fastest CLR is 100?

Hmmm - I missed yours when I scanned down the results list - sorry about that! :blush:

Ok, three times faster.

I must go back and try your code for myself!

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Paul White SSC Guru Points: 150468 More actions · Answer 4

Interesting. All the T-SQL methods (including Michael's) seem to suffer from the collation problem I described before. Running a 100K row test (using the script I posted before), if I use a binary collation like LATIN1_GENERAL_BIN for the Data column in the test table, I get this:

Fastest CLR: 563 ms

Michael's T-SQL: 2,172 ms

Using my default collation of LATIN1_GENERAL_CI_AS (this is my server, test database, and tempdb collation):

Fastest CLR: 563 ms (yes, exactly the same)

Michaels T-SQL: 38,859 ms

To be fair, I tried the function-wrapped version, a function-wrapped version WITH SCHEMABINDING, and in-lining Michael's code (the in-lined version was fastest, and follows):

SELECT

@Bitbucket = --dbo.fn_CleanUp_MichaelMeierruth_MKII(Data)

replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(Data)),

' ',' '),

' ',' ')

FROM

#TestCleanSpaces;

Michael's routine went back to 2,187 ms if I added COLLATE LATIN1_GENERAL_BIN after the Data column name inside all those replaces. Clearly, something very funky is going on here. So, for the best performance using the T-SQL routines, it seems necessary to manually collate the input data to a binary collation...?!

Fast:

replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(Data COLLATE LATIN1_GENERAL_BIN))

Slow:

replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(Data COLLATE LATIN1_GENERAL_CI_AS))

I feel a Connect item coming on - this is a new one on me.

Paul

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Paul White SSC Guru Points: 150468 More actions · Answer 5

A quick stand-alone repro for the collation thing:

USE tempdb;

GO

CREATE TABLE dbo.SpaceTest

(

Data VARCHAR(4000) NOT NULL

);

GO

INSERT dbo.SpaceTest

SELECT TOP (10000)

SPACE(4000)

FROM master.sys.allocation_units A1,

master.sys.allocation_units A2,

master.sys.allocation_units A3,

master.sys.allocation_units A4;

GO

DECLARE

@Bitbucket VARCHAR(4000);

SET STATISTICS TIME ON

-- SLOW

SELECT

@Bitbucket = --dbo.fn_CleanUp_MichaelMeierruth_MKII(Data)

REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Data COLLATE LATIN1_GENERAL_CI_AS,

' ',' '),

' ',' ')

FROM

dbo.SpaceTest;

-- FAST

SELECT

@Bitbucket = --dbo.fn_CleanUp_MichaelMeierruth_MKII(Data)

REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Data COLLATE LATIN1_GENERAL_BIN,

' ',' '),

' ',' ')

FROM

dbo.SpaceTest;

SET STATISTICS TIME OFF;

DROP TABLE dbo.SpaceTest;

The other odd thing is that if I change the data type in the table from VARCHAR(4000) to CHAR(4000), the problem goes away. I'm convincing myself that this is either an optimizer bug, or some effect of the complex collation precedence rules (coercible-default and all that).

Paul

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 6

Paul, I seem to recall other topics from the years whereby BIN collation was the way to go for efficiency reasons if that was your most important criteria.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

C# Gnu SSCertifiable Points: 5973 More actions · Answer 7

Hi Paul

Would the script pick up collation sequence from master.sys.allocation_units ?

Just wondering if "sp_help SpaceTest" shows a bit collation..?

Why don't we standardise and all use the same collation sequence, life would be easier:-)

Screw

C# Gnu
____________________________________________________

Excel Database Tasks : efficient business data processing
Version 3.3 released February 2015

C# Gnu SSCertifiable Points: 5973 More actions · Answer 8

Hi

I have added tab support to Michael's code by initially replacing tab with a space, which might not be best aproach.

Anyhow here is the code and the leader board:

CREATE FUNCTION dbo.fn_CleanUp_MichaelMeierruth_MKIV(@S VARCHAR(8000))

RETURNS VARCHAR(8000)

BEGIN

RETURN replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(replace(@s,CHAR(9),' '))),

' ',' '),

' ',' ')

END

It has however reduced performance, maybe there is a better way to achieve this in SQL.

SQL function: Replace Technique : but using replacement chars ||*9*9||

2785

SQL function :Brigzy (C#Screw) Looping

2430

SQL function: Jeff Original : single bell char

1958

SQL function: Michael Meierruth : with hard coded spaces and TAB support

484

CLR: Brigzy (C#Screw) Looping

393

CLR: Not looping using ||*9*9|| technique in C#

383

SQL function: Michael Meierruth : with hard coded spaces

324

SQL function: Michael Meierruth Original

314

SQL function: Michael Meierruth III - extra replace

312

CLR: using Jeff's single char technique in C#

282

CLR: C#Screw II - Spaces and TABS

229

CLR: Flo - Spaces and TABS - v.compact

175

CLR: Michael Meierruth - C#Screw conversion: hard coded spaces

171

CLR: Michael Meierruth - C#Screw conversion: with extra Replace

171

CLR :using JCBnew space replacer C#

167

CLR: using Flo s space replacer C#

140

CLR: using Paul White space replacer C#

114

CLR: C#Screw : Slightly reduced Paul White's function and Added Support TABS

112

CLR: Paul White's function, reduced by C#Screw with Pointers ideas from Mark-101232

112

CLR: Paul White's function with Pointers by Mark-101232

104

C# Gnu
____________________________________________________

Excel Database Tasks : efficient business data processing
Version 3.3 released February 2015

C# Gnu SSCertifiable Points: 5973 More actions · Answer 9

Hang on a minute !!! :w00t:

I realised that doing all replacements / loops whatever in the real world could well be a waste of time.

This is because it is surely more lilely that the data doesn't include any double spaces! which might be true of say 95% of your data?

So we only want to go to each level of replace if we really neeed to ... so recursion seems good:

CREATE FUNCTION dbo.fn_CleanUp_Recursion_Brigzy(@Data varchar(max))

RETURNS VarChar(Max)

AS

BEGIN

IF CHARINDEX(' ',@Data) > 0

BEGIN

SET @data = REPLACE(@Data,' ',' ')

IF CHARINDEX(' ',@Data) > 0

SELECT @data = dbo.fn_CleanUp_Recursion_Brigzy(@Data)

END

RETURN @data

END

Unfortuanately our test data is FULL of double spaces so this doesn't appear high in the leader

board::sick:

SQL function: Jeff Original : single bell char

5279

SQL function: C#Screw : Recursive

2636

SQL function: Replace Technique : but using replacement chars ||*9*9||

2611

SQL function :Brigzy (C#Screw) Looping

2514

SQL function: Michael Meierruth : with hard coded spaces and TAB support

391

SQL function: Michael Meierruth Original

318

CLR: Brigzy (C#Screw) Looping

313

SQL function: Michael Meierruth III - extra replace

306

SQL function: Michael Meierruth : with hard coded spaces

305

CLR: Not looping using ||*9*9|| technique in C#

280

CLR: using Jeff's single char technique in C#

236

CLR: C#Screw II - Spaces and TABS

189

CLR: Flo - Spaces and TABS - v.compact

171

CLR :using JCBnew space replacer C#

167

CLR: Michael Meierruth - C#Screw conversion: with extra Replace

156

CLR: Michael Meierruth - C#Screw conversion: hard coded spaces

150

CLR: C#Screw : Slightly reduced Paul White's function and Added Support TABS

116

CLR: Paul White's function, reduced by C#Screw with Pointers ideas from Mark-101232

112

CLR: using Flo s space replacer C#

108

CLR: using Paul White space replacer C#

106

CLR: Paul White's function with Pointers by Mark-101232

102

However if test data contained allot less double spaces this could look very different?

:discuss:

What do you think?

Edit: Anyone got some real data to try it on?

C# Gnu
____________________________________________________

Excel Database Tasks : efficient business data processing
Version 3.3 released February 2015

C# Gnu SSCertifiable Points: 5973 More actions · Answer 10

Hi folks,

Well this is the 'last post' from me but this is what things look like if there are no double spaces in the data at all:

SQL function: Michael Meierruth : with hard coded spaces and TAB support

863

SQL function: Replace Technique : but using replacement chars ||*9*9||

858

SQL function: C#Screw : Recursive

837

SQL function: Michael Meierruth III - extra replace

728

SQL function: Michael Meierruth Original

716

SQL function: Michael Meierruth : with hard coded spaces

660

SQL function :Brigzy (C#Screw) Looping

503

SQL function: Jeff Original : single bell char

444

CLR: Michael Meierruth - C#Screw conversion: with extra Replace

317

CLR: Michael Meierruth - C#Screw conversion: hard coded spaces

307

CLR: using Jeff's single char technique in C#

298

CLR: Not looping using ||*9*9|| technique in C#

268

CLR: C#Screw II - Spaces and TABS

256

CLR: Brigzy (C#Screw) Looping

246

CLR: Flo - Spaces and TABS - v.compact

232

CLR :using JCBnew space replacer C#

230

CLR: using Paul White space replacer C#

224

CLR: Paul White's function, reduced by C#Screw with Pointers ideas from Mark-101232

221

CLR: Paul White's function with Pointers by Mark-101232

214

CLR: using Flo s space replacer C#

210

CLR: C#Screw : Slightly reduced Paul White's function and Added Support TABS

205

My own conclusions

-----------------------------------

I wouldn't like to explain some of the results above entirely - but it is nice to see Jeff has the fastest SQL, and interesting to see SQL looping just behind.

The SQL looping does very well when there are no double spaces as it drops out on the first line of code. How Jeff's SQL can be faster than that I cannot explain.

CLR should be at ~~most~~ least 2x faster; ~~so only~~ but considerably faster if there really is a hefty stack of looping/string work to do.

So there we are folks, almost full circle

C# Gnu
____________________________________________________

Excel Database Tasks : efficient business data processing
Version 3.3 released February 2015

Paul White SSC Guru Points: 150468 More actions · Answer 11

Great stuff 'C# screw' - awesome work on this thread

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Paul White SSC Guru Points: 150468 More actions · Answer 12

TheSQLGuru (11/19/2009)
Paul, I seem to recall other topics from the years whereby BIN collation was the way to go for efficiency reasons if that was your most important criteria.

Oh absolutely - though I also seem to recall that the SQL collations are generally more efficient than the Windows collations on non-Unicode data - I think that's right.

BIN surely makes a difference; in fact when using CHAR (to avoid the problem in the first place) the BIN collation is 5-15% faster than my default collation. I must admit I am less than clear how sort order affects nested REPLACEs, but at least REPLACE is documented as being collation-aware so there is some sense to it.

The real issue though, is what the engine is doing when using VARCHAR with a non-BIN collation. Recall that with a BIN collation, Michael's method took 2 seconds, but with LATIN1_GENERAL_CI_AS the execution time blew out to 40 seconds! There is only a very small difference in the XML show plan (an extra non-implicit convert to VARCHAR(4000) directly on the column) but it is hard to see why that should make so much difference.

It's almost as if there are optimizations which allow in-place string replacement which are defeated when using VARCHAR with collation conversions. I can only imagine that the strings are being copied in memory for each row (maybe once per REPLACE) rather than being modified in place. The sheer number of allocations and copies might go some way to explain the difference.

I will take another quick look today, but unless something obvious turns up, or someone else out there has a brainwave, this is heading for Connect for sure.

Paul

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

TheSQLGuru SSC Guru Points: 134017 More actions · Answer 13

I am not sure that blowing out a varchar(4000) conversion for lord knows what is a 'minor plan difference'!!

I bet the memory crap is happening for every iteration of the REPLACE (perhaps with the vc4K thing thrown in for good measure). That really could be some significant CPU/memory hits.

Speaking of which, can the main tester for this give a report back on CPU utilization for the test runs? And if you feel frisky, check wait stats deltas for each run. I am wondering if we might not see something show up there indicating memory and/or cpu pressure.

Best,
Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru on googles mail service

Paul White SSC Guru Points: 150468 More actions · Answer 14

TheSQLGuru (11/19/2009)
I am not sure that blowing out a varchar(4000) conversion for lord knows what is a 'minor plan difference'!!

Ha, yeah. No, I just meant that the XML show plan is visually very, very similar (the graphical plan appears identitical!) there's just one small extra node there...

Just so everyone knows, my timing results have always been CPU-only. I will admit to being lazy and using SET STATISTICS TIME instead of total_worker_time / execution_count from sys.dm_exec_query_stats or Profiler which I normally prefer.

Thanks for the other thoughts too!

Paul

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

Paul White SSC Guru Points: 150468 More actions · Answer 15

Paul White (11/19/2009)
I must admit I am less than clear how sort order affects nested REPLACEs, but at least REPLACE is documented as being collation-aware so there is some sense to it.

Of course - collation is important when comparing the string to replace. D'oh.

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

REPLACE Multiple Spaces with One

Cookies on SQLServerCentral