REPLACE Multiple Spaces with One

  • Paul White (11/19/2009)


    Just to draw breath and summarize a bit:

    The fastest CLR solutions appear to be almost 20x faster than the fastest T-SQL implementation...

    How did you make this calculation if the fastest T-SQL is 306 and the fastest CLR is 100?

  • Michael Meierruth (11/19/2009)


    Paul White (11/19/2009)


    Just to draw breath and summarize a bit:

    The fastest CLR solutions appear to be almost 20x faster than the fastest T-SQL implementation...

    How did you make this calculation if the fastest T-SQL is 306 and the fastest CLR is 100?

    Michael is right of course ... but I want I want to compare fastest SQL spaces & Tabs !

  • Michael Meierruth (11/19/2009)


    How did you make this calculation if the fastest T-SQL is 306 and the fastest CLR is 100?

    Hmmm - I missed yours when I scanned down the results list - sorry about that! :blush:

    Ok, three times faster.

    I must go back and try your code for myself!

  • Interesting. All the T-SQL methods (including Michael's) seem to suffer from the collation problem I described before. Running a 100K row test (using the script I posted before), if I use a binary collation like LATIN1_GENERAL_BIN for the Data column in the test table, I get this:

    Fastest CLR: 563 ms

    Michael's T-SQL: 2,172 ms

    Using my default collation of LATIN1_GENERAL_CI_AS (this is my server, test database, and tempdb collation):

    Fastest CLR: 563 ms (yes, exactly the same)

    Michaels T-SQL: 38,859 ms

    To be fair, I tried the function-wrapped version, a function-wrapped version WITH SCHEMABINDING, and in-lining Michael's code (the in-lined version was fastest, and follows):

    SELECT

    @Bitbucket = --dbo.fn_CleanUp_MichaelMeierruth_MKII(Data)

    replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(Data)),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ')

    FROM

    #TestCleanSpaces;

    Michael's routine went back to 2,187 ms if I added COLLATE LATIN1_GENERAL_BIN after the Data column name inside all those replaces. Clearly, something very funky is going on here. So, for the best performance using the T-SQL routines, it seems necessary to manually collate the input data to a binary collation...?!

    Fast:

    replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(Data COLLATE LATIN1_GENERAL_BIN))

    Slow:

    replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(Data COLLATE LATIN1_GENERAL_CI_AS))

    I feel a Connect item coming on - this is a new one on me.

    Paul

  • A quick stand-alone repro for the collation thing:

    USE tempdb;

    GO

    CREATE TABLE dbo.SpaceTest

    (

    Data VARCHAR(4000) NOT NULL

    );

    GO

    INSERT dbo.SpaceTest

    SELECT TOP (10000)

    SPACE(4000)

    FROM master.sys.allocation_units A1,

    master.sys.allocation_units A2,

    master.sys.allocation_units A3,

    master.sys.allocation_units A4;

    GO

    DECLARE

    @Bitbucket VARCHAR(4000);

    SET STATISTICS TIME ON

    -- SLOW

    SELECT

    @Bitbucket = --dbo.fn_CleanUp_MichaelMeierruth_MKII(Data)

    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Data COLLATE LATIN1_GENERAL_CI_AS,

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ')

    FROM

    dbo.SpaceTest;

    -- FAST

    SELECT

    @Bitbucket = --dbo.fn_CleanUp_MichaelMeierruth_MKII(Data)

    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Data COLLATE LATIN1_GENERAL_BIN,

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ')

    FROM

    dbo.SpaceTest;

    SET STATISTICS TIME OFF;

    DROP TABLE dbo.SpaceTest;

    The other odd thing is that if I change the data type in the table from VARCHAR(4000) to CHAR(4000), the problem goes away. I'm convincing myself that this is either an optimizer bug, or some effect of the complex collation precedence rules (coercible-default and all that).

    Paul

  • Paul, I seem to recall other topics from the years whereby BIN collation was the way to go for efficiency reasons if that was your most important criteria.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Hi Paul

    Would the script pick up collation sequence from master.sys.allocation_units ?

    Just wondering if "sp_help SpaceTest" shows a bit collation..?

    Why don't we standardise and all use the same collation sequence, life would be easier:-)

    Screw

  • Hi

    I have added tab support to Michael's code by initially replacing tab with a space, which might not be best aproach.

    Anyhow here is the code and the leader board:

    CREATE FUNCTION dbo.fn_CleanUp_MichaelMeierruth_MKIV(@S VARCHAR(8000))

    RETURNS VARCHAR(8000)

    BEGIN

    RETURN replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(replace(@s,CHAR(9),' '))),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ')

    END

    It has however reduced performance, maybe there is a better way to achieve this in SQL.

    SQL function: Replace Technique : but using replacement chars ||*9*9||

    2785

    SQL function :Brigzy (C#Screw) Looping

    2430

    SQL function: Jeff Original : single bell char

    1958

    SQL function: Michael Meierruth : with hard coded spaces and TAB support

    484

    CLR: Brigzy (C#Screw) Looping

    393

    CLR: Not looping using ||*9*9|| technique in C#

    383

    SQL function: Michael Meierruth : with hard coded spaces

    324

    SQL function: Michael Meierruth Original

    314

    SQL function: Michael Meierruth III - extra replace

    312

    CLR: using Jeff's single char technique in C#

    282

    CLR: C#Screw II - Spaces and TABS

    229

    CLR: Flo - Spaces and TABS - v.compact

    175

    CLR: Michael Meierruth - C#Screw conversion: hard coded spaces

    171

    CLR: Michael Meierruth - C#Screw conversion: with extra Replace

    171

    CLR :using JCBnew space replacer C#

    167

    CLR: using Flo s space replacer C#

    140

    CLR: using Paul White space replacer C#

    114

    CLR: C#Screw : Slightly reduced Paul White's function and Added Support TABS

    112

    CLR: Paul White's function, reduced by C#Screw with Pointers ideas from Mark-101232

    112

    CLR: Paul White's function with Pointers by Mark-101232

    104

  • Hang on a minute !!! :w00t:

    I realised that doing all replacements / loops whatever in the real world could well be a waste of time.

    This is because it is surely more lilely that the data doesn't include any double spaces! which might be true of say 95% of your data?

    So we only want to go to each level of replace if we really neeed to ... so recursion seems good:

    CREATE FUNCTION dbo.fn_CleanUp_Recursion_Brigzy(@Data varchar(max))

    RETURNS VarChar(Max)

    AS

    BEGIN

    IF CHARINDEX(' ',@Data) > 0

    BEGIN

    SET @data = REPLACE(@Data,' ',' ')

    IF CHARINDEX(' ',@Data) > 0

    SELECT @data = dbo.fn_CleanUp_Recursion_Brigzy(@Data)

    END

    RETURN @data

    END

    Unfortuanately our test data is FULL of double spaces so this doesn't appear high in the leader

    board::sick:

    SQL function: Jeff Original : single bell char

    5279

    SQL function: C#Screw : Recursive

    2636

    SQL function: Replace Technique : but using replacement chars ||*9*9||

    2611

    SQL function :Brigzy (C#Screw) Looping

    2514

    SQL function: Michael Meierruth : with hard coded spaces and TAB support

    391

    SQL function: Michael Meierruth Original

    318

    CLR: Brigzy (C#Screw) Looping

    313

    SQL function: Michael Meierruth III - extra replace

    306

    SQL function: Michael Meierruth : with hard coded spaces

    305

    CLR: Not looping using ||*9*9|| technique in C#

    280

    CLR: using Jeff's single char technique in C#

    236

    CLR: C#Screw II - Spaces and TABS

    189

    CLR: Flo - Spaces and TABS - v.compact

    171

    CLR :using JCBnew space replacer C#

    167

    CLR: Michael Meierruth - C#Screw conversion: with extra Replace

    156

    CLR: Michael Meierruth - C#Screw conversion: hard coded spaces

    150

    CLR: C#Screw : Slightly reduced Paul White's function and Added Support TABS

    116

    CLR: Paul White's function, reduced by C#Screw with Pointers ideas from Mark-101232

    112

    CLR: using Flo s space replacer C#

    108

    CLR: using Paul White space replacer C#

    106

    CLR: Paul White's function with Pointers by Mark-101232

    102

    However if test data contained allot less double spaces this could look very different?

    :discuss:

    What do you think?

    Edit: Anyone got some real data to try it on?

  • Hi folks,

    Well this is the 'last post' from me but this is what things look like if there are no double spaces in the data at all:

    SQL function: Michael Meierruth : with hard coded spaces and TAB support

    863

    SQL function: Replace Technique : but using replacement chars ||*9*9||

    858

    SQL function: C#Screw : Recursive

    837

    SQL function: Michael Meierruth III - extra replace

    728

    SQL function: Michael Meierruth Original

    716

    SQL function: Michael Meierruth : with hard coded spaces

    660

    SQL function :Brigzy (C#Screw) Looping

    503

    SQL function: Jeff Original : single bell char

    444

    CLR: Michael Meierruth - C#Screw conversion: with extra Replace

    317

    CLR: Michael Meierruth - C#Screw conversion: hard coded spaces

    307

    CLR: using Jeff's single char technique in C#

    298

    CLR: Not looping using ||*9*9|| technique in C#

    268

    CLR: C#Screw II - Spaces and TABS

    256

    CLR: Brigzy (C#Screw) Looping

    246

    CLR: Flo - Spaces and TABS - v.compact

    232

    CLR :using JCBnew space replacer C#

    230

    CLR: using Paul White space replacer C#

    224

    CLR: Paul White's function, reduced by C#Screw with Pointers ideas from Mark-101232

    221

    CLR: Paul White's function with Pointers by Mark-101232

    214

    CLR: using Flo s space replacer C#

    210

    CLR: C#Screw : Slightly reduced Paul White's function and Added Support TABS

    205

    My own conclusions

    -----------------------------------

    I wouldn't like to explain some of the results above entirely - but it is nice to see Jeff has the fastest SQL, and interesting to see SQL looping just behind.

    The SQL looping does very well when there are no double spaces as it drops out on the first line of code. How Jeff's SQL can be faster than that I cannot explain.

    CLR should be at most least 2x faster; so only but considerably faster if there really is a hefty stack of looping/string work to do.

    So there we are folks, almost full circle 🙂

  • Great stuff 'C# screw' - awesome work on this thread 😎

  • TheSQLGuru (11/19/2009)


    Paul, I seem to recall other topics from the years whereby BIN collation was the way to go for efficiency reasons if that was your most important criteria.

    Oh absolutely - though I also seem to recall that the SQL collations are generally more efficient than the Windows collations on non-Unicode data - I think that's right.

    BIN surely makes a difference; in fact when using CHAR (to avoid the problem in the first place) the BIN collation is 5-15% faster than my default collation. I must admit I am less than clear how sort order affects nested REPLACEs, but at least REPLACE is documented as being collation-aware so there is some sense to it.

    The real issue though, is what the engine is doing when using VARCHAR with a non-BIN collation. Recall that with a BIN collation, Michael's method took 2 seconds, but with LATIN1_GENERAL_CI_AS the execution time blew out to 40 seconds! There is only a very small difference in the XML show plan (an extra non-implicit convert to VARCHAR(4000) directly on the column) but it is hard to see why that should make so much difference.

    It's almost as if there are optimizations which allow in-place string replacement which are defeated when using VARCHAR with collation conversions. I can only imagine that the strings are being copied in memory for each row (maybe once per REPLACE) rather than being modified in place. The sheer number of allocations and copies might go some way to explain the difference.

    I will take another quick look today, but unless something obvious turns up, or someone else out there has a brainwave, this is heading for Connect for sure.

    Paul

  • I am not sure that blowing out a varchar(4000) conversion for lord knows what is a 'minor plan difference'!! 🙂

    I bet the memory crap is happening for every iteration of the REPLACE (perhaps with the vc4K thing thrown in for good measure). That really could be some significant CPU/memory hits.

    Speaking of which, can the main tester for this give a report back on CPU utilization for the test runs? And if you feel frisky, check wait stats deltas for each run. I am wondering if we might not see something show up there indicating memory and/or cpu pressure.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (11/19/2009)


    I am not sure that blowing out a varchar(4000) conversion for lord knows what is a 'minor plan difference'!! 🙂

    Ha, yeah. No, I just meant that the XML show plan is visually very, very similar (the graphical plan appears identitical!) there's just one small extra node there...

    Just so everyone knows, my timing results have always been CPU-only. I will admit to being lazy and using SET STATISTICS TIME instead of total_worker_time / execution_count from sys.dm_exec_query_stats or Profiler which I normally prefer.

    Thanks for the other thoughts too!

    Paul

  • Paul White (11/19/2009)


    I must admit I am less than clear how sort order affects nested REPLACEs, but at least REPLACE is documented as being collation-aware so there is some sense to it.

    Of course - collation is important when comparing the string to replace. D'oh.

Viewing 15 posts - 166 through 180 (of 425 total)

You must be logged in to reply to this topic. Login to reply