|
|
|
SSChampion
        
Group: General Forum Members
Last Login: Wednesday, May 08, 2013 10:33 AM
Points: 10,989,
Visits: 10,529
|
|
A quick stand-alone repro for the collation thing:
USE tempdb; GO CREATE TABLE dbo.SpaceTest ( Data VARCHAR(4000) NOT NULL ); GO INSERT dbo.SpaceTest SELECT TOP (10000) SPACE(4000) FROM master.sys.allocation_units A1, master.sys.allocation_units A2, master.sys.allocation_units A3, master.sys.allocation_units A4; GO DECLARE @BitBucket VARCHAR(4000);
SET STATISTICS TIME ON
-- SLOW SELECT @BitBucket = --dbo.fn_CleanUp_MichaelMeierruth_MKII(Data) REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Data COLLATE LATIN1_GENERAL_CI_AS, ' ',' '), ' ',' '), ' ',' '), ' ',' '), ' ',' '), ' ',' '), ' ',' ') FROM dbo.SpaceTest; -- FAST SELECT @BitBucket = --dbo.fn_CleanUp_MichaelMeierruth_MKII(Data) REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(Data COLLATE LATIN1_GENERAL_BIN, ' ',' '), ' ',' '), ' ',' '), ' ',' '), ' ',' '), ' ',' '), ' ',' ') FROM dbo.SpaceTest;
SET STATISTICS TIME OFF;
DROP TABLE dbo.SpaceTest;
The other odd thing is that if I change the data type in the table from VARCHAR(4000) to CHAR(4000), the problem goes away. I'm convincing myself that this is either an optimizer bug, or some effect of the complex collation precedence rules (coercible-default and all that).
Paul
Paul White SQL Server MVP SQLblog.com @SQL_Kiwi
|
|
|
|
|
Hall of Fame
       
Group: General Forum Members
Last Login: Today @ 7:38 AM
Points: 3,573,
Visits: 5,109
|
|
Paul, I seem to recall other topics from the years whereby BIN collation was the way to go for efficiency reasons if that was your most important criteria.
Best,
Kevin G. Boles SQL Server Consultant SQL MVP 2007-2012 TheSQLGuru at GMail
|
|
|
|
|
SSC Veteran
      
Group: General Forum Members
Last Login: Wednesday, May 08, 2013 3:35 AM
Points: 224,
Visits: 629
|
|
|
|
|
|
SSC Veteran
      
Group: General Forum Members
Last Login: Wednesday, May 08, 2013 3:35 AM
Points: 224,
Visits: 629
|
|
Hi I have added tab support to Michael's code by initially replacing tab with a space, which might not be best aproach.
Anyhow here is the code and the leader board:
CREATE FUNCTION dbo.fn_CleanUp_MichaelMeierruth_MKIV(@S VARCHAR(8000)) RETURNS VARCHAR(8000) BEGIN RETURN replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(replace(@s,CHAR(9),' '))), ' ',' '), ' ',' '), ' ',' '), ' ',' '), ' ',' '), ' ',' '), ' ',' ') END It has however reduced performance, maybe there is a better way to achieve this in SQL.
SQL function: Replace Technique : but using replacement chars ||*9*9|| 2785 SQL function :Brigzy (C#Screw) Looping 2430 SQL function: Jeff Original : single bell char 1958 SQL function: Michael Meierruth : with hard coded spaces and TAB support 484 CLR: Brigzy (C#Screw) Looping 393 CLR: Not looping using ||*9*9|| technique in C# 383 SQL function: Michael Meierruth : with hard coded spaces 324 SQL function: Michael Meierruth Original 314 SQL function: Michael Meierruth III - extra replace 312 CLR: using Jeff's single char technique in C# 282 CLR: C#Screw II - Spaces and TABS 229 CLR: Flo - Spaces and TABS - v.compact 175 CLR: Michael Meierruth - C#Screw conversion: hard coded spaces 171 CLR: Michael Meierruth - C#Screw conversion: with extra Replace 171 CLR :using JCBnew space replacer C# 167 CLR: using Flo s space replacer C# 140 CLR: using Paul White space replacer C# 114 CLR: C#Screw : Slightly reduced Paul White's function and Added Support TABS 112 CLR: Paul White's function, reduced by C#Screw with Pointers ideas from Mark-101232 112 CLR: Paul White's function with Pointers by Mark-101232 104
C# Gnu ____________________________________________________
Excel Database Tasks V2 .. Multi user data edit Task based solution - Excel with SQL Server / Access / MySQL ..
Version 2.5 released - See Blog
|
|
|
|
|
SSC Veteran
      
Group: General Forum Members
Last Login: Wednesday, May 08, 2013 3:35 AM
Points: 224,
Visits: 629
|
|
Hang on a minute !!! 
I realised that doing all replacements / loops whatever in the real world could well be a waste of time.
This is because it is surely more lilely that the data doesn't include any double spaces! which might be true of say 95% of your data?
So we only want to go to each level of replace if we really neeed to ... so recursion seems good:
CREATE FUNCTION dbo.fn_CleanUp_Recursion_Brigzy(@Data varchar(max)) RETURNS VarChar(Max) AS BEGIN IF CHARINDEX(' ',@Data) > 0 BEGIN SET @Data = REPLACE(@Data,' ',' ') IF CHARINDEX(' ',@Data) > 0 SELECT @Data = dbo.fn_CleanUp_Recursion_Brigzy(@Data) END RETURN @Data END Unfortuanately our test data is FULL of double spaces so this doesn't appear high in the leader board:
SQL function: Jeff Original : single bell char 5279 SQL function: C#Screw : Recursive 2636 SQL function: Replace Technique : but using replacement chars ||*9*9|| 2611 SQL function :Brigzy (C#Screw) Looping 2514 SQL function: Michael Meierruth : with hard coded spaces and TAB support 391 SQL function: Michael Meierruth Original 318 CLR: Brigzy (C#Screw) Looping 313 SQL function: Michael Meierruth III - extra replace 306 SQL function: Michael Meierruth : with hard coded spaces 305 CLR: Not looping using ||*9*9|| technique in C# 280 CLR: using Jeff's single char technique in C# 236 CLR: C#Screw II - Spaces and TABS 189 CLR: Flo - Spaces and TABS - v.compact 171 CLR :using JCBnew space replacer C# 167 CLR: Michael Meierruth - C#Screw conversion: with extra Replace 156 CLR: Michael Meierruth - C#Screw conversion: hard coded spaces 150 CLR: C#Screw : Slightly reduced Paul White's function and Added Support TABS 116 CLR: Paul White's function, reduced by C#Screw with Pointers ideas from Mark-101232 112 CLR: using Flo s space replacer C# 108 CLR: using Paul White space replacer C# 106 CLR: Paul White's function with Pointers by Mark-101232 102
However if test data contained allot less double spaces this could look very different?

What do you think?
Edit: Anyone got some real data to try it on?
C# Gnu ____________________________________________________
Excel Database Tasks V2 .. Multi user data edit Task based solution - Excel with SQL Server / Access / MySQL ..
Version 2.5 released - See Blog
|
|
|
|
|
SSC Veteran
      
Group: General Forum Members
Last Login: Wednesday, May 08, 2013 3:35 AM
Points: 224,
Visits: 629
|
|
Hi folks, Well this is the 'last post' from me but this is what things look like if there are no double spaces in the data at all:
SQL function: Michael Meierruth : with hard coded spaces and TAB support 863 SQL function: Replace Technique : but using replacement chars ||*9*9|| 858 SQL function: C#Screw : Recursive 837 SQL function: Michael Meierruth III - extra replace 728 SQL function: Michael Meierruth Original 716 SQL function: Michael Meierruth : with hard coded spaces 660 SQL function :Brigzy (C#Screw) Looping 503 SQL function: Jeff Original : single bell char 444 CLR: Michael Meierruth - C#Screw conversion: with extra Replace 317 CLR: Michael Meierruth - C#Screw conversion: hard coded spaces 307 CLR: using Jeff's single char technique in C# 298 CLR: Not looping using ||*9*9|| technique in C# 268 CLR: C#Screw II - Spaces and TABS 256 CLR: Brigzy (C#Screw) Looping 246 CLR: Flo - Spaces and TABS - v.compact 232 CLR :using JCBnew space replacer C# 230 CLR: using Paul White space replacer C# 224 CLR: Paul White's function, reduced by C#Screw with Pointers ideas from Mark-101232 221 CLR: Paul White's function with Pointers by Mark-101232 214 CLR: using Flo s space replacer C# 210 CLR: C#Screw : Slightly reduced Paul White's function and Added Support TABS 205
My own conclusions ----------------------------------- I wouldn't like to explain some of the results above entirely - but it is nice to see Jeff has the fastest SQL, and interesting to see SQL looping just behind.
The SQL looping does very well when there are no double spaces as it drops out on the first line of code. How Jeff's SQL can be faster than that I cannot explain.
CLR should be at most least 2x faster; so only but considerably faster if there really is a hefty stack of looping/string work to do.
So there we are folks, almost full circle
C# Gnu ____________________________________________________
Excel Database Tasks V2 .. Multi user data edit Task based solution - Excel with SQL Server / Access / MySQL ..
Version 2.5 released - See Blog
|
|
|
|
|
SSChampion
        
Group: General Forum Members
Last Login: Wednesday, May 08, 2013 10:33 AM
Points: 10,989,
Visits: 10,529
|
|
|
|
|
|
SSChampion
        
Group: General Forum Members
Last Login: Wednesday, May 08, 2013 10:33 AM
Points: 10,989,
Visits: 10,529
|
|
TheSQLGuru (11/19/2009) Paul, I seem to recall other topics from the years whereby BIN collation was the way to go for efficiency reasons if that was your most important criteria. Oh absolutely - though I also seem to recall that the SQL collations are generally more efficient than the Windows collations on non-Unicode data - I think that's right.
BIN surely makes a difference; in fact when using CHAR (to avoid the problem in the first place) the BIN collation is 5-15% faster than my default collation. I must admit I am less than clear how sort order affects nested REPLACEs, but at least REPLACE is documented as being collation-aware so there is some sense to it.
The real issue though, is what the engine is doing when using VARCHAR with a non-BIN collation. Recall that with a BIN collation, Michael's method took 2 seconds, but with LATIN1_GENERAL_CI_AS the execution time blew out to 40 seconds! There is only a very small difference in the XML show plan (an extra non-implicit convert to VARCHAR(4000) directly on the column) but it is hard to see why that should make so much difference.
It's almost as if there are optimizations which allow in-place string replacement which are defeated when using VARCHAR with collation conversions. I can only imagine that the strings are being copied in memory for each row (maybe once per REPLACE) rather than being modified in place. The sheer number of allocations and copies might go some way to explain the difference.
I will take another quick look today, but unless something obvious turns up, or someone else out there has a brainwave, this is heading for Connect for sure.
Paul
Paul White SQL Server MVP SQLblog.com @SQL_Kiwi
|
|
|
|
|
Hall of Fame
       
Group: General Forum Members
Last Login: Today @ 7:38 AM
Points: 3,573,
Visits: 5,109
|
|
I am not sure that blowing out a varchar(4000) conversion for lord knows what is a 'minor plan difference'!! :)
I bet the memory crap is happening for every iteration of the REPLACE (perhaps with the vc4K thing thrown in for good measure). That really could be some significant CPU/memory hits.
Speaking of which, can the main tester for this give a report back on CPU utilization for the test runs? And if you feel frisky, check wait stats deltas for each run. I am wondering if we might not see something show up there indicating memory and/or cpu pressure.
Best,
Kevin G. Boles SQL Server Consultant SQL MVP 2007-2012 TheSQLGuru at GMail
|
|
|
|
|
SSChampion
        
Group: General Forum Members
Last Login: Wednesday, May 08, 2013 10:33 AM
Points: 10,989,
Visits: 10,529
|
|
TheSQLGuru (11/19/2009) I am not sure that blowing out a varchar(4000) conversion for lord knows what is a 'minor plan difference'!! :) Ha, yeah. No, I just meant that the XML show plan is visually very, very similar (the graphical plan appears identitical!) there's just one small extra node there...
Just so everyone knows, my timing results have always been CPU-only. I will admit to being lazy and using SET STATISTICS TIME instead of total_worker_time / execution_count from sys.dm_exec_query_stats or Profiler which I normally prefer.
Thanks for the other thoughts too!
Paul
Paul White SQL Server MVP SQLblog.com @SQL_Kiwi
|
|
|
|