REPLACE Multiple Spaces with One

  • brigzy (11/17/2009)


    Good advice.

    ? if u check out script ... it does gen table ...

    Can't really be bothered to go back that far. I'll take your word for it.

  • brigzy (11/17/2009)


    Good advice.

    ? if u check out script ... it does gen table ...

    brigzy, what do you want to here? I wrote sorry, since I misread the script of you both.

  • Florian Reischl (11/17/2009)


    brigzy (11/17/2009)


    Good advice.

    ? if u check out script ... it does gen table ...

    brigzy, what do you want to here? I wrote sorry, since I misread the script of you both.

    its all fine Flo, relates to P.White's comments, no worries at all.

    Fancy seeing a C++ version (extended sp)? I could run one up if interested ... 🙂

    Have a great evening all

    Cheers

  • As always an excellent article. Situations like this do make me wish that they would incorporate true Regex into the core engine, but of course I can also see how that would cause its own issues.

    ---
    Timothy A Wiseman
    SQL Blog: http://timothyawiseman.wordpress.com/

  • Sorry... post in progress and I accidently hit the "go" buttong too soon...

    {edit} Heh... "Buttong!!!!!" Sounds like a high velocity frozen pork chop doing it's job well at the time of impact...

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Sorry for the delay in my responses to all of this. Heh… you folks look like you had fun without me, though.

    I’ll split this response in two… first, my thanks to all who have responded with or without code. It’s been an interesting discussion and I’ll save my technical responses to the next post (hope I can get that next one out tonight. We’ll see.)

    I didn't answer some of the posts because some were pretty well answered. If I missed anyone, that's the only reason...

    John McC (11/16/2009)


    Another (very good) example of something that initially appears to be a horrendous problem having an elegant and relatively simple solution.

    Must add to briefcase 🙂

    Very nicely stated, John, and spot on intention wise. I'm always very honored especially when someone says "Must add to briefcase". Thank you, Sir.

    ChiragNS (11/16/2009)


    Nice one jeff.

    Many thanks, Chirag… like your signature by-line says, I keep trying. 😉 Glad to see that folks think so, as well.

    Matt Whitfield (11/16/2009)


    Jeff

    Nice article - I just hope that when I *finally* get round to writing mine then it's anywhere near one tenth as good as yours generally are...

    M

    Ah... thanks, Matt. Not to worry, though. When you say *finally*, it means that you're putting some thought into even if you might not have written anything down, yet. You'll not only do fine, but you'll blow the doors off that "one tenth" stuff you're talking about. 🙂 And thank you very much for all of the posts you handled on this thread… you made my job a heck of a lot easier.

    dant12 (11/16/2009)


    the article in question is for the sole purpose of avoiding this sort of approach.

    looping is bad, necessary evil sometimes but not always.

    Spot on Thiago on both accounts… and thank you, too, for all the posts you took the time to handle on this thread.

    ealter (11/16/2009)


    Excellent, clever "trick"! I hope you don't mind if I use your code! Question in regard to finding the best replacement character: wouldn't it be a simple matter to use a short 2-character replacement string (instead of the "X" in the example)? This would minimize even more the likelihood of clashing with real data. For example, using "|>" (pipe-greater than) would have a very low probability of clashing with real data, and the modification to the code would be trivial.

    It's always an honor to see someone's first post (especially one with an intelligent question) on one of my articles. I'll explain why using visible characters for this method aren't such a good idea in the technical post I have to follow this one with.

    Thank you for your post and "Welcome aboard".

    dsdeming (11/16/2009)


    Nice work, Jeff. Particularly the discussion of the potential gotchas in the choice of the "unlikely" character. It never would have occured to me that using something like the Thorn character could have such unforseen consequences.

    Dennis

    Thanks Dennis... especially for the observation on the potential gotchas. The gotcha on the Thorn character is pretty well documented on the internet but a lot of folks don't think to look for such a thing. I've not had the displeasure of being bit by it so I didn't know about, either. I just got lucky when on one of the forums where the subject came up. Wikipedia has a very nice writeup on the Thorn character. Please see the following...

    http://en.wikipedia.org/wiki/Thorn_(letter)

    ... things like the Thorn character/collation problem are one of the reasons why I use non-printable control characters instead of unlikely visible characters.

    Steve McRoberts-357330 (11/16/2009)


    Jeff, you've come up with a nice, simple method for doing this. It made me go back and look at how I had coded it in my application.

    ...

    I think I'll switch over to your method.

    Thanks!

    Immitation is the highest form of flattery... thank you much, Steve. And thanks for posting the code you made. Interesting take and I’ll have to take it for a test drive.

    Tom Garth (11/16/2009)


    How very elegant, and cool. It's so well explained that I know I'll remember it the next time I have to do some cleanup.

    Thanks Jeff!

    Thanks for the awesome compliment, Tom. The first time I saw someone "explain" how to clear up multiple spaces using a similar method, it had 4 REPLACEs and absolutly no explanation. Once I broke down what the code did, I said "Oh! Now why couldn't he have taken just a bit more time to explain how that worked for the folks new to SQL?" Heh... however... considering some of the other posts I've seen on this thread, I guess I just might have a bit more explaining to do. 😛

    Thanks again, Tom.

    Silverfox (11/16/2009)


    Good Article Jeff, appreciate you sharing it with the rest of us 😀

    Thanks for the compliment, SF... I appreciate you stopping by to read it.

    jonathan allen-270981 (11/16/2009)


    Jeff: thanks for the post, good to have the example online for reference

    Jonathan

    My thought exactly... online reference. Thanks for the compliment, Jonathan.

    I see some folks have already posted as to why you may not want to use the method you were kind enough to post. I'll try to explain a bit more on the "technical" post I'm working on.

    jcrawf02 (11/16/2009)


    My first thought was "why the intermediate step, why not just replace double occurrences of space with nothing?", but after testing it, it obviously doesn't work because you don't know if those are the only spaces. (my second thought was "Well, because Jeff said so, and he tests *everything*, but what the heck" ;-))

    Great article, because it prompted me to go play with the examples, and now I really understand why it works.

    Thanks again, Jeff!:-D

    Jon

    Heh... help prevent myths... never take anyone's word for it even though I'm truly humbled by your good thought. :blush: You (and, apparently, a few other good folks) have done exactly what I like to see and why I spend a little more time on code... I love it when people take the time to play with code especially if I'm fortunate enough for it to be my code. From what I've seen on this thread, I've got a couple of new things to play with before I write the "technical" post that I'll post after I get done with this one.

    Thanks again, Jon. I appreciate your thoughts.

    dhananjay.n.kulkarni (11/16/2009)


    Hi , Article is nice. I have little very quick one liner solution for this :

    declare @str_having_multiplespaces varchar(200)

    set @str_having_multiplespaces = 'hi onespace 2spaces 3spaces 25spaces'

    select replace(@str_having_multiplespaces , space(2) , space(0))

    --- REsult === >>> hi onespace2spaces 3spaces 25spaces

    Try this and let us know if this faster

    Thanks for stopping by, dhananjay, and I appreciate you taking the time to post some code. Did you happen to notice what it did to 2 spaces in your own post, though? The same will happen with any even number of spaces....

    --Scroll to the right to see the "24Spaces" entry

    declare @str_having_multiplespaces varchar(200)

    set @str_having_multiplespaces = 'hi onespace 2spaces 3spaces [font="Arial Black"]24[/font]spaces'

    select replace(@str_having_multiplespaces , space(2) , space(0))

    --- REsult === >>> hi onespace2spaces 3spaces 25spaces

    It is, however, a very interesting trick because it defies what REPLACE seems to be capable of doing. I'm going to have to play with that. Thanks.

    Steve Jones - Editor (11/16/2009)


    Nice job, Jeff. Learned a few things myself.

    Heh... it's always a good thing when the "boss" likes it and I can teach someone with your savy a new thing. Thanks, Steve.

    Of course, if you take a look at some of the posts people put up on this one, I’ve learned a couple of things myself.

    Madhivanan-208264 (11/16/2009)


    I think it is better not to depend on a single character. So IMO better use set of junk characters as shown here

    http://sqlblogcasts.com/blogs/madhivanan/archive/2007/12/06/squeeze-function.aspx

    Thanks for stopping by ol' friend. I do have to say, however, that one of the stipulations was to not expand the string at all. I believe Joe Celko posted about a similar method and he also states that it blows up on the very first replace if you happen to overrun the current datatype size.

    JTSash (11/16/2009)


    Jeff stated in the article that he didn't want to increase the size of the data.

    Excellent observation. Thanks for reading the article and for posting, Jason. A few other people also missed that very important point. Just a teaser on the "technical" post I'm writing... I believe that it's important to performance to NOT expand the size of the string especially if it requires the use of VARCHAR(MAX) (I may actually write an article on how the abuse of that beast can instantly slow your code down by a factor of 3 to 8).

    Anyway, thank you for jumping in, Jason.

    jwiner (11/16/2009)


    To solve this same problem in the past, I've created a CLR function or stored procedure to do a regular expression replace. You could also use the ...OA stored procedures to instantiate the VBScript Regular Expression Object do a regular expression replace.

    Thanks for stopping by. If you get the chance (I haven't read all the posts, yet, so apologies if you already did it), I'd love to see the code for you sp_OA* call and the VBScript.

    johan.vandamme (11/16/2009)


    One other point to consider is sentence endings. Normally there are two spaces at the end of sentences (for readability, etc., and yes, not everyone uses them. So a second character is necessary to "keep double spaces following a period".

    Hi johan... thanks for the suggestion. If it were me, I'd save that for a secondary update where you convert all occurances of period-space to period-space-space... It may even perform better due to the "Divide'n'Conquer" methodology.

    Either way, I'd test it for performance. Do you have some code in mind that you'd care to post?

    James Goodwin (11/16/2009)


    In order to alleviate (but not eliminate) the problem of choosing a delimiter, the final replace could replace Space+delimiter instead of just the delimiter. So as in the examples(using O for space and X for delimiter)

    Replace(Replace(replace(text,'OO','OX'),'XO',''),'OX','O')

    This way you can use as a delimiter any character that will never follow a space in the dataset which could expand your possibilities.

    --

    JimFive

    Hi Jim... Thanks for stopping by. It's a good suggestion. Just remember that you must also pick a delimiter the will never precede a space in the dataset due to the 2nd replace. Of course, if you pick something like the innocuous "bell" character, you don't have to worry about that, either.

    vopipari (11/16/2009)


    Excellent solution. I would have written a loop:

    while (charindex(' ', @myString) > 0) begin

    set @myString = replace( @myString, ' ', ' ');

    end;

    but your solution is nicer in that it can be used directly in a select statement with a result set.

    Exactly... that's one of the big reasons to avoid RBAR if you can. Hmmm... I wonder what CROSS APPLY would do with that function. I'll add it to my list of things to test. Thanks for the feedback, vopipari! 🙂

    Mark Eytcheson (11/16/2009)


    Thank you Jeff,

    Your solution you presented in the article makes so much sense and is very easy to follow. It will come in handy as I'm going through a massive data cleansing and move right now. Great article!

    Thanks for the compliment Mark especially since simplicity and clarity was the main aim. It's nice to see that I might have hit the mark for some folks.

    dhananjay.n.kulkarni (11/16/2009)


    " ||*9*9|| " solution is funtastic. Great job and thanks a lot .

    Ummmm... check it for performance against some of the other methods. Then try that method on a nearly full or full VARCHAR(8000) and see what happens. Then, to get around that problem, convert it to VARCHAR(MAX) and see what happens to performance, again.

    Michael Poppers (11/16/2009)


    😎 Thanks, Jeff!

    Thanks for stopping in and posting, Michael. I appreciate the feedback.

    TheSQLGuru (11/16/2009)


    Good solution Jeff, and well documented. However, I was a tad disappointed that you included the LTRIM(RTRIM(. That is unnecessary overhead that is unrelated to solving the stated problem (i.e. removal of multiple contiguous spaces). 😀

    Ah... good point, Kevin, and exactly correct. I'll have to remember to leave the TRIMs out when I do some summary testing of the different methods other folks put up on this thread... especially since most of them didn't include the TRIM's. 😀

    Thanks for reading the article and for the compliment.

    newjcb (11/16/2009)


    I starting writing this reply saying that this isn't very efficient at all...

    However, after I tested it with large strings (I used the HTML source of the article as the input string), I found it to be very efficient compared to my inline function implementation(s) which worked off the premise that the overhead of a function was better since I only went through the length of the string once in the loop. This premise was wrong, very wrong. The replace solution performed 4 times better than the slimmest while loop implementation I had.

    I will submit this finding: It performed even better when compiled as an inline-function

    The only limitation to this is there doesn't seem to be a way to clean other white space characters without using lots of replaces. I will have to keep my other cleaning function for now because it is designed to clean random text input and purge all non-standard text charcters. So it will reduce LF, CR, and Tabs to a single space, and then simply ignore any non-space character that doesn't fall between 33 and 126 in the ascii table while converting multiple spaces to a single space. I realize it would only take a couple replaces to get fix the whitespace characters, but filtering the non standard text charcters is not so easy.

    Absolutely correct... even "memory only" WHILE loops can be pretty tough on performance. Like any rule, there are the occasional exceptions but this isn't one of those.

    You are correct that the method in this article really has nothing to do with cleaning up other white space characters. I'll also say that we Americans do have to be careful about limiting code to the lower 126 ASCII characters in this global environment. Not sure I can pull it off the way I want, but I am playing around trying to build a high performance, T-SQL-only white space eliminator... wish me luck.

    Thanks for the post and thanks for the compliment. I appreciate it.

    brigzy (11/16/2009)


    Here are some results comparing SQL Query/SQL function/SQL C# CLR using a modified version of Thiago's script (100k records on my laptop):

    SQL Query : 4 minutes (2009-11-16 17:05:20.670 -> 2009-11-16 17:09:33.190)

    SQL Function : 1 min 33 seconds (2009-11-16 17:09:33.190 - > 2009-11-16 17:11:02.277)

    SQL CLR C# : 14 seconds (2009-11-16 17:11:02.277 - > 2009-11-16 17:11:16.077)

    Heh... it would have been really cool if you would have included the script from the article in that testing. 😛

    Joe Celko (11/16/2009)


    I posted this as a puzzle in on 2004-12-31 and had a lot of speculation about the use of Fibonacci numbers in nested REPLACE() calls. But the best version of what you have was to assume '<' and '>' do not appear in col_x.

    UPDATE Foobar

    SET col_x

    = REPLACE (

    REPLACE (

    REPLACE(col_x, SPACE(1), '<>'),

    '><', SPACE(0)),

    '<>', SPACE(1));

    This is due to a guy named Carnegie; Steve Kass found it in an old

    posting. The only problem is that it fails if the the first function call overflows the maximum string length. You might get errors, or truncation depending on your SQL.

    Ernst-Udo Wallenborn came up with am complete analysis for nested replacement, which have no overflow problems.

    Hi Joe... thanks for stopping by. I agree... that's part of the reason why I used the archaic artifact "bell" character. It avoids having to make a visible-character assumption and it has no overflow problem like some of the code on this thread.

    Nadrek (11/16/2009)


    Note that the OX style replacing, when X is multiple characters, theoretically can result in severe internal fragmentation, as the strings become longer (and thus fewer rows per page are allows... if the pages were already full, they have to split).

    Ummmm... maybe for some of the code posted on this thread but the whole goal of the article was to eliminate the spaces without making the strings any longer. The OX style replacing in the article causes no growth of the string.

    Also, understood on the nested replaces of 2 to 1 space up to 13 of those to get to 8192. That's one of the things I'm going to test on the side. I believe that you and G-Squared use similar methods.

    Thanks for stopping by to read the article and for posting.

    sjsubscribe (11/16/2009)


    In terms of speed, CLR version performs faster than built-in replace.

    ....

    Queries that used to take minutes to finish take only seconds now.

    I agree that RegEx Replace is probably one of the few places that I'd allow a CLR in my server for.

    However, on the second point... "it depends"... Matt Miller and I did a many of "drag races" a couple of years ago (I wish I could find that code again) and I came real close to beating the CLR's on a lot of the tests and did, in fact, beat some of them. I just don't want folks to mistakenly think that CLR's are some sort of code Nirvana because (again, "it depends") there are quite a few places (see other posts on this and other forums) where the T-SQL solution has either almost beaten or tied the CLR (possibly making the CLR not worth doing) or have blown the proverbial doors off the CLR.

    Florian Reischl (11/16/2009)


    @Jeff

    Nice article! Thanks for sharing. 🙂

    ...

    I just tried a simple CLR function and Jeff's way to clean data with one table and 100,000 rows to be cleaned.

    ...

    Both functions are almost equal independent how often I execute the script. Sometimes CLR is slightly faster, sometimes Jeff's function is faster.

    Greets

    Flo

    Hi Flo! I know you've been wicked busy on a project so let me say thanks for taking the time to stop by with some testing! Heh... I just got done saying that CLR's aren't necessarily the Nirvana that some folks think they are. It seems that most of the testing folks have done against the CLR (so far) have been against that "other" method and not mine. No wonder the CLR seems faster to them.:-P

    GSquared (11/16/2009)


    It's nice, but in this test, it loses by a large margin (on my desktop machine):

    You and another poster said about the same thing... I'm going to have to give it a try. Although it's not likely to have 8000 spaces (almost silly to even think that), I'll have to expand your good code example to 13 replaces so they're functionally identical so far as VARCHAR(8000) goes and see how that performs, as well.

    Thanks for stopping in with some code, Gus! 🙂

    newjcb (11/16/2009)


    I just got motivated and wrote two CLRs for this and tested on the same 2008 server from my above post with 100000 rows as per brigzy's example.

    Beginning execution loop

    Batch execution completed 100000 times.

    SQL:Looping

    11786

    SQL:

    3573

    SQL:CLRLOOP

    2650

    SQL:CLRONCE

    1873

    Heh... I love it. Keep that up and I won't have to do any summary testing. Thanks.

    To be clear though, which SQL did you use on the "SQL:" portion of this test.

    Michael Meierruth (11/16/2009)


    BUT Jeff's code seems to do this even with 8000 blanks.

    It's now up to the performance wizzards to declare who is the winner on one million records with one column containing 8000 blanks.

    Heh.. looks like we have plenty of good folks trying to figure that out. 🙂 Thanks for the feedback Michael... always appreciated.

    brigzy (11/16/2009)


    Sorry have been out 4 curry last few hours....:hehe:

    What are your system specs? Are you Express/Enterprise/Standard?

    Vista, SQL Standard 2005 SP 3 (local laptop install),

    Mem 2GBm (1.5GB in use), Intel Core Duo

    I still get the following using batch & C# CLR I posted earlier

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    2520

    SQL:

    6040

    SQL:CLR

    440

    I believe that you're using the wrong SQL code. I believe that you're using the multi-character replace instead of the code from the article and that's why it's taking 6,040 ms.

    Paul White (11/16/2009)


    I'm going to run out of 'stars' at this rate 😉

    Heh... thanks Paul. I also see that you've firmly jumped into the fray. Thanks for taking up the slack for me. And thanks for the stars... I'm saving them up for when you write an article 'cause I know it'll be good. 😉

    brigzy (11/16/2009)


    Taking the ||*9*9|| technique down to C# is sweet

    This brings the C# down to just one line of code.

    If the ||*9*9|| string was reduced it could be even faster..

    Here are the test results:

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    2556

    SQL:

    6433

    SQL:CLR looping

    443

    Ah... now I see. Why don't you try the code from the article, as well?

    Paul White (11/16/2009)


    brigzy (11/16/2009)[/b]


    Taking the ||*9*9|| technique down to C# is sweet!

    It seems pretty obvious that a .NET solution will usually be faster. What I liked about this article was the encouragement to think set-based and apply some old-school creative thinking to an old problem. No doubt a thorough treatment of which solution is 'best' (for some well-defined value of 'best') would make an interesting article.

    Paul

    Heh... they haven't tested against the code in the article. When Flo did, it was a draw.

    The rest of what you say is correct and thank you for that... think "set-based" and "Divide'n'Conquer".

    bornsql (11/16/2009)


    Hi how about this code whether this will work fine or not

    DECLARE @tmpstr2 Varchar(MAX)

    SELECT @tmpstr2=replace(OriginalString, char(10) + char(13), '') FROM @demo

    SELECT @tmpstr2=replace(@tmpstr2,' ','|')

    SELECT @tmpstr2=replace(@tmpstr2,'|','')

    SELECT @tmpstr2

    It doesn't work so well. Take a peek...

    DECLARE @tmpstr2 Varchar(MAX)

    DECLARE @OriginalString VARCHAR(MAX)

    SET @OriginalString = 'There are ten spaces after this which means there are ten space before this.'

    SELECT @tmpstr2=replace(@OriginalString, char(10) + char(13), '')

    SELECT @tmpstr2=replace(@tmpstr2,' ','|')

    SELECT @tmpstr2=replace(@tmpstr2,'|','')

    SELECT @tmpstr2

    Results...

    Therearetenspacesafterthiswhichmeanstherearetenspacebeforethis.

    brigzy (11/16/2009)


    Paul White (11/16/2009)


    brigzy (11/16/2009)


    Taking the ||*9*9|| technique down to C# is sweet!

    It seems pretty obvious that a .NET solution will usually be faster.

    Yes we we did clarify expected/obvious:

    We might have expected c# to have the day, but the Query/Function result is surprising.

    I think it is quite resonable and hopefully of general interest to discuss fastest C# technique on the subject. Seeing how simple the C# code is might also encourage pure T-SQL folk to venture C#.

    Heh... nope. You didn't. None of the testing you have done has included the code from the article. 😉 See Flo's test... the CLR is not worth it in this case.

    Pieter-423357 (11/17/2009)


    Pretty good trick, but do you really want to subject each row to the triple REPLACE function?

    REPLACE is a costly function and will run for a good while when you process millions of rows. I would add something to screen out entries with 2 or more spaces and perform action on those alone.

    Heh... thanks for the feedback Pieter... but like Bob said, see the WHERE clause in the article code...

    timothyawiseman (11/17/2009)


    As always an excellent article. Situations like this do make me wish that they would incorporate true Regex into the core engine, but of course I can also see how that would cause its own issues.

    Thanks for the feedback and the compliment, Timothy. I appreciate it.

    I agree... especially since so many folks are prone to doing a little "word processing" in SQL Server. 😀

    SDM (11/17/2009)


    Jeff: Neat. Whether it's the best on SQL Server or not is only partly the point (for me). I just like the different approach to the problem. Don't attack it directly, transform it to a smaller set of problems that _do_ lend themselves to a direct solution.

    To the CLR freaks: The OP stated it had to be SQL 2000.

    Heh... Thanks for the feedback... "Divide'n'Conquer" is one of my favorite methods. I'll also add that I'm not a big fan of CLR's because, as Flo's testing showed (someone FINALLY tested the code in article against a CLR), there are very few places where a CLR is actually better or where they have enough performance gain to actually justify maintaining the separate code written in a different language.

    That being said, be nice to the CLR folks... BWAA-HAA!!! They think us T-SQL-only folks are freaks. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Weeellllll.... that took a bit (a LOT!!!) longer than I thought. I'll try to get to some of my own testing on some of your contributions tomorrow night.

    I've got to say it again, well done, folks. There have been a lot of good contributions on this thread and although I saw someone starting to take aim because of a misunderstanding, I'm real happy to see that no food fights broke out on what could have been a very touchy thread. Thanks to you all for that. 🙂

    Flo... special thanks again on the test you did... maybe I just missed it, but it doesn't look like anyone else compared the code in the article to a CLR and that's what I was most curious about.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (11/17/2009)


    Weeellllll.... that took a bit (a LOT!!!) longer than I thought. I'll try to get to some of my own testing on some of your contributions tomorrow night.

    I've got to say it again, well done, folks. There have been a lot of good contributions on this thread and although I saw someone starting to take aim because of a misunderstanding, I'm real happy to see that no food fights broke out on what could have been a very touchy thread. Thanks to you all for that. 🙂

    Flo... special thanks again on the test you did... maybe I just missed it, but it doesn't look like anyone else compared the code in the article to a CLR and that's what I was most curious about.

    Building a replica of your code in C# and running tests for you now ....

  • brigzy (11/18/2009)


    Jeff Moden (11/17/2009)


    Weeellllll.... that took a bit (a LOT!!!) longer than I thought. I'll try to get to some of my own testing on some of your contributions tomorrow night.

    I've got to say it again, well done, folks. There have been a lot of good contributions on this thread and although I saw someone starting to take aim because of a misunderstanding, I'm real happy to see that no food fights broke out on what could have been a very touchy thread. Thanks to you all for that. 🙂

    Flo... special thanks again on the test you did... maybe I just missed it, but it doesn't look like anyone else compared the code in the article to a CLR and that's what I was most curious about.

    Building a replica of your code in C# and running tests for you now ....

    Jeff,

    as promised : all tests combined SQL & CLR

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    2536

    SQL: using ||*9*9||

    6310

    SQL: Jeffs single char

    2916

    CLR: looping

    453

    CLR: Not looping using ||*9*9|| technique in C#

    296

    CLR: Not looping using Jeff's single char technique in C#

    416 <-- consistently c# performs slower single space bell character, surprising?

    Correction : Noticed one space missing in c#

    CLR: Not looping using Jeff's single char technique in C#

    246

    I am really sorry the c# input has caused some upset, I thought it was really useful/interesting.. so 10/10 article & productive discussion.

    [SqlFunction()]

    public static String fn_TidySpace_SingleChar_CLR(string CleanMe)

    {

    return CleanMe.Trim().Replace(" ", " \a").Replace("\a ", "").Replace("\a", "");

    }

    SQL batch FYI

    --PREPARE

    SET NOCOUNT ON

    go

    CREATE FUNCTION dbo.fn_CleanUp(@FooString VARCHAR(max))

    RETURNS VARCHAR(max)

    BEGIN

    WHILE CHARINDEX(' ',@FooString) > 0

    SELECT @FooString = REPLACE(@FooString,' ',' ')

    RETURN @FooString

    END

    GO

    CREATE TABLE #TEMP1 (COL1 VARCHAR(900))

    CREATE TABLE #TEMP2 (COL2 VARCHAR(900), COL3 VARCHAR(900), COL4 VARCHAR(900),COL5 VARCHAR(900))

    go

    --INSERT 200k ROWS WITH RANDOM SPACES ON EACH TABLE, SEPARATE TABLES ARE USED TO AVOID CACHING, THIS MAY TAKE QUITE AWHILE

    DECLARE @SPACECOUNT1 INT,@SPACECOUNT2 INT,@SPACECOUNT3 INT,@SPACECOUNT4 INT

    SELECT @SPACECOUNT1 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT),@SPACECOUNT2 = CAST(CAST(NEWID() AS VARBINARY(1)) AS INT)

    INSERT INTO #TEMP1 (COL1)

    OUTPUT inserted.COL1 INTO #TEMP2 (COL2)

    SELECT 'TEST1'+SPACE(@SPACECOUNT1)+'TEST2'+SPACE(@SPACECOUNT2)+'TEST3'

    GO 10000

    --select * FROM #TEMP1

    --select * FROM #TEMP2

    --SELECTS

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL4= dbo.fn_CleanUp(COL2)

    FROM #TEMP2

    PRINT 'SQL:Looping'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    go

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL3 = LTRIM(RTRIM(

    REPLACE(REPLACE(REPLACE(COL2,' ',' ||*9*9||'),'||*9*9|| ',''),'||*9*9||','')

    ))

    FROM #TEMP2

    PRINT 'SQL: using ||*9*9||'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    GO

    DECLARE @TheTime DATETIME

    SELECT @TheTime= GETDATE()

    UPDATE #TEMP2 SET COL3 =

    REPLACE(

    REPLACE(

    REPLACE(

    LTRIM(RTRIM(COL2)) ,' ',' '+CHAR(7)) --Changes 2 spaces to the OX model

    ,CHAR(7)+' ','') --Changes the XO model to nothing

    ,CHAR(7),'') --Changes the remaining X's to nothing

    FROM #TEMP2

    PRINT 'SQL: Jeffs single char'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    go

    DECLARE @TheTime DATETIME

    SET @TheTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_Looping_CLR(COL2)

    FROM #TEMP2

    PRINT 'CLR: looping '

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    go

    DECLARE @TheTime DATETIME

    SET @TheTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_NotLooping_CLR(COL2)

    FROM #TEMP2

    PRINT 'CLR: Not looping using ||*9*9|| technique in C#'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    go

    DECLARE @TheTime DATETIME

    SET @TheTime=GETDATE()

    UPDATE #TEMP2 SET COL5= dbo.fn_TidySpace_SingleChar_CLR(COL2)

    FROM #TEMP2

    PRINT 'CLR: Not looping using Jeff''s single char technique in C#'

    PRINT DATEDIFF(ms,@TheTime,GETDATE())

    go

    --CLEANUP

    DROP FUNCTION dbo.fn_CleanUp

    DROP TABLE #TEMP1

    DROP TABLE #TEMP2

    GO

  • To the C# crew:

    This seems to be the fastest method so far (on my machine at least)...would one or more of you please try it out?

    using System.Data.SqlTypes;

    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions

    {

    /// <summary>

    /// The general plan here is to scan the input string left to right,

    /// writing the de-double-spaced result back into the same buffer

    /// as we go.

    ///

    /// While scanning, when we come across a space, we arrange to

    /// skip following spaces. Skipping forward means advancing the

    /// read position while not moving the write position pointer.

    ///

    /// Since the output string will always be less than or equal in

    /// length to the input string, this is safe to do.

    ///

    /// The idea is that manipulating the buffer directly (but still in safe

    /// managed code!) might be faster than the built-in string methods.

    ///

    /// The char[] buffer internal to a SqlChars object is mutable -

    /// unlike strings which are destroyed and re-created when changed.

    ///

    /// </summary>

    /// <param name="Input">The string to process, as SqlChars</param>

    /// <returns>The processed result, as a SqlString</returns>

    [SqlFunction

    (

    DataAccess = DataAccessKind.None,

    SystemDataAccess = SystemDataAccessKind.None,

    IsDeterministic = true,

    IsPrecise = true

    )

    ]

    [return: SqlFacet(MaxSize = 4000)]

    public static SqlString SpaceReplacer

    (

    [SqlFacet(IsNullable = false, MaxSize = 4000)] SqlChars Input

    )

    {

    const char SPACE = ' ';

    bool skipping = false; // In space-skipping mode

    char[] input = Input.Buffer; // Reference to the internal char[] buffer of the SqlChars input

    int length = input.Length; // Remember the length of the original

    int writePos = 0; // Initialise the writing position

    // Loop through each character

    // scan = current scan position

    // write = current write position

    for (int readPos = 0; readPos < length; readPos++)

    {

    // Skip spaces if we are in 'skip mode' (see later)

    if (skipping)

    {

    // Space skipping

    while (readPos < length && input[readPos] == SPACE) { readPos++; }

    // Bail if we ran out of string

    if (readPos >= length) { break; }

    }

    // If the write position is before the scan position,

    // the output string is no longer in sync with the input

    // (we must have skipped spaces at some point).

    // So, we need to copy the current character

    // from the read position to the write position

    if (writePos < readPos) { input[writePos] = input[readPos]; }

    // Advance the write position

    // (for all non-spaces, and for the first space in a sequence of spaces)

    writePos++;

    // Set or reset skip mode

    skipping = (input[readPos] == SPACE);

    }

    // Return a string constructed from the part of

    // the buffer we actually wrote to

    return new SqlString(new string(input, 0, writePos));

    }

    };

  • Hey Jeff,

    The performance of the code in the article (unless I have really messed up) seems very dependent on collation. For a long time yesterday, I couldn't understand why your routine was so slow on my machine.

    Anyway, words are just words, so here's the test rig (sorry I am rushed tonight, so it's not fantastic):

    -- Test table drop/create

    IF OBJECT_ID(N'tempdb..#TestCleanSpaces', N'U') IS NOT NULL DROP TABLE #TestCleanSpaces;

    GO

    CREATE TABLE #TestCleanSpaces

    (

    -- Note this use of VARCHAR (not NVARCHAR) benefits T-SQL

    -- since CLR routines have to work with Unicode.

    -- The CLR implementations will have to convert every input string...!

    --

    -- Limited to 4000 characters to avoid skewing the test

    -- completely by forcing the CLR routines to accept

    -- and return MAX data types.

    Data VARCHAR(4000) /*COLLATE LATIN1_GENERAL_BIN*/ NOT NULL -- also try LATIN1_GENERAL_CI_AS and SQL_LATIN1_GENERAL_CP1_CI_AS

    );

    GO

    -- 100K rows of test data

    ;WITH Random (r1, r2, r3)

    AS (

    SELECT

    ABS(CHECKSUM(NEWID())),

    ABS(CHECKSUM(NEWID())),

    ABS(CHECKSUM(NEWID()))

    ),

    Numbers (Num)

    AS (

    SELECT

    TOP (100000)

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM

    sys.allocation_units N1,

    sys.allocation_units N2,

    sys.allocation_units N3,

    sys.allocation_units N4

    )

    INSERT

    #TestCleanSpaces (Data)

    SELECT

    REPLICATE(LEFT(CONVERT(CHAR(36), NEWID()), r1 % 36) + SPACE(r2 % 20), r3 % 20)

    FROM

    Numbers N

    CROSS

    JOIN

    Random;

    GO

    -- The test run

    DECLARE

    @Bitbucket VARCHAR(4000),

    @BitBucketUnicode NVARCHAR(4000);

    SET STATISTICS TIME ON

    -- My .NET function

    SELECT

    @BitBucketUnicode = dbo.SpaceReplacer(Data)

    FROM

    #TestCleanSpaces;

    SELECT

    @Bitbucket = REPLACE(REPLACE(REPLACE(Data,' ',' ' + CHAR(7)),CHAR(7) + ' ',''),CHAR(7),'')

    FROM

    #TestCleanSpaces

    WHERE

    CHARINDEX(' ', Data) > 0;

    SET STATISTICS TIME OFF

    GO

    -- Tidy up

    DROP TABLE #TestCleanSpaces;

    Without using a COLLATE clause on the temporary table, I get the following (100K rows):

    T-SQL: 21,219 ms

    CLR: 557 ms

    Using LATIN1_GENERAL_CI_AS or SQL_LATIN1_GENERAL_CP1_CI_AS, I get about the same: T-SQL always over 20 seconds, CLR 550-625 ms!

    Using LATIN1_GENERAL_BIN I get:

    T-SQL: 2,042 ms

    CLR: 578 ms

    I'm going to say that the optimizer can use a few tricks to minimize the number of times that the string is copied when using multiple REPLACEs, but only if some conditions are met - primarily it seems on collation. It'd be great if you could confirm this behaviour on your rig...? Interesting, no?

    (Just comment out the CLR function call :-))

    Paul

    P.S. Thanks to Flo for the basis of the rig

  • With Pauls space replacer :

    Beginning execution loop

    Batch execution completed 10000 times.

    SQL:Looping

    2546

    SQL: using ||*9*9||

    6196

    SQL: Jeffs single char

    2943

    CLR: looping

    456

    CLR: Not looping using ||*9*9|| technique in C#

    293

    CLR: Not looping using Jeff's single char technique in C#

    250

    CLR: using Paul White space replacer C#

    110

  • Thank you 'brigzy'...you seem to have a much faster machine than my old laptop!

  • You welcome dude I prefer brigzy to c# scew! :-):-)

    Reading through your c# very nice

  • brigzy (11/18/2009)


    You welcome dude I prefer brigzy to c# scew! :-):-)

    Reading through your c# very nice

    oops! I misread, my mind I read your post as C# 'Screw', and I wrote as 'C# scew'!!

    Both pretty close to truth anyway!

    :-):

    In fact that is so amusing I have changed my name on here from brigzy to c# Screw 😀

Viewing 15 posts - 91 through 105 (of 425 total)

You must be logged in to reply to this topic. Login to reply