Beware the REPLACE function when doing heavy string parsing


I found something interesting the other day. I was attempting to optimize a very heavy string-parsing routine in T-SQL and was having problems. I don't normally use T-SQL for such heavy string parsing, but this was a special case involving a legacy structure I had to work with, so I had no choice. Anyway, the routine in question was a table function that basically takes a fixed-length text file and turns it into a table (please don't ask why) so you can join it to other tables in a database. The thing that made this interesting is that the fixed-length text file had over 4,000 rows, so to turn it into a table the function has to slice and dice those 4,000 rows every time it's executed, which it does with a WHILE loop.

When I finished this function, it ran in about 15 seconds, which was way too slow for our needs. Unbeknownst to me, one of my colleagues was attempting the same thing and wrote his code in a slightly different way, but he got it to run in about 1 second. When I analyzed his code, I found it to be virtually identical to mine, with one difference: during every iteration I was using the REPLACE function to whittle down the string until none of it was left, at which point the WHILE loop would exit and the function would finish executing. In contrast, his approach did not use REPLACE at all; he simply navigated forward in the string until he reached the end.
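To make the difference concrete, here is a stripped-down sketch of the pattern my version was using. The real function isn't reproduced here, so the variable names are invented and the line handling is simplified; it also assumes every line, including the last, ends in a CR+LF pair.

DECLARE @FileText VARCHAR(MAX), @Line VARCHAR(500);

-- A tiny stand-in for the 4,000-row fixed-length file
SET @FileText = 'ROW1FIELDAFIELDB' + CHAR(13) + CHAR(10)
              + 'ROW2FIELDAFIELDB' + CHAR(13) + CHAR(10);

WHILE LEN(@FileText) > 0
BEGIN
    -- Grab everything up to the first line break
    SET @Line = LEFT(@FileText, CHARINDEX(CHAR(13) + CHAR(10), @FileText) - 1);

    -- ... parse @Line and insert it into the result table ...

    -- The expensive part: REPLACE rescans and rebuilds the entire remaining
    -- string on every iteration just to remove the line that was parsed
    -- (and it would also remove any later duplicate of that line).
    SET @FileText = REPLACE(@FileText, @Line + CHAR(13) + CHAR(10), '');
END;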

At this point one of my very special mental "talents" kicked in: when my mind is stuck on a problem and someone ELSE tells me the solution, my mind immediately says "oh, I knew that!" and proceeds to give me a detailed explanation of the solution and why it works. It leaves you wondering where all that great information was hiding 5 seconds ago, when you didn't have the answer and really needed it... you have to wonder why such useless "talents" even exist....

So, of course, once I saw this everything became obvious: my version of the function was doing over 4,000 REPLACE operations and his wasn't. On top of that, I have to wonder whether the string data types, such as the VARCHAR type we were both using, are implemented as immutable data types in SQL Server. Does anyone know if this is the case? If it is, it would further explain the difference in speed. Since an immutable data type doesn't really 'change' the instance of the string in memory but instead creates a new instance, my function would have been creating over 4,000 in-memory copies of the large string I was parsing. Talk about a waste!

I mention this because lots of SQL developers may not have much experience in string parsing, since this is normally not done using SQL.  This may be old hat to many application programmers, but it’s not something that DBAs run into as frequently, so I’m hoping that others will find it as useful as I did.

So, the moral of the story is: be careful when using REPLACE in heavy string-parsing operations. Like many other things, if you need it you should use it, but the trick is in knowing when you REALLY need it. In my case I thought I needed it, but in reality there was a much simpler solution available, and it ended up running 15 times faster. It proves the old adage of "Keep It Simple, Stupid": given two equivalent solutions, the simpler one is better.

…and that’s my $.02 for today….

SB

Comments

Posted by MitchellT on 27 August 2009

Great article, but the suspense is killing me - what did the other developer do to get around using REPLACE? Not sure if you're trying to avoid giving code samples in an article/editorial, but that's kind of the whole reason I'm here: to learn a better way to do things with side-by-side examples.

Thanks for any examples!

Posted by vg on 27 August 2009

teixeim, he said "he was simply navigating forward in the string until he reached the end", so my guess is he was probably using an int to capture where he was in the string and taking substrings inside the WHILE loop. That way he would only be using a single large string that he searched through, instead of creating a new string with the old data "REPLACED" each time.

Sam, thanks for the tip. So, you don't know for sure if VARCHAR is immutable? It sure sounds like it, but I'd like to know for sure as well. Can any MSFT people out there confirm?

Posted by Ray Hastie on 27 August 2009

teixeim - check out the SUBSTRING function. If you have a fixed line definition for the fields, you know that field A starts at position X and is Y characters long. The rest of the code depends on the circumstances.
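For example, if a (made-up) layout put an account number in positions 1-10, an amount in positions 11-22 and a status flag in positions 23-29, SUBSTRING pulls the fields straight out of each line:

DECLARE @Line VARCHAR(500);
SET @Line = 'ACCT001234      125.99ACTIVE ';

SELECT
    RTRIM(SUBSTRING(@Line,  1, 10)) AS AccountNumber,  -- positions 1-10
    RTRIM(SUBSTRING(@Line, 11, 12)) AS Amount,         -- positions 11-22
    RTRIM(SUBSTRING(@Line, 23,  7)) AS StatusFlag;     -- positions 23-29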

Posted by Jeff Moden on 27 August 2009

If you'd be willing to share the file and some information about how it's to be parsed, I'd be very happy to show you how to do it WITHOUT a While loop and then you can do another comparison...
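For anyone curious what "WITHOUT a While loop" tends to look like, a common set-based alternative is a tally (numbers) table splitter along these lines. The names here are invented, and it assumes SQL Server 2008 or later (for the VALUES row constructor) plus CR+LF line endings:

DECLARE @FileText VARCHAR(MAX);
SET @FileText = 'LINE ONE  ' + CHAR(13) + CHAR(10)
              + 'LINE TWO  ' + CHAR(13) + CHAR(10)
              + 'LINE THREE';

WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS t(N)),
     E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),   --    100 rows
     E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),   -- 10,000 rows
     E8(N) AS (SELECT 1 FROM E4 a CROSS JOIN E4 b),   -- plenty of rows; TOP below caps it
     Tally(N) AS (SELECT TOP (DATALENGTH(@FileText))
                         ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
                  FROM E8)
SELECT LineText = SUBSTRING(@FileText, d.N + 2,
                    ISNULL(NULLIF(CHARINDEX(CHAR(13) + CHAR(10), @FileText, d.N + 2), 0),
                           DATALENGTH(@FileText) + 1) - (d.N + 2))
FROM (SELECT -1 AS N                      -- a virtual delimiter before the first line
      UNION ALL
      SELECT N FROM Tally
      WHERE SUBSTRING(@FileText, N, 2) = CHAR(13) + CHAR(10)) AS d
ORDER BY d.N;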

Posted by David Lean on 27 August 2009

I speculate that a SQLCLR function may be faster still. If you are looking for a separator character, e.g. ",", then the .NET string.Split method is extremely fast. If it's more complex, look at the Regex class; it is often able to separate an entire row of tens of fields in one call. Use something like Expresso 3.0 to help you learn how to write powerful regex expressions.
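For reference, once the .NET code is compiled into an assembly, the T-SQL side of wiring up such a CLR splitter looks roughly like this. The assembly, class and method names below are purely hypothetical, and CLR integration has to be enabled first (EXEC sp_configure 'clr enabled', 1; RECONFIGURE;):

-- Register the compiled assembly (name and path are hypothetical)
CREATE ASSEMBLY StringUtilities
FROM 'C:\Assemblies\StringUtilities.dll'
WITH PERMISSION_SET = SAFE;
GO

-- Expose the CLR table-valued splitter to T-SQL
CREATE FUNCTION dbo.clr_SplitString (@Input NVARCHAR(MAX), @Delimiter NCHAR(1))
RETURNS TABLE (ItemNumber INT, Item NVARCHAR(4000))
AS EXTERNAL NAME StringUtilities.[StringUtilities.Splitter].SplitString;
GO

-- Then use it like any other table function
SELECT ItemNumber, Item
FROM dbo.clr_SplitString(N'a,b,c', N',');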

Posted by YSLGuru on 27 August 2009

Sam,

Not sure if this has any bearing on your experience, but one interesting note about the T-SQL REPLACE() function that is not documented in Books Online, and that spawned an interesting debate when I posted about it on the SQLServerCentral.com forums, is that the function will always return a value (CHAR or VARCHAR, depending on what you pass to it) that is equal in length to the maximum length allowed for that data type, no matter how many characters are in the variable you pass to it.

For example, if you pass in the text 'ThisIsMyString' as a VARCHAR and tell it to replace 'My' with 'Your', the value returned by the function isn't just 'ThisIsYourString'; instead it returns 'ThisIsYourString' followed by as many space characters as are needed to pad the value up to the maximum length allowed for the data type.

So if you are passing in a string that is only 256 characters long, you are getting back, each time, a much bigger chunk of text. What makes this hard to notice is that Management Studio trims away those excess spaces for you, so you don't realize that what you see is not what was returned by REPLACE().

I first reported this as a bug, since BOL leads you to believe the value the function returns is equal in length to what is passed in. When that was rejected, I tried to at least get Microsoft to change BOL and make explicit note of this difference (the other string functions do not work like this), but that was also struck down.
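A quick way to test this on your own server is to wrap the result in DATALENGTH(), which reports exactly how many bytes came back regardless of how Management Studio displays the value:

DECLARE @fixed    CHAR(256),
        @variable VARCHAR(256);

SET @fixed    = 'ThisIsMyString';   -- CHAR is fixed length, so this value is space-padded to 256 characters
SET @variable = 'ThisIsMyString';

SELECT DATALENGTH(REPLACE(@fixed,    'My', 'Your')) AS BytesBackFromCharInput,
       DATALENGTH(REPLACE(@variable, 'My', 'Your')) AS BytesBackFromVarcharInput;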

Posted by ozkary on 27 August 2009

Another approach would be to convert the string into XML, which is faster than an alternative like a recursive CTE.
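Roughly like this: swap the CR+LF delimiters for XML element boundaries in a single pass and let the nodes() method do the splitting. The sketch below assumes the data contains nothing that needs XML escaping (no &, < or > characters); note that it still uses REPLACE, but only once over the whole string rather than once per line:

DECLARE @FileText VARCHAR(MAX), @x XML;

SET @FileText = 'LINE ONE' + CHAR(13) + CHAR(10)
              + 'LINE TWO' + CHAR(13) + CHAR(10)
              + 'LINE THREE';

-- Wrap every line in <r>...</r> elements by converting the delimiters
SET @x = CAST('<r>' + REPLACE(@FileText, CHAR(13) + CHAR(10), '</r><r>') + '</r>' AS XML);

SELECT LineText = r.value('.', 'VARCHAR(500)')
FROM @x.nodes('/r') AS t(r);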

Posted by 12String on 27 August 2009

As an experienced C, C++, and C# developer, I can understand the original mistake SB made in the design. In addition, Microsoft doesn't help by omitting important information from BOL or failing to update it, which leads me to the mantra I usually follow with BOL: Read, Verify, Test, and Improve through Code Refactoring!

Thanks for the insight!

Posted by Sam Bendayan on 27 August 2009

teixeim: I was parsing the string line by line and using REPLACE to get rid of the line I had just parsed. The other (smarter) developer would just go to the end of the line and then jump 2 characters (to get past the CRLF) so he would end up at the beginning of the next line.
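Since a few people asked for code, here is a minimal sketch of that faster forward-navigation loop. The variable names are invented and it assumes CR+LF line endings:

DECLARE @FileText VARCHAR(MAX), @Line VARCHAR(500);
DECLARE @Pos INT, @NextBreak INT, @TextLen INT;

SET @FileText = 'ROW1FIELDAFIELDB' + CHAR(13) + CHAR(10)
              + 'ROW2FIELDAFIELDB';
SET @Pos = 1;
SET @TextLen = DATALENGTH(@FileText);

WHILE @Pos <= @TextLen
BEGIN
    -- Find the end of the current line (or the end of the string for the last line)
    SET @NextBreak = CHARINDEX(CHAR(13) + CHAR(10), @FileText, @Pos);
    IF @NextBreak = 0 SET @NextBreak = @TextLen + 1;

    SET @Line = SUBSTRING(@FileText, @Pos, @NextBreak - @Pos);

    -- ... parse @Line and insert it into the result table ...

    -- Jump 2 characters past the CR+LF to the start of the next line;
    -- the big string itself is never rebuilt, which is the whole point.
    SET @Pos = @NextBreak + 2;
END;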

SB

Posted by sjsubscribe on 27 August 2009

By selectively refactoring most string functions to CLR, I was able to reduce a job that took 40 minutes down to a few seconds. Replace may not be the only culprit, but certainly an important one. The biggest speed boost I got was by using split in CLR.

Posted by Jon Russell on 31 August 2009

Good article. Thanks, Sam.
