REPLACE Multiple Spaces with One

  • Michael Meierruth

    SSCrazy Eights

    Points: 9991

    Steven Willis (5/26/2013)


    I'm late to this party, but here's my entirely TSQL offering to remove duplicate spaces. Most of the code below has been not so shamelessly borrowed from Jeff Moden's work, especially the famous DelimitedSplit8K function which I have found to be one of the most useful tools in the toolbox.

    I have no idea how this compares to the other solutions but I thought I'd just throw it into the mix. It's based mostly off of some other variations I've developed from the DelimitedSplit8K function and I think it takes a different approach to the problem. Basically, it splits the string and then puts it back together again. Whether that's better or not I'll let the scorekeeper determine.

     

    OK. Yours is 6344ms. Mine is 2594ms.

  • Steven Willis

    SSCrazy Eights

    Points: 9893

    Michael Meierruth (5/28/2013)


    Steven Willis (5/26/2013)


    I'm late to this party, but here's my entirely TSQL offering to remove duplicate spaces. Most of the code below has been not so shamelessly borrowed from Jeff Moden's work, especially the famous DelimitedSplit8K function which I have found to be one of the most useful tools in the toolbox.

    I have no idea how this compares to the other solutions but I thought I'd just throw it into the mix. It's based mostly off of some other variations I've developed from the DelimitedSplit8K function and I think it takes a different approach to the problem. Basically, it splits the string and then puts it back together again. Whether that's better or not I'll let the scorekeeper determine.

     

    OK. Yours is 6344ms. Mine is 2594ms.

    Thanks for taking the time to test this. My initial test using client statistics did show your method to be faster on my 1000 row test table. What testing method did you use? Do you think scalability becomes a factor for any of these methods where one might work better than another for a very large or very wide table? Not disputing the results--just wondering.

     

  • Gary Harding

    Ten Centuries

    Points: 1232

    Michael Meierruth (5/28/2013)


    Steven Willis (5/26/2013)


    I'm late to this party, but here's my entirely TSQL offering to remove duplicate spaces. Most of the code below has been not so shamelessly borrowed from Jeff Moden's work, especially the famous DelimitedSplit8K function which I have found to be one of the most useful tools in the toolbox.

    I have no idea how this compares to the other solutions but I thought I'd just throw it into the mix. It's based mostly off of some other variations I've developed from the DelimitedSplit8K function and I think it takes a different approach to the problem. Basically, it splits the string and then puts it back together again. Whether that's better or not I'll let the scorekeeper determine.

     

    OK. Yours is 6344ms. Mine is 2594ms.

    DelimitedSplit8K has many and varied uses, though I'm afraid this isn't one that flatters it.

    Michael's nested REPLACEs might look cumbersome, but they can't be beat for speed, especially inline.

    I tried out this more concise UDF variant of his algorithm using CTEs, but it doesn't quite fly fast enough.

    CREATE FUNCTION dbo.fnCoalesceSpaces(@Str varchar(max))

    RETURNS varchar(max) AS

    BEGIN

    ;WITH A AS (SELECT n FROM (VALUES (32),(16),(8),(4),(2),(1)) A(n))

    SELECT @STR = REPLACE(@Str COLLATE Latin1_General_BIN2, SPACE(n + 1), SPACE(1))

    FROM A

    ORDER BY n DESC

    RETURN @STR

    END

    Gary

  • Jeff Moden

    SSC Guru

    Points: 993892

    I believe that the reasons for that are many...

    1. It's a scalar UDF.

    2. It does calculations instead of using constants.

    3. It doesn't take advantage of nested replaces.

    4. It has a relatively very expensive ORDER BY.

    The nested replaces could actually be made into an Inline Table Valued Function as a kind of "iSF" or Inline Scalar Valued Function. Please see the following for more information on that subject.

    http://www.sqlservercentral.com/articles/T-SQL/91724/

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems

  • TomThomson

    SSC Guru

    Points: 104748

    Jeff Moden (3/26/2015)


    I believe that the reasons for that are many...

    1. It's a scalar UDF.

    2. It does calculations instead of using constants.

    3. It doesn't take advantage of nested replaces.

    4. It has a relatively very expensive ORDER BY.

    The nested replaces could actually be made into an Inline Table Valued Function as a kind of "iSF" or Inline Scalar Valued Function. Please see the following for more information on that subject.

    http://www.sqlservercentral.com/articles/T-SQL/91724/

    I'm not 100% certain, but I suspect that the ORDER BY is technically meaningless, it can only order the final output rows generated by a query although it may sometimes, if one is lucky, order the internal workings of the query too. So having this UDF produce the right result requires luck - not something I would therefor use in a production environment (because I don't want upgrades - including bugfixes - to the server to risk breaking the code even if I know it currently works - unless I could devise a good reliable method of detecting when things were happening in the wrong order so that I could use a dfferent method in that case, and I can see any efficient method of doing that.

    Writing a user defined table valued function with a single select statement nesting the 6 (or however many you choose) replace calls so that the thing runs inline and using explicit string constants instead of calls on SPACE is going to improve performance rather more than somewhat as well as providing formally valid SQL that will work for ever unless someone changes the standard.

    Tom

  • Alan Burstein

    SSC Guru

    Points: 61019

    I love this article Jeff and have been using your technique since I discovered this article a little more than a year ago.

    I was using your technique to create a function to do this today; out of habit I created an iTVF version (AKA inline scalar version). Out of curiosity I created a scalar version and tested it. As expected the iTVF version was faster (by about 50%). I thought the results were worth sharing.

    CREATE FUNCTION NormalizeWhitespace(@string varchar(max))

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    SELECT cleanstring =

    CASE

    WHEN charindex(' ',@string) > 0 THEN

    REPLACE(REPLACE(REPLACE(LTRIM(RTRIM(@string)),' ',' '+CHAR(127)),CHAR(127)+' ',''),CHAR(127),'')

    ELSE @string

    END;

    GO

    CREATE FUNCTION dbo.NormalizeWhitespace2 (@string varchar(max))

    RETURNS varchar(max) WITH SCHEMABINDING AS

    BEGIN

    RETURN

    CASE

    WHEN charindex(' ',@string) > 0 THEN

    REPLACE(REPLACE(REPLACE(LTRIM(RTRIM(@string)),' ',' '+CHAR(127)),CHAR(127)+' ',''),CHAR(127),'')

    ELSE @string

    END;

    END

    I used varchar(max) because that's what my app needs (keeping it "real world"). Here's the test harness:

    SET NOCOUNT ON

    IF OBJECT_ID('tempdb..#vals') IS NOT NULL DROP TABLE #vals;

    CREATE TABLE #vals (id int identity primary key, val varchar(8000) not null);

    INSERT #vals (val)

    SELECT TOP 500000 REPLACE(REPLACE(newid(),'-',' '),'A',' ')

    FROM sys.all_columns a, sys.all_columns b;

    dbcc freeproccache with no_infomsgs

    dbcc dropcleanbuffers with no_infomsgs

    PRINT '========== itvf version ==========';

    DECLARE

    @result varchar(8000),

    @StartTime DATETIME = GETDATE();

    SELECT @result = cleanstring

    FROM #vals

    CROSS APPLY dbo.NormalizeWhitespace(val);

    PRINT CAST(@@rowcount AS varchar(10))+' rows.';

    PRINT DATEDIFF(ms,@StartTime,GETDATE());

    GO 3

    PRINT '========== scalar version ==========';

    DECLARE

    @result varchar(8000),

    @StartTime DATETIME = GETDATE();

    SELECT @result = dbo.NormalizeWhitespace2(val)

    FROM #vals;

    PRINT CAST(@@rowcount AS varchar(10))+' rows.';

    PRINT DATEDIFF(ms,@StartTime,GETDATE());

    GO 3

    And the results:

    Beginning execution loop

    ========== itvf version ==========

    500000 rows.

    3496

    ========== itvf version ==========

    500000 rows.

    3210

    ========== itvf version ==========

    500000 rows.

    3256

    Batch execution completed 3 times.

    Beginning execution loop

    ========== scalar version ==========

    500000 rows.

    4813

    ========== scalar version ==========

    500000 rows.

    4780

    ========== scalar version ==========

    500000 rows.

    4810

    Batch execution completed 3 times.

    Note: COLLATE Latin1_General_BIN made no difference.

    Please forgive me if someone has already done this. Cheers.

    -- Alan Burstein


    Helpful links:Best practices for getting help on SQLServerCentral -- Jeff ModenHow to Post Performance Problems -- Gail ShawNasty fast set-based string manipulation functions:For splitting strings try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL Server 2012+)To split strings based on patterns try PatternSplitCMNeed to clean or transform a string? try NGrams, PatExclude8K, PatReplace8K, DigitsOnlyEE, or Translate8KI cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code.  -- Itzik Ben-Gan 2001

  • Jeff Moden

    SSC Guru

    Points: 993892

    Alan.B (5/26/2015)


    I love this article Jeff and have been using your technique since I discovered this article a little more than a year ago.

    Thanks, Alan, but you need to revisit the article. Read the Prologue and see the first link there, which leads to Michael's method of using nested replaces just for the spaces. It blows the door off my method and virtually all comers. That's what I like about this community... lot's of people joined in the discussion and the result was a truly nasty fast method that passed my on the highway so fast that it looked like my method was standing still.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems

  • Alan Burstein

    SSC Guru

    Points: 61019

    Jeff Moden (5/26/2015)


    Alan.B (5/26/2015)


    I love this article Jeff and have been using your technique since I discovered this article a little more than a year ago.

    Thanks, Alan, but you need to revisit the article. Read the Prologue and see the first link there, which leads to Michael's method of using nested replaces just for the spaces. It blows the door off my method and virtually all comers. That's what I like about this community... lot's of people joined in the discussion and the result was a truly nasty fast method that passed my on the highway so fast that it looked like my method was standing still.

    I should start reading prologues I guess 😉

    Nonetheless, I tried the same exercise with Michael's function and the iTVF was about 25% faster still. Long live the inline scalar iTVF!

    -- Alan Burstein


    Helpful links:Best practices for getting help on SQLServerCentral -- Jeff ModenHow to Post Performance Problems -- Gail ShawNasty fast set-based string manipulation functions:For splitting strings try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL Server 2012+)To split strings based on patterns try PatternSplitCMNeed to clean or transform a string? try NGrams, PatExclude8K, PatReplace8K, DigitsOnlyEE, or Translate8KI cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code.  -- Itzik Ben-Gan 2001

  • Jeff Moden

    SSC Guru

    Points: 993892

    Alan.B (5/26/2015)


    Jeff Moden (5/26/2015)


    Alan.B (5/26/2015)


    I love this article Jeff and have been using your technique since I discovered this article a little more than a year ago.

    Thanks, Alan, but you need to revisit the article. Read the Prologue and see the first link there, which leads to Michael's method of using nested replaces just for the spaces. It blows the door off my method and virtually all comers. That's what I like about this community... lot's of people joined in the discussion and the result was a truly nasty fast method that passed my on the highway so fast that it looked like my method was standing still.

    I should start reading prologues I guess 😉

    Nonetheless, I tried the same exercise with Michael's function and the iTVF was about 25% faster still. Long live the inline scalar iTVF!

    Can't go wrong there. Guess I'm going to have to revisit this article. Heh... maybe add a pre-prologue. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems

  • Michael Meierruth

    SSCrazy Eights

    Points: 9991

    Alan B.

    How does your function stand up to a varchar(8000) column with an * at the beginning and end and all blanks in between?

  • Alan Burstein

    SSC Guru

    Points: 61019

    Michael Meierruth (5/27/2015)


    Alan B.

    How does your function stand up to a varchar(8000) column with an * at the beginning and end and all blanks in between?

    Okay, first - both functions are much slower with a * at the beginning/end. My first test was at 500K rows and I cancelled before the first test finished. I did 10K, 50K and 100K rows. Here's DDL, Test and results:

    -- Ran on my Laptop -- X64 (8 Logical CPU, 2.8ghz, 32GB) Microsoft SQL Server 2014 - 12.0.2254.0 (X64)

    USE tempdb

    GO

    -- CREATE THE FUNCTIONS

    -----------------------------------------------------------------

    IF OBJECT_ID('dbo.fn_CleanUp_MichaelMeierruth_MKIV') IS NOT NULL

    DROP FUNCTION dbo.fn_CleanUp_MichaelMeierruth_MKIV

    GO

    CREATE FUNCTION dbo.fn_CleanUp_MichaelMeierruth_MKIV(@S VARCHAR(max))

    RETURNS VARCHAR(max) WITH SCHEMABINDING AS

    BEGIN

    RETURN replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(replace(@s,CHAR(9),' '))),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ')

    END;

    GO

    IF OBJECT_ID('dbo.NormalizeWhiteSpace8K') IS NOT NULL

    DROP FUNCTION dbo.NormalizeWhiteSpace8K

    GO

    CREATE FUNCTION dbo.NormalizeWhiteSpace8K(@String varchar(max))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    SELECT

    NewString =

    replace(replace(replace(replace(replace(replace(replace(

    ltrim(rtrim(@String)),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ');

    GO

    -- MAKE SURE BOTH STILL WORK (yep)

    -----------------------------------------------------------------

    DECLARE @string varchar(8000);

    SET @string = '*'+REPLICATE(' ',7998)+'*';

    SELECT * FROM dbo.NormalizeWhiteSpace8K(@string);

    SELECT dbo.fn_CleanUp_MichaelMeierruth_MKIV(@string);

    --THE TESTS

    -----------------------------------------------------------------

    SET NOCOUNT ON

    IF OBJECT_ID('tempdb..#vals') IS NOT NULL DROP TABLE #vals;

    CREATE TABLE #vals (id int identity primary key, val varchar(8000) not null);

    INSERT #vals (val)

    SELECT TOP 100000 '*'+REPLICATE(' ',7998)+'*'

    FROM sys.all_columns a, sys.all_columns b;

    dbcc freeproccache with no_infomsgs

    dbcc dropcleanbuffers with no_infomsgs

    PRINT '========== itvf version ==========';

    DECLARE

    @result varchar(8000),

    @StartTime DATETIME = GETDATE();

    SELECT @result = NewString

    FROM #vals

    CROSS APPLY dbo.NormalizeWhiteSpace8K(val);

    PRINT CAST(@@rowcount AS varchar(10))+' rows.';

    PRINT DATEDIFF(ms,@StartTime,GETDATE());

    GO 3

    PRINT '========== scalar version ==========';

    DECLARE

    @result varchar(8000),

    @StartTime DATETIME = GETDATE();

    SELECT @result = dbo.fn_CleanUp_MichaelMeierruth_MKIV(val)

    FROM #vals;

    PRINT CAST(@@rowcount AS varchar(10))+' rows.';

    PRINT DATEDIFF(ms,@StartTime,GETDATE());

    GO 3

    /*

    --10,000 Records

    Beginning execution loop

    ========== itvf version ==========

    10000 rows.

    1450

    ========== itvf version ==========

    10000 rows.

    1380

    ========== itvf version ==========

    10000 rows.

    1413

    Batch execution completed 3 times.

    Beginning execution loop

    ========== scalar version ==========

    10000 rows.

    2300

    ========== scalar version ==========

    10000 rows.

    2290

    ========== scalar version ==========

    10000 rows.

    2276

    Batch execution completed 3 times.

    */

    /*

    -- 50,000 rows

    Beginning execution loop

    ========== itvf version ==========

    50000 rows.

    7166

    ========== itvf version ==========

    50000 rows.

    7116

    ========== itvf version ==========

    50000 rows.

    7070

    Batch execution completed 3 times.

    Beginning execution loop

    ========== scalar version ==========

    50000 rows.

    11503

    ========== scalar version ==========

    50000 rows.

    11530

    ========== scalar version ==========

    50000 rows.

    11493

    Batch execution completed 3 times.

    */

    /*

    -- 100,000 rows

    Beginning execution loop

    ========== itvf version ==========

    100000 rows.

    14426

    ========== itvf version ==========

    100000 rows.

    14226

    ========== itvf version ==========

    100000 rows.

    14226

    Batch execution completed 3 times.

    Beginning execution loop

    ========== scalar version ==========

    100000 rows.

    22900

    ========== scalar version ==========

    100000 rows.

    23076

    ========== scalar version ==========

    100000 rows.

    22943

    Batch execution completed 3 times.

    */

    -- Alan Burstein


    Helpful links:Best practices for getting help on SQLServerCentral -- Jeff ModenHow to Post Performance Problems -- Gail ShawNasty fast set-based string manipulation functions:For splitting strings try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL Server 2012+)To split strings based on patterns try PatternSplitCMNeed to clean or transform a string? try NGrams, PatExclude8K, PatReplace8K, DigitsOnlyEE, or Translate8KI cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code.  -- Itzik Ben-Gan 2001

  • Gary Harding

    Ten Centuries

    Points: 1232

    Alan.B (5/27/2015)


    Michael Meierruth (5/27/2015)


    Alan B.

    How does your function stand up to a varchar(8000) column with an * at the beginning and end and all blanks in between?

    Okay, first - both functions are much slower with a * at the beginning/end. My first test was at 500K rows and I cancelled before the first test finished. I did 10K, 50K and 100K rows. Here's DDL, Test and results:

    -- Ran on my Laptop -- X64 (8 Logical CPU, 2.8ghz, 32GB) Microsoft SQL Server 2014 - 12.0.2254.0 (X64)

    USE tempdb

    GO

    -- CREATE THE FUNCTIONS

    -----------------------------------------------------------------

    IF OBJECT_ID('dbo.fn_CleanUp_MichaelMeierruth_MKIV') IS NOT NULL

    DROP FUNCTION dbo.fn_CleanUp_MichaelMeierruth_MKIV

    GO

    CREATE FUNCTION dbo.fn_CleanUp_MichaelMeierruth_MKIV(@S VARCHAR(max))

    RETURNS VARCHAR(max) WITH SCHEMABINDING AS

    BEGIN

    RETURN replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(replace(@s,CHAR(9),' '))),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ')

    END;

    GO

    IF OBJECT_ID('dbo.NormalizeWhiteSpace8K') IS NOT NULL

    DROP FUNCTION dbo.NormalizeWhiteSpace8K

    GO

    CREATE FUNCTION dbo.NormalizeWhiteSpace8K(@String varchar(max))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    SELECT

    NewString =

    replace(replace(replace(replace(replace(replace(replace(

    ltrim(rtrim(@String)),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ');

    GO

    -- MAKE SURE BOTH STILL WORK (yep)

    -----------------------------------------------------------------

    DECLARE @string varchar(8000);

    SET @string = '*'+REPLICATE(' ',7998)+'*';

    SELECT * FROM dbo.NormalizeWhiteSpace8K(@string);

    SELECT dbo.fn_CleanUp_MichaelMeierruth_MKIV(@string);

    --THE TESTS

    -----------------------------------------------------------------

    SET NOCOUNT ON

    IF OBJECT_ID('tempdb..#vals') IS NOT NULL DROP TABLE #vals;

    CREATE TABLE #vals (id int identity primary key, val varchar(8000) not null);

    INSERT #vals (val)

    SELECT TOP 100000 '*'+REPLICATE(' ',7998)+'*'

    FROM sys.all_columns a, sys.all_columns b;

    dbcc freeproccache with no_infomsgs

    dbcc dropcleanbuffers with no_infomsgs

    PRINT '========== itvf version ==========';

    DECLARE

    @result varchar(8000),

    @StartTime DATETIME = GETDATE();

    SELECT @result = NewString

    FROM #vals

    CROSS APPLY dbo.NormalizeWhiteSpace8K(val);

    PRINT CAST(@@rowcount AS varchar(10))+' rows.';

    PRINT DATEDIFF(ms,@StartTime,GETDATE());

    GO 3

    PRINT '========== scalar version ==========';

    DECLARE

    @result varchar(8000),

    @StartTime DATETIME = GETDATE();

    SELECT @result = dbo.fn_CleanUp_MichaelMeierruth_MKIV(val)

    FROM #vals;

    PRINT CAST(@@rowcount AS varchar(10))+' rows.';

    PRINT DATEDIFF(ms,@StartTime,GETDATE());

    GO 3

    /*

    --10,000 Records

    Beginning execution loop

    ========== itvf version ==========

    10000 rows.

    1450

    ========== itvf version ==========

    10000 rows.

    1380

    ========== itvf version ==========

    10000 rows.

    1413

    Batch execution completed 3 times.

    Beginning execution loop

    ========== scalar version ==========

    10000 rows.

    2300

    ========== scalar version ==========

    10000 rows.

    2290

    ========== scalar version ==========

    10000 rows.

    2276

    Batch execution completed 3 times.

    */

    /*

    -- 50,000 rows

    Beginning execution loop

    ========== itvf version ==========

    50000 rows.

    7166

    ========== itvf version ==========

    50000 rows.

    7116

    ========== itvf version ==========

    50000 rows.

    7070

    Batch execution completed 3 times.

    Beginning execution loop

    ========== scalar version ==========

    50000 rows.

    11503

    ========== scalar version ==========

    50000 rows.

    11530

    ========== scalar version ==========

    50000 rows.

    11493

    Batch execution completed 3 times.

    */

    /*

    -- 100,000 rows

    Beginning execution loop

    ========== itvf version ==========

    100000 rows.

    14426

    ========== itvf version ==========

    100000 rows.

    14226

    ========== itvf version ==========

    100000 rows.

    14226

    Batch execution completed 3 times.

    Beginning execution loop

    ========== scalar version ==========

    100000 rows.

    22900

    ========== scalar version ==========

    100000 rows.

    23076

    ========== scalar version ==========

    100000 rows.

    22943

    Batch execution completed 3 times.

    */

    Surely both of these methods would benefit from using a binary collation, allowing the REPLACE function to execute faster?

    For example, adding COLLATE Latin1_General_BIN2 to the function's inner @string parameter reference gives a significant speed improvement in my tests.

    Maybe there's a downside I haven't thought of, but I'd have thought a space is a space is a space in any collation.

  • Alan Burstein

    SSC Guru

    Points: 61019

    Gary Harding (5/28/2015)


    Alan.B (5/27/2015)


    Michael Meierruth (5/27/2015)


    Alan B.

    How does your function stand up to a varchar(8000) column with an * at the beginning and end and all blanks in between?

    Okay, first - both functions are much slower with a * at the beginning/end. My first test was at 500K rows and I cancelled before the first test finished. I did 10K, 50K and 100K rows. Here's DDL, Test and results:

    -- Ran on my Laptop -- X64 (8 Logical CPU, 2.8ghz, 32GB) Microsoft SQL Server 2014 - 12.0.2254.0 (X64)

    USE tempdb

    GO

    -- CREATE THE FUNCTIONS

    -----------------------------------------------------------------

    IF OBJECT_ID('dbo.fn_CleanUp_MichaelMeierruth_MKIV') IS NOT NULL

    DROP FUNCTION dbo.fn_CleanUp_MichaelMeierruth_MKIV

    GO

    CREATE FUNCTION dbo.fn_CleanUp_MichaelMeierruth_MKIV(@S VARCHAR(max))

    RETURNS VARCHAR(max) WITH SCHEMABINDING AS

    BEGIN

    RETURN replace(replace(replace(replace(replace(replace(replace(ltrim(rtrim(replace(@s,CHAR(9),' '))),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ')

    END;

    GO

    IF OBJECT_ID('dbo.NormalizeWhiteSpace8K') IS NOT NULL

    DROP FUNCTION dbo.NormalizeWhiteSpace8K

    GO

    CREATE FUNCTION dbo.NormalizeWhiteSpace8K(@String varchar(max))

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    SELECT

    NewString =

    replace(replace(replace(replace(replace(replace(replace(

    ltrim(rtrim(@String)),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' '),

    ' ',' ');

    GO

    -- MAKE SURE BOTH STILL WORK (yep)

    -----------------------------------------------------------------

    DECLARE @string varchar(8000);

    SET @string = '*'+REPLICATE(' ',7998)+'*';

    SELECT * FROM dbo.NormalizeWhiteSpace8K(@string);

    SELECT dbo.fn_CleanUp_MichaelMeierruth_MKIV(@string);

    --THE TESTS

    -----------------------------------------------------------------

    SET NOCOUNT ON

    IF OBJECT_ID('tempdb..#vals') IS NOT NULL DROP TABLE #vals;

    CREATE TABLE #vals (id int identity primary key, val varchar(8000) not null);

    INSERT #vals (val)

    SELECT TOP 100000 '*'+REPLICATE(' ',7998)+'*'

    FROM sys.all_columns a, sys.all_columns b;

    dbcc freeproccache with no_infomsgs

    dbcc dropcleanbuffers with no_infomsgs

    PRINT '========== itvf version ==========';

    DECLARE

    @result varchar(8000),

    @StartTime DATETIME = GETDATE();

    SELECT @result = NewString

    FROM #vals

    CROSS APPLY dbo.NormalizeWhiteSpace8K(val);

    PRINT CAST(@@rowcount AS varchar(10))+' rows.';

    PRINT DATEDIFF(ms,@StartTime,GETDATE());

    GO 3

    PRINT '========== scalar version ==========';

    DECLARE

    @result varchar(8000),

    @StartTime DATETIME = GETDATE();

    SELECT @result = dbo.fn_CleanUp_MichaelMeierruth_MKIV(val)

    FROM #vals;

    PRINT CAST(@@rowcount AS varchar(10))+' rows.';

    PRINT DATEDIFF(ms,@StartTime,GETDATE());

    GO 3

    /*

    --10,000 Records

    Beginning execution loop

    ========== itvf version ==========

    10000 rows.

    1450

    ========== itvf version ==========

    10000 rows.

    1380

    ========== itvf version ==========

    10000 rows.

    1413

    Batch execution completed 3 times.

    Beginning execution loop

    ========== scalar version ==========

    10000 rows.

    2300

    ========== scalar version ==========

    10000 rows.

    2290

    ========== scalar version ==========

    10000 rows.

    2276

    Batch execution completed 3 times.

    */

    /*

    -- 50,000 rows

    Beginning execution loop

    ========== itvf version ==========

    50000 rows.

    7166

    ========== itvf version ==========

    50000 rows.

    7116

    ========== itvf version ==========

    50000 rows.

    7070

    Batch execution completed 3 times.

    Beginning execution loop

    ========== scalar version ==========

    50000 rows.

    11503

    ========== scalar version ==========

    50000 rows.

    11530

    ========== scalar version ==========

    50000 rows.

    11493

    Batch execution completed 3 times.

    */

    /*

    -- 100,000 rows

    Beginning execution loop

    ========== itvf version ==========

    100000 rows.

    14426

    ========== itvf version ==========

    100000 rows.

    14226

    ========== itvf version ==========

    100000 rows.

    14226

    Batch execution completed 3 times.

    Beginning execution loop

    ========== scalar version ==========

    100000 rows.

    22900

    ========== scalar version ==========

    100000 rows.

    23076

    ========== scalar version ==========

    100000 rows.

    22943

    Batch execution completed 3 times.

    */

    Surely both of these methods would benefit from using a binary collation, allowing the REPLACE function to execute faster?

    For example, adding COLLATE Latin1_General_BIN2 to the function's inner @string parameter reference gives a significant speed improvement in my tests.

    Maybe there's a downside I haven't thought of, but I'd have thought a space is a space is a space in any collation.

    I have to try Latin1_General_BIN2. I tested with Latin1_General_BIN and saw no improvement.

    -- Alan Burstein


    Helpful links:Best practices for getting help on SQLServerCentral -- Jeff ModenHow to Post Performance Problems -- Gail ShawNasty fast set-based string manipulation functions:For splitting strings try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL Server 2012+)To split strings based on patterns try PatternSplitCMNeed to clean or transform a string? try NGrams, PatExclude8K, PatReplace8K, DigitsOnlyEE, or Translate8KI cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code.  -- Itzik Ben-Gan 2001

  • jessinthomas012

    SSC Rookie

    Points: 46

    Hi Jeff,

    Well explained.. Thanks for sharing this :)..

    Thanks,
    Jessin

  • Jeff Moden

    SSC Guru

    Points: 993892

    Thanks for the feedback, Jessin.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems

Viewing 15 posts - 406 through 420 (of 426 total)

You must be logged in to reply to this topic. Login to reply