Help with using mulitple replace in a function

  • Jeff Moden (7/16/2010)


    Julie Breutzmann (7/16/2010)


    I've modified Lowell's function and performed a time-comparison test. Ihaven't thoroughly tested the function, since if it's too slow it won't be useful anyway. I used and input file of about 2000 records and compared on first name, last name, and address line 1 against our database of about 800,000 records. It takes less than 1 second with a simple comparison and over 10 mintes using the function. However, the function DID find 69 matches that the simple comparison did not.

    That's what I was afraid of. There is a way to do this same thing using an iTVF (inline Table Valued Function) and a Cross Apply that should speed things up by (possibly) and order of magnitude. The problem is that I won't be at a machine with SQL Server on it until Monday and any code that I would write would be untested until then.

    This is not a rush project. I'd appreciate anything you can come up with when you're able to get to it.

    Julie

  • I'll try to remember it on Monday, then, Julie. Thanks for your patience.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/15/2010)


    Eugene Elutin (7/14/2010)


    Julie Breutzmann (7/14/2010)


    ...

    What would be an efficient way to code this?

    ...

    If you need high-performance heavy string manipulation functions write them as CLR functions in c#. T-SQL UDF will always be slower...

    Maybe in this case... maybe not. I've beaten certain CLR functions.

    Any examples? In "some" degree it depends on the developer proficience in C#...

    But again, as per "Never say never", I should keep to "Never say always", you might be right...;-)

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Eugene Elutin (7/19/2010)


    Jeff Moden (7/15/2010)


    Eugene Elutin (7/14/2010)


    Julie Breutzmann (7/14/2010)


    ...

    What would be an efficient way to code this?

    ...

    If you need high-performance heavy string manipulation functions write them as CLR functions in c#. T-SQL UDF will always be slower...

    Maybe in this case... maybe not. I've beaten certain CLR functions.

    Any examples? In "some" degree it depends on the developer proficience in C#...

    But again, as per "Never say never", I should keep to "Never say always", you might be right...;-)

    Unfortunately, no. Matt Miller and I went round and round a long time ago and I actually beat some "Regex" CLR's but I can't put my hands on those particular threads anymore. I'm not sure Matt can either but I believe he'll vouch for my claims.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/19/2010)


    Eugene Elutin (7/19/2010)


    Jeff Moden (7/15/2010)


    Eugene Elutin (7/14/2010)


    Julie Breutzmann (7/14/2010)


    ...

    What would be an efficient way to code this?

    ...

    If you need high-performance heavy string manipulation functions write them as CLR functions in c#. T-SQL UDF will always be slower...

    Maybe in this case... maybe not. I've beaten certain CLR functions.

    Any examples? In "some" degree it depends on the developer proficience in C#...

    But again, as per "Never say never", I should keep to "Never say always", you might be right...;-)

    Unfortunately, no. Matt Miller and I went round and round a long time ago and I actually beat some "Regex" CLR's but I can't put my hands on those particular threads anymore. I'm not sure Matt can either but I believe he'll vouch for my claims.

    I believe you. However, Regex is not the fasterst way to do thing in C#, short and elegant - yes, fast - not really - too havy library. The fasterst string manipulations would be byte by byte. Not so elegant, a lot of manual coding, but -very fast. Lets say this: it's not easy to beat proper implementation in C#. It may be possible and I have heard such claims, but have not seen good example yet...

    One more thing, just as example. In C#, Replace function is case sensitive. To do case-insensitive replace, Regex is often used. However the faster option is to do such replacement char-by-char (byte-by-byte). You will be surprised: this will beat in-build case-sensitive Replace function as well in all cases except when nothing to replace:

    http://www.codeproject.com/KB/string/fastestcscaseinsstringrep.aspx

    Would be intresting to compare it with SQL Replace...

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Eugene Elutin (7/19/2010)


    The fasterst string manipulations would be byte by byte. Not so elegant, a lot of manual coding, but -very fast.

    Now, I'll definitely agree with THAT! Done correctly, it will be very fast, indeed! I also agree that it would be interesting to have a simple race between the built in REPLACE (T-SQL) and a CLR.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I agree that this "may" perform better as CLR; however, if I were stuck doing this in TSQL. I would create a blacklist table, that housed the characters I wanted to scrub and its replacement value. This gives you a lot more flexibility and makes the process easier to maintain. You could potentially do something like this. I have not tested large scale performance but this should be better than a scalar function.

    *I am assuming that you have a numbers table*

    USE [tempdb]

    GO

    SET NOCOUNT ON;

    GO

    CREATE TABLE dbo.Blacklist(ExlChar CHAR(1),RepChar CHAR(1));

    INSERT INTO Blacklist VALUES ('.','')

    INSERT INTO Blacklist VALUES (' ','')

    INSERT INTO Blacklist VALUES ('-','')

    INSERT INTO Blacklist VALUES ('_','')

    INSERT INTO Blacklist VALUES ('~','')

    INSERT INTO Blacklist VALUES ('ž','z')

    INSERT INTO Blacklist VALUES ('ó','o')

    INSERT INTO Blacklist VALUES ('š','s')

    INSERT INTO Blacklist VALUES ('í','i')

    INSERT INTO Blacklist VALUES ('é','e')

    --Create sample table

    CREATE TABLE dbo.T(Id INT IDENTITY,col VARCHAR(50));

    INSERT INTO t VALUES ('Haínes')

    INSERT INTO t VALUES ('Haines')

    INSERT INTO t VALUES (' -Haines')

    INSERT INTO t VALUES ('-Haines ')

    GO

    --Create an Inline TVF function

    CREATE FUNCTION dbo.fn_CleanStr()

    RETURNS TABLE

    RETURN(

    SELECT

    t1.Id,

    t1.Col,

    (

    SELECT LTRIM(RTRIM(COALESCE([RepChar],SUBSTRING(LTRIM(RTRIM(col)),n,1))))

    FROM t t2

    INNER JOIN dbo.Numbers n ON n <= LEN(LTRIM(col))

    LEFT JOIN dbo.Blacklist b ON [ExlChar] = SUBSTRING(LTRIM(RTRIM(col)),n,1)

    WHERE t2.Id = t1.Id

    ORDER BY t2.Id, t2.[col], n.n

    FOR XML PATH(''), TYPE

    ).value('.','varchar(max)') AS CleanStr

    FROM t t1

    )

    GO

    --Begin solution

    SELECT fn.*

    FROM dbo.fn_CleanStr() AS fn

    WHERE fn.CleanStr = 'Haines'

  • Jeff Moden (7/20/2010)


    ...

    it would be interesting to have a simple race between the built in REPLACE (T-SQL) and a CLR.

    In a simple race single T-SQL REPLACE outperforms CLR simple byte-by-byte version : it is 3 times faster. Which is not suprising as CLR call not coming cheap enough...

    However, for multiple replacements on the same string dedicated CLR replace function which handles multiple replacements in a single call catching up with T-SQL one. After number of replacements reached 15 - CLR starts constantly outperform REPLACE.

    Also, I beleive that CLR function (based on the code from the link supplied in my previos post) can be tuned further...

    using System;

    using System.Collections.Generic;

    using System.Text;

    using System.Data.SqlClient;

    using System.Data.SqlTypes;

    namespace QuickReplaceX

    {

    public class ReplaceX

    {

    [Microsoft.SqlServer.Server.SqlFunction]

    public static SqlString Replace(string original, string pattern, string replacement)

    {

    return (SqlString)doReplace(original, pattern, replacement);

    }

    [Microsoft.SqlServer.Server.SqlFunction]

    public static SqlString ReplaceMltPattern(string original, string patterns, string replacement, char separator)

    {

    string res = original;

    string[] aPatterns = patterns.Split(separator);

    for (int i = 0; i < aPatterns.Length; i++)

    {

    res = doReplace(res, aPatterns, replacement);

    }

    return (SqlString) res;

    }

    [Microsoft.SqlServer.Server.SqlFunction]

    public static SqlString ReplaceMltAll(string original, string patterns, string replacements, char separator)

    {

    string res = original;

    string[] aPatterns = patterns.Split(separator);

    string[] aReplacements = replacements.Split(separator);

    if (aPatterns.Length != aReplacements.Length)

    {

    throw new ApplicationException("ReplaceMltAll: number of patterns must match number of replacements");

    }

    for (int i = 0; i < aPatterns.Length; i++)

    {

    res = doReplace(res, aPatterns, aReplacements);

    }

    return (SqlString)res;

    }

    private static string doReplace(string original, string pattern, string replacement)

    {

    int count, position0, position1;

    count = position0 = position1 = 0;

    string upperString = original.ToUpper();

    string upperPattern = pattern.ToUpper();

    int inc = (original.Length / pattern.Length) *

    (replacement.Length - pattern.Length);

    char[] chars = new char[original.Length + Math.Max(0, inc)];

    while ((position1 = upperString.IndexOf(upperPattern,

    position0)) != -1)

    {

    for (int i = position0; i < position1; ++i)

    chars[count++] = original;

    for (int i = 0; i < replacement.Length; ++i)

    chars[count++] = replacement;

    position0 = position1 + pattern.Length;

    }

    if (position0 == 0) return original;

    for (int i = position0; i < original.Length; ++i)

    chars[count++] = original;

    return new string(chars, 0, count);

    }

    }

    }

    and to register:

    CREATE ASSEMBLY [QuickReplaceX]

    AUTHORIZATION [dbo]

    FROM '[full path]\QuickReplaceX.dll'

    WITH PERMISSION_SET = SAFE

    GO

    CREATE FUNCTION QReplace(@original NVARCHAR(MAX), @pattern NVARCHAR(MAX), @replacement NVARCHAR(MAX))

    RETURNS NVARCHAR(MAX) EXTERNAL NAME [QuickReplaceX].[QuickReplaceX.ReplaceX].[Replace]

    GO

    CREATE FUNCTION QReplaceMltPattern(@original NVARCHAR(MAX), @patterns NVARCHAR(MAX), @replacement NVARCHAR(MAX), @separator nchar(1) )

    RETURNS NVARCHAR(MAX) EXTERNAL NAME [QuickReplaceX].[QuickReplaceX.ReplaceX].[ReplaceMltPattern]

    GO

    CREATE FUNCTION QReplaceMltAll(@original NVARCHAR(MAX), @patterns NVARCHAR(MAX), @replacements NVARCHAR(MAX), @separator nchar(1))

    RETURNS NVARCHAR(MAX) EXTERNAL NAME [QuickReplaceX].[QuickReplaceX.ReplaceX].[ReplaceMltAll]

    GO

    Dont forget to enable CLR 🙂

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Thank you Adam. The function that I posted is actually using a partial blacklist table, so I'll look at merging our thoughts.

    Thanks Eugene for your assistance. I'll try to understand and implement your solution, also.

    It may be a while before I have a chance to explore these options. I am suddenly swamped with other things of higher priority. I will post my results back to this forum when I am able to do so.

    Julie

  • Eugene Elutin (7/22/2010)


    Jeff Moden (7/20/2010)


    ...

    it would be interesting to have a simple race between the built in REPLACE (T-SQL) and a CLR.

    In a simple race single T-SQL REPLACE outperforms CLR simple byte-by-byte version : it is 3 times faster. Which is not suprising as CLR call not coming cheap enough...

    However, for multiple replacements on the same string dedicated CLR replace function which handles multiple replacements in a single call catching up with T-SQL one. After number of replacements reached 15 - CLR starts constantly outperform REPLACE.

    Also, I beleive that CLR function (based on the code from the link supplied in my previos post) can be tuned further...

    Thanks for the testing Eugene. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 10 posts - 16 through 24 (of 24 total)

You must be logged in to reply to this topic. Login to reply