Remove GUID within a String Field

  • Hi All,

    I have a string field that contains 1 or more GUID data which needs to be removed. Here's an example:

    Change from:

    “ REFERRED: 0A590C90-0A1C-11D7-9F88-0002A542FBC6 - DR. SEUSS TO 3991A265-2DC0-47BA-860C-97C615BE9BAC - DR. LORAX”

    To:

    “REFERRED: DR. SEUSS TO DR. LORAX”

    Thank you in advance. - mchu

  • Maybe this article will help?

    http://thefirstsql.com/2011/02/07/regular-expressions-advanced-string-matching-and-new-split-function-sql-server-2008-r2/

    The Reg Ex you'll need to use is I think:

    /^\{?[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\}?$/i


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • There's no need for the rigors of setting up any kind of external RegEx to solve this simple problem. While people that know me are likely going to have a heart attack when they see the following code, the following code will just as likely at least tie an external RegEx solution and, yes, it will also squeak past a Tally Table solution because it's 100% done in memory and it's done with machine-language-level range scans instead of the range scan that will occur if you use a Tally Table. It's actually what scalar UDF's were invented for and one of the few places where RBAR actually has the advantage. Of course, I threw in a couple of well known extra performance tricks in the process.

    Here's the code for a function that does what you ask. Yep... have your heart attack now, folks. It uses a While Loop AND a scalar function. 😉

    CREATE FUNCTION dbo.RemoveGuidsAndDashes

    (@SomeString VARCHAR(MAX))

    RETURNS VARCHAR(MAX) WITH SCHEMABINDING

    AS

    BEGIN

    --===== Declare some obviously named variables

    DECLARE @Pattern VARCHAR(MAX),

    @NextPosit INT

    ;

    --===== Create the GUID pattern, get rid of dashes surrounded by spaces,

    -- and find the position of the first GUID if it exists.

    SELECT @Pattern = '%[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]-[0-9a-f][0-9a-f][0-9a-f][0-9a-f]-[0-9a-f][0-9a-f][0-9a-f][0-9a-f]-[0-9a-f][0-9a-f][0-9a-f][0-9a-f]-[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]%',

    @SomeString = REPLACE(@SomeString,' - ',''),

    @NextPosit = PATINDEX(@Pattern,@SomeString)

    --===== If the SELECT above found a GUID, STUFF each one out of existance.

    -- This will be skipped if no GUID was found.

    WHILE @NextPosit > 0

    SELECT @SomeString = STUFF(@SomeString,@NextPosit,36,''),

    @NextPosit = PATINDEX(@Pattern,@SomeString)

    ;

    RETURN @SomeString

    ;

    END

    ;

    Here's a usage example which solves the OPs problem.

    SELECT dbo.RemoveGuidsAndDashes('REFERRED: 0A590C90-0A1C-11D7-9F88-0002A542FBC6 - DR. SEUSS TO 3991A265-2DC0-47BA-860C-97C615BE9BAC - DR. LORAX')

    ;

    That produces the following result.

    REFERRED: DR. SEUSS TO DR. LORAX

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Rats! I was looking at PATINDEX and STUFF to solve this but didn't make the leap from RegEx to using the LIKE patterning to identify the GUIDs.

    Then I realized I couldn't do it without a loop, found these new SQL functions that purport to do it (didn't test but would like to) and short-sheeted my reply.

    Have you tested this solution against the RegEx functions that SQL Server 2008 R2 now provides in your 1M row table? I'd be curious about the result.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (3/20/2012)


    Rats! I was looking at PATINDEX and STUFF to solve this but didn't make the leap from RegEx to using the LIKE patterning to identify the GUIDs.

    Then I realized I couldn't do it without a loop, found these new SQL functions that purport to do it (didn't test but would like to) and short-sheeted my reply.

    Have you tested this solution against the RegEx functions that SQL Server 2008 R2 now provides in your 1M row table? I'd be curious about the result.

    Haven't put my hands on R2, yet, so no... haven't done the comparison.

    This CAN be done with a Tally Table but it's a bit slower especially since you still end up with a scalar function due to the variable overlay.

    SELECT @SomeString = STUFF(@SomeString,t.N,36,'')

    FROM dbo.Tally t

    WHERE t.N BETWEEN 1 AND LEN(@SomeString)-35

    AND SUBSTRING(@SomeString,t.N,36) LIKE '%[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]-[0-9a-f][0-9a-f][0-9a-f][0-9a-f]-[0-9a-f][0-9a-f][0-9a-f][0-9a-f]-[0-9a-f][0-9a-f][0-9a-f][0-9a-f]-[0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f][0-9a-f]%'

    ORDER BY t.N DESC

    ;

    SELECT @SomeString

    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ...

    It's actually what scalar UDF's were invented for and one of the few places where RBAR actually has the advantage. Of course, I threw in a couple of well known extra performance tricks in the process.

    ...

    If you really want a best possible performance for this sort of task, you need to implement it as CLR functions (of cause without using RegEx).

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Eugene Elutin (3/22/2012)


    ...

    It's actually what scalar UDF's were invented for and one of the few places where RBAR actually has the advantage. Of course, I threw in a couple of well known extra performance tricks in the process.

    ...

    If you really want a best possible performance for this sort of task, you need to implement it as CLR functions (of cause without using RegEx).

    Let's have a race and see how much of a difference there is. If you agree, I'll write some code to populate a million row test table and setup a test harness. You write the CLR and run the test harness.

    I believe what we'll find is that there will be little difference in this case and the scalar UDF might even squeak past the CLR because it doesn't have to go through an API.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • ...

    Let's have a race and see how much of a difference there is.

    ...

    NO Jeff! You don't want to do this!

    Ok,

    Here we are...

    1. Test data ( I have around 850,000 records as result...):

    set nocount on;

    --drop table #test

    select 'Referred ' + rn + ': ' + id1 + ' - DR. SEUSS ' + rn + ' TO ' + id2 as val

    into #test

    from (

    select top 1000000 cast(NEWID() as varchar(50)) id1, cast(NEWID() as varchar (50)) id2, cast(ROW_NUMBER() over (order by c1.name) as varchar) rn

    from sys.columns c1, sys.columns c2 ) a

    Now using your function ( SELECT dbo.RemoveGuidsAndDashes(val) valc into #test1 from #test ), 5 runs, I have :

    SQL Server Execution Times:

    MIN: CPU time = 197219 ms, elapsed time = 685073 ms.

    MAX: CPU time = 200984 ms, elapsed time = 737829 ms.

    For right now, I have just created CLR using RegEx:

    using System;

    using System.Data;

    using System.Data.SqlClient;

    using System.Data.SqlTypes;

    using Microsoft.SqlServer.Server;

    using System.Text.RegularExpressions;

    public partial class UserDefinedFunctions

    {

    [Microsoft.SqlServer.Server.SqlFunction]

    public static SqlString ReplaceGuidRegEx(SqlString val)

    {

    // Put your code here

    if (val.IsNull) return SqlString.Null;

    return (SqlString) Regex.Replace(val.ToString(), @"({)?([0-9a-fA-F]){8}(-([0-9a-fA-F]){4}){3}-([0-9a-fA-F]){12}(?(1)})",String.Empty);

    }

    };

    Registry assembly and declare function (if some one want to try...):

    CREATE ASSEMBLY CLRPlay from 'C:\_test\CLRPlay\CLRPlay\bin\Release\CLRPlay.dll' WITH PERMISSION_SET = SAFE;

    GO

    CREATE FUNCTION ReplaceGuidRegEx(@val NVARCHAR(MAX)) RETURNS NVARCHAR(MAX)

    AS EXTERNAL NAME CLRPlay.UserDefinedFunctions.ReplaceGuidRegEx;

    GO

    Test in the same way ( SELECT dbo.ReplaceGuidRegEx(val) valc into #test1 from #test ), 5 runs, results:

    SQL Server Execution Times:

    MIN: CPU time = 62828 ms, elapsed time = 66123 ms.

    MAX: CPU time = 64844 ms, elapsed time = 68028 ms.

    Is it big enough difference? Am I doing something wrong with your UDF?

    I will try to build CLR based on byte array) and see if I can beat RegEx ...

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • I will definitely try with byte array a bit later, but for now another version with using static instance of RegEx (just added to the same asembly):

    static readonly Regex _regex = new Regex(@"({)?([0-9a-fA-F]){8}(-([0-9a-fA-F]){4}){3}-([0-9a-fA-F]){12}(?(1)})", RegexOptions.Compiled);

    [Microsoft.SqlServer.Server.SqlFunction]

    public static SqlString ReplaceGuidRegExSC(SqlString val)

    {

    // Put your code here

    if (val.IsNull) return SqlString.Null;

    return (SqlString)_regex.Replace(val.ToString(), String.Empty);

    }

    Test 5 times as SELECT dbo.ReplaceGuidRegExSC(val) valc into #test1 from #test

    Results:

    SQL Server Execution Times:

    MIN: CPU time = 41187 ms, elapsed time = 43602 ms.

    MAX: CPU time = 40906 ms, elapsed time = 45441 ms.

    Got it under 1 minute now! :hehe:

    As you can see, it is 30% faster!

    There're another options to explore:

    1. convert input string into lowercase and simplify regex expression (may give another 10%)

    2. Try to use RightToLeft option

    Should I really spend my time on "byte array" custom code?...

    😎

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Thank you all for your assistance and solutions. Since I do not have MDS installed, I will have to use replace function. -mchu

  • You don't need MDS to create and compile c# code...

    http://blogs.oberon.ch/tamberg/2007-10-17/compiling-csharp-without-visual-studio.html

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help[/url]

  • Eugene Elutin (3/23/2012)


    ...

    Let's have a race and see how much of a difference there is.

    ...

    NO Jeff! You don't want to do this!

    BWAAA-HAAA!!!! Of course I do. But you beat me to it. 🙂 Thanks for the testing, Eugene, and well done.

    You've made the point for me that I was going to make. A potential 2 to 5 second improvement on something that takes more than a minute to execute might not make it worth it to have separately managed compiled code. I say "potential" because, if you look at the min time for the T-SQL function and the max time for the CLR function in this case, the TSQL function was only a half second slower than the CLR function at least once. 😉

    Well written CLRs can be a godsend if you're really in a pinch but they're not always the performance panacea that a lot of people make them out to be ("If you really want a best possible performance for this sort of task, you need to implement it as CLR functions") . CLR is not always the ultimate answer depending on what you're trying to do and what you value the most. Personally, I'm currently replacing some "well written" CLR audit triggers that take 00:02:39 to execute on a lousy 10,000 rows with some T-SQL code that takes subsecond times. Of course, that's a different type of task but I don't want anyone to think that CLRs are always faster than T-SQL no matter how well written they may be.

    Of course, if you have a DBA that's been badly bitten by the infamous "C3" problem like I have, there's a pretty good chance that DBA simply won't allow CLR functionallity to even be turned on in his/her servers.

    Now, the RegEx capabilities of 2008 R2 is a bit of a different story. It's real hard to turn down a 30% gain in performance.

    Shifting gears a bit, I have found that SET STATISTICS ON will sometimes inflate the duration on certain types of code (I've actually stopped using it for performance testing because it sometimes lies so bad). Could I bother you to run the same tests against SQL Profiler and return DURATION, CPU, READS, WRITES, and ROWCOUNTs instead of using SET SATISTICS? Thanks.

    [font="Arial Black"]C3[/font] = A "Modenism" for "[font="Arial Black"]C[/font]rap [font="Arial Black"]C[/font]ode in a [font="Arial Black"]C[/font]LR"

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply