do you have a StripNonNumeric ITVF function?

  • Lowell

    SSC Guru

    Points: 323331

    we all know Inline Table Value Functions perform at least a couple of order of magnitude faster than a scalar function, but there are some functions I have in my collection that are scalar, that , so far, I don't have an ITVF equivilent for.

    I've got a slow running process, that i found is using a scalar function to strip out non numeric characters; my knee jerk reaction was "I'll just replace that and change to a cross apply /outer apply to do the same thing!Q, and then realized i wasn't making any headway on modifying an existing example to become an ITVF.

    before i reinvent the wheel, does anyone have one?

    here's an example of a scalar function, that is featuring a Tally Table to build the "right" string. run that on a million rows, and it bogs things down a wee bit. :w00t:

    IF OBJECT_ID('[dbo].[StripNonNumeric]') IS NOT NULL

    DROP FUNCTION [dbo].[StripNonNumeric]

    GO

    CREATE FUNCTION StripNonNumeric(@OriginalText VARCHAR(8000))

    RETURNS VARCHAR(8000)

    BEGIN

    DECLARE @CleanedText VARCHAR(8000)

    ;WITH tally (N) as

    (SELECT TOP 10000 row_number() OVER (ORDER BY sc1.id)

    FROM Master.dbo.SysColumns sc1

    CROSS JOIN Master.dbo.SysColumns sc2)

    SELECT @CleanedText = ISNULL(@CleanedText,'') +

    CASE

    --ascii numbers are 48(for '0') thru 57 (for '9')

    WHEN ASCII(SUBSTRING(@OriginalText,Tally.N,1)) BETWEEN 48 AND 57 THEN SUBSTRING(@OriginalText,Tally.N,1) ELSE '' END

    FROM tally WHERE Tally.N <= LEN(@OriginalText)

    RETURN @CleanedText

    END

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Sean Lange

    SSC Guru

    Points: 286366

    Taking your fine code as a starting point I came up with this. I didn't test against a million rows but I think maybe I will do that now.

    CREATE FUNCTION StripNonNumeric_itvf(@OriginalText VARCHAR(8000))

    RETURNS TABLE WITH SCHEMABINDING AS

    return

    WITH

    E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    Tally(N) AS

    (

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    select STUFF(

    (

    SELECT SUBSTRING(@OriginalText, t.N, 1)

    FROM tally t

    WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57

    FOR XML PATH('')

    ), 1 ,0 , '') as CleanedText

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Lowell

    SSC Guru

    Points: 323331

    smacking myself in the head!

    I didn't think of using FOR XML to handle the concatenation; that, sir, is beautiful and inspired.

    You'd think i'd know better, so double kudos to you.

    i'll test it on my process, but I'm pretty sure it'll be quicker than what is there now.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Sean Lange

    SSC Guru

    Points: 286366

    Lowell (6/25/2014)


    smacking myself in the head!

    I didn't think of using FOR XML to handle the concatenation; that, sir, is beautiful and inspired.

    You'd think i'd know better, so double kudos to you.

    i'll test it on my process, but I'm pretty sure it'll be quicker than what is there now.

    Awesome. Glad that will work. Sometime it is easier to see a solution when aren't familiar with the current process. I have added that to my own toolbox as that could definitely come in handy at some point.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Alan Burstein

    SSC Guru

    Points: 61006

    Gents

    First: Dang it Sean - you beet me to it! I had a very similar solution but you posted yours first. That said, I did do a performance test and ran into something I cannot figure out.

    on my laptop (Windows 8, 2 cores/4 logical cpu/16GB run SQL Server 2014 Ent) I ran your solution (with results on) and it takes 31 seconds complete (my test harness data below). I tried to improve the time in your query but had no success.

    I have an ngrams function that I use for this kind of thing. Below is the function with my notes on how to use it:

    IF OBJECT_ID('tempdb.dbo.nGrams8K') IS NOT NULL DROP FUNCTION dbo.nGrams8K

    GO

    CREATE FUNCTION dbo.nGrams8K (@string varchar(8000), @k int)

    /********************************************************************

    Created by: Alan Burstein

    Created on: 3/10/2014

    Last Updated on: 5/22/2015

    n-gram defined:

    In the fields of computational linguistics and probability,

    an n-gram is a contiguous sequence of n items from a given

    sequence of text or speech. The items can be phonemes, syllables,

    letters, words or base pairs according to the application.

    For more information see: http://en.wikipedia.org/wiki/N-gram

    Use:

    Outputs a stream of tokens based on an input string.

    Similar to mdq.nGrams:

    http://msdn.microsoft.com/en-us/library/ff487027(v=sql.105).aspx.

    Except it only returns characters as long as K.

    nGrams8K also includes the position of the "Gram" in the string.

    Examples of how to use included as comments after the code.

    ********************************************************************/

    RETURNS TABLE

    WITH SCHEMABINDING

    AS

    RETURN

    WITH

    E1(N) AS (SELECT 1 FROM (VALUES (null),(null),(null),(null),(null)) x(n)),

    E3(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c),

    iTally(N) AS

    (

    SELECT TOP (LEN(@string)-(@k-1)) CHECKSUM(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)))

    FROM E3 a CROSS JOIN E3 b

    )

    SELECT

    position = N,

    token = SUBSTRING(@string,N,@k)

    FROM iTally;

    GO

    /********************************************************************

    (1) Basic Use

    -----------------------------------------------------------

    -- (A) With @k as 1,2 & 3:

    SELECT position, token FROM dbo.nGrams8K('abcdefg',1);

    SELECT position, token FROM dbo.nGrams8K('abcdefg',2);

    SELECT position, token FROM dbo.nGrams8K('abcdefg',3);

    -- (B) Using variables

    DECLARE

    @string varchar(20) = 'abcdefg12345', @tokenLen tinyint = 3;

    SELECT position, token FROM dbo.nGrams8K(@string,@tokenLen)

    GO

    (2) Character Count (including 0 Counts)

    -----------------------------------------------------------

    -- (A) Basic character count

    SELECT token = ISNULL(token,'Total:'), count(*)

    FROM dbo.nGrams8K(newid(),1)

    GROUP BY token

    WITH ROLLUP; -- Added rollup for display

    -- (B) Character Account including gaps

    DECLARE

    @alphabet VARCHAR(26)='ABCDEFGHIJKLMNOPQRSTUVWXYZ',

    @string VARCHAR(100)='The quick green fox jumps over the lazy dog and the lazy dogs.';

    SELECT a.token, COUNT(b.token) ttl

    FROM dbo.nGrams8K(@alphabet,1) a

    LEFT JOIN dbo.nGrams8K(@string,1) b ON a.token=b.token

    GROUP BY a.token

    ORDER BY a.token; -- not required, for display only

    GO

    -- (C) Let's try French ordered by most frequent

    DECLARE

    @alphabet VARCHAR(36)='abcdefghijklmnopqrstuvwxyzéèçëòôöùàâ',

    @string VARCHAR(100)='Le renard vert rapide saute par dessus le chien paresseux et le chien paresseux juste posé là.';

    WITH charcount AS

    (

    SELECT a.token, COUNT(b.token) ttl

    FROM dbo.nGrams8K(@alphabet,1) a

    LEFT JOIN dbo.nGrams8K(@string,1) b ON a.token=b.token

    GROUP BY a.token

    )

    SELECT * FROM charcount ORDER BY ttl DESC;

    GO

    (3) a *SIMPLE* word count

    -----------------------------------------------------------

    DECLARE @string varchar(100)='THE QUICK Green FOX JUMPED OVER THE LAZY DOGS BACK';

    SELECT @string AS string, count(*)+1 AS words

    FROM dbo.nGrams8K(@string,1)

    WHERE [token]=' '

    GO

    (4) search for occurances and location of a substring

    -----------------------------------------------------------

    DECLARE

    @string VARCHAR(100)='The green fox jumps over the dog and the dog just laid there.',

    @searchString VARCHAR(100)='The';

    SELECT position, token AS searchString

    FROM dbo.nGrams8K(@string,LEN(@searchstring)) b

    WHERE token=@searchString;

    GO

    (5) Strip non-numeric characters from a string

    -----------------------------------------------------------

    DECLARE @string varchar(100) = 'abc123xyz555xxx999!';

    WITH stripNonNumeric(cleanstring) AS

    (

    SELECT token+''

    FROM dbo.nGrams8K(@string,1)

    WHERE token LIKE '[0-9]'

    FOR XML PATH('')

    )

    SELECT cleanstring

    FROM stripNonNumeric;

    (6) Find all occurances of a pattern in a string

    ------------------------------------------------------------

    DECLARE

    @string varchar(100) = 'zz12x345xxx555abc1234zz5xxx',

    @pat varchar(100) = '[a-z][a-z][0-9]',

    @len int = 3;

    SELECT start_pos = position, token

    FROM dbo.nGrams8K(@string,@len)

    WHERE token LIKE @pat

    SELECT * FROM dbo.findpat8k(@string,@pat,3);

    GO

    (7) Find the longest common substring between 2 strings

    ------------------------------------------------------------

    -- (A) The Function

    IF EXISTS

    (

    SELECT * FROM INFORMATION_SCHEMA.ROUTINES

    WHERE ROUTINE_SCHEMA='dbo' AND ROUTINE_NAME = 'lcss'

    )

    DROP FUNCTION dbo.lcss;

    GO

    CREATE FUNCTION dbo.lcss(@string1 varchar(100), @string2 varchar(100))

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    WITH iTally(N) AS

    (

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM dbo.nGrams8K

    (

    CASE WHEN LEN(@string1)<=LEN(@string2) THEN @string1

    ELSE @string2 END,1

    )

    )

    SELECT TOP (1) with ties token

    FROM iTally

    CROSS APPLY

    dbo.nGrams8K

    (

    CASE WHEN LEN(@string1)<=LEN(@string2) THEN @string1

    ELSE @string2 END, n

    )

    WHERE N <=

    CASE WHEN LEN(@string1)<=LEN(@string2) THEN LEN(@string1)

    ELSE LEN(@string2) END

    AND charindex

    (

    token,

    CASE WHEN LEN(@string1)<=LEN(@string2) THEN @string2

    ELSE @string1 END

    ) > 0

    ORDER BY len(token) DESC;

    GO

    -- (B) example of how to use lcss

    DECLARE

    @string1 varchar(100) = '999xxyyyzaa99xxyyyzz',

    @string2 varchar(100) = '000xxxyyyzzz';

    SELECT string1 = @string1, string2 = @string2, token

    FROM dbo.lcss(@string1, @string2);

    ********************************************************************/

    GO

    I ran the following 10K Row Test:

    SET NOCOUNT ON;

    DBCC FREEPROCCACHE WITH NO_INFOMSGS

    DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS

    SET STATISTICS IO ON;

    SET STATISTICS TIME ON;

    PRINT 'Sean'+char(13);

    SELECT CleanedText

    FROM #strings

    CROSS APPLY dbo.StripNonNumeric_itvf(string);

    PRINT 'Alan'+char(13);

    SELECT CleanedText

    FROM #strings

    CROSS APPLY dbo.StripNonNumeric_itvf_ajb(string);

    SET STATISTICS IO OFF;

    SET STATISTICS TIME OFF;

    ... and here's the results:

    Sean

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    Table '#strings____________________________________________________________________________________________________________00000000004C'.

    Scan count 1, logical reads 63, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 31172 ms, elapsed time = 31213 ms.

    Alan

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    Table '#strings____________________________________________________________________________________________________________00000000004C'.

    Scan count 1, logical reads 63, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 328 ms, elapsed time = 384 ms.

    SQL Server Execution Times:

    CPU time = 0 ms, elapsed time = 0 ms.

    I have no idea why my NGrams solution is so much faster considering that we are essentially doing the same thing. I included the query plans for both.

    Lowell: I hope this helps.

    -- Alan Burstein


    Helpful links:Best practices for getting help on SQLServerCentral -- Jeff ModenHow to Post Performance Problems -- Gail ShawNasty fast set-based string manipulation functions:For splitting strings try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL Server 2012+)To split strings based on patterns try PatternSplitCMNeed to clean or transform a string? try NGrams, PatExclude8K, PatReplace8K, DigitsOnlyEE, or Translate8KI cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code.  -- Itzik Ben-Gan 2001

  • Sean Lange

    SSC Guru

    Points: 286366

    Alan.B (6/25/2014)


    Gents

    First: Dang it Sean - you beet me to it! I had a very similar solution but you posted yours first. That said, I did do a performance test and ran into something I cannot figure out.

    Nice. I didn't see your test data and haven't had a chance to roll my own yet. I am guessing that the big difference is that my code uses STUFF and yours does not.

    If you still have your test harness handy try with this.

    ALTER FUNCTION [dbo].[StripNonNumeric_itvf](@OriginalText VARCHAR(8000))

    RETURNS TABLE WITH SCHEMABINDING AS

    return

    WITH

    E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    Tally(N) AS

    (

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    , clean_text(CleanedText) as

    (

    SELECT SUBSTRING(@OriginalText, t.N, 1)

    FROM tally t

    WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57

    FOR XML PATH('')

    )

    select CleanedText

    from clean_text

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Alan Burstein

    SSC Guru

    Points: 61006

    Sean Lange (6/25/2014)


    Alan.B (6/25/2014)


    Gents

    First: Dang it Sean - you beet me to it! I had a very similar solution but you posted yours first. That said, I did do a performance test and ran into something I cannot figure out.

    Nice. I didn't see your test data and haven't had a chance to roll my own yet. I am guessing that the big difference is that my code uses STUFF and yours does not.

    If you still have your test harness handy try with this.

    ALTER FUNCTION [dbo].[StripNonNumeric_itvf](@OriginalText VARCHAR(8000))

    RETURNS TABLE WITH SCHEMABINDING AS

    return

    WITH

    E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    Tally(N) AS

    (

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    , clean_text(CleanedText) as

    (

    SELECT SUBSTRING(@OriginalText, t.N, 1)

    FROM tally t

    WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57

    FOR XML PATH('')

    )

    select CleanedText

    from clean_text

    What you posted was getting the same results (31 seconds)... I added this to the where clause:

    AND n <= len(@OriginalText)[/code]

    Final Version

    ALTER FUNCTION [dbo].[StripNonNumeric_itvf_sean2](@OriginalText VARCHAR(8000))

    RETURNS TABLE WITH SCHEMABINDING AS

    return

    WITH

    E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    Tally(N) AS

    (

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    , clean_text(CleanedText) as

    (

    SELECT SUBSTRING(@OriginalText, t.N, 1)

    FROM tally t

    WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57

    AND n <= len(@OriginalText) -- added by ajb

    FOR XML PATH('')

    )

    select CleanedText

    from clean_text

    GO

    It now runs in Milliseconds, same speed as mine. ~30 seconds for a million rows on my system using the test harness I posted earlier.

    Edit: Some of my code got turned into HTML tags. Underlined my change.

    -- Alan Burstein


    Helpful links:Best practices for getting help on SQLServerCentral -- Jeff ModenHow to Post Performance Problems -- Gail ShawNasty fast set-based string manipulation functions:For splitting strings try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL Server 2012+)To split strings based on patterns try PatternSplitCMNeed to clean or transform a string? try NGrams, PatExclude8K, PatReplace8K, DigitsOnlyEE, or Translate8KI cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code.  -- Itzik Ben-Gan 2001

  • Sean Lange

    SSC Guru

    Points: 286366

    Alan.B (6/25/2014)


    Sean Lange (6/25/2014)


    Alan.B (6/25/2014)


    Gents

    First: Dang it Sean - you beet me to it! I had a very similar solution but you posted yours first. That said, I did do a performance test and ran into something I cannot figure out.

    Nice. I didn't see your test data and haven't had a chance to roll my own yet. I am guessing that the big difference is that my code uses STUFF and yours does not.

    If you still have your test harness handy try with this.

    ALTER FUNCTION [dbo].[StripNonNumeric_itvf](@OriginalText VARCHAR(8000))

    RETURNS TABLE WITH SCHEMABINDING AS

    return

    WITH

    E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    Tally(N) AS

    (

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    , clean_text(CleanedText) as

    (

    SELECT SUBSTRING(@OriginalText, t.N, 1)

    FROM tally t

    WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57

    FOR XML PATH('')

    )

    select CleanedText

    from clean_text

    What you posted was getting the same results (31 seconds)... I added this to the where clause:

    AND n <= len(@OriginalText)

    Final Version

    ALTER FUNCTION [dbo].[StripNonNumeric_itvf_sean2](@OriginalText VARCHAR(8000))

    RETURNS TABLE WITH SCHEMABINDING AS

    return

    WITH

    E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    Tally(N) AS

    (

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    , clean_text(CleanedText) as

    (

    SELECT SUBSTRING(@OriginalText, t.N, 1)

    FROM tally t

    WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57

    AND n <= len(@OriginalText) -- added by ajb

    FOR XML PATH('')

    )

    select CleanedText

    from clean_text

    GO

    It now runs in Milliseconds, same speed as mine. ~30 seconds for a million rows on my system using the test harness I posted earlier.

    Edit: Some of my code got turned into HTML tags.

    That makes sense. No point evaluating all the 10,000 rows when they won't meet the condition anyway. I suspect if we changed my original (with the STUFF) it would perform much the same with the inclusion of the additional predicate in the where clause. Awesome work Alan.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Alan Burstein

    SSC Guru

    Points: 61006

    That makes sense. No point evaluating all the 10,000 rows when they won't meet the condition anyway. I suspect if we changed my original (with the STUFF) it would perform much the same with the inclusion of the additional predicate in the where clause. Awesome work Alan.

    Thanks Sean!

    -- Alan Burstein


    Helpful links:Best practices for getting help on SQLServerCentral -- Jeff ModenHow to Post Performance Problems -- Gail ShawNasty fast set-based string manipulation functions:For splitting strings try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL Server 2012+)To split strings based on patterns try PatternSplitCMNeed to clean or transform a string? try NGrams, PatExclude8K, PatReplace8K, DigitsOnlyEE, or Translate8KI cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code.  -- Itzik Ben-Gan 2001

  • mickyT

    SSChampion

    Points: 10360

    Hi

    Interestingly enough I tried a similar test on 2012 for the originals and got elapsed times that so close as to be the same. It may have been the strings I was using though.

    Bring on a PATREPLACE function 😀

    ChrisM's pattern splitter from Dwain's article [/url]may also be able to be used here

    select cleanedstring

    FROM #strings s

    CROSS APPLY (

    SELECT Item + '' FROM PatternSplitCM(string,'[^0-9]') WHERE MATCHED = 0 FOR XML PATH ('')

    ) x (cleanedstring)

  • Alan Burstein

    SSC Guru

    Points: 61006

    mickyT (6/25/2014)


    Hi

    Interestingly enough I tried a similar test on 2012 for the originals and got elapsed times that so close as to be the same. It may have been the strings I was using though.

    Bring on a PATREPLACE function 😀

    ChrisM's pattern splitter from Dwain's article [/url]may also be able to be used here

    select cleanedstring

    FROM #strings s

    CROSS APPLY (

    SELECT Item + '' FROM PatternSplitCM(string,'[^0-9]') WHERE MATCHED = 0 FOR XML PATH ('')

    ) x (cleanedstring)

    I recently came up with an iTVF PatExclude function that removes characters based on a pattern and then I remembered this post. Check out the examples in the code...

    CREATE FUNCTION dbo.PatExclude8K

    ( @string varchar(8000),

    @exclude_pattern varchar(50)

    )

    RETURNS TABLE

    AS

    /*

    -- remove letters

    SELECT * FROM dbo.PatExclude8K('abc123!', '[a-z]');

    -- remove numbers

    SELECT * FROM dbo.PatExclude8K('abc123!', '[0-9]');

    -- only include letters and numbers

    SELECT * FROM dbo.PatExclude8K('###abc123!!!', '[^0-9a-z]');

    -- Remove spaces

    SELECT * FROM dbo.PatExclude8K('XXX 123 ZZZ', '['+char(32)+']');

    -- only include letters and "!, ? or ."

    SELECT * FROM dbo.PatExclude8K('123# What?!... ', '[^a-z!?.]')

    */

    RETURN

    WITH

    E1(n) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),

    E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b), --10E+2 or 100 rows

    E4(n) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b), --10E+4 or 10,000 rows max

    iTally(N) AS

    (

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    ),

    stringToTable AS

    (

    SELECT TOP (len(@string)) n, s = substring(@string,n,1)

    FROM iTally

    WHERE n <= len(@string)

    ),

    newstring(newstring) AS

    (

    SELECT s+''

    FROM stringToTable

    WHERE s NOT LIKE @exclude_pattern

    ORDER BY n

    FOR XML PATH('')

    )

    SELECT newstring

    FROM newstring;

    GO

    Next I tested it against PatternSplitCM for removing non-numeric....

    100K row test harness:

    IF OBJECT_ID('tempdb..#val') IS NOT NULL DROP TABLE #val;

    CREATE TABLE #val (txt varchar(36) primary key);

    INSERT INTO #val

    SELECT TOP (100000) NEWID()

    FROM sys.all_columns a CROSS JOIN sys.all_columns b;

    GO

    dbcc freeproccache;

    dbcc dropcleanbuffers;

    DECLARE @string varchar(50);

    SET STATISTICS TIME ON;

    PRINT 'PatExcludeAB:';

    SELECT @string = newstring

    FROM #val

    CROSS APPLY dbo.PatExclude8K(txt, '[^0-9]');

    PRINT 'PatternSplitCM:';

    select @string = cleanedstring

    FROM #val s

    CROSS APPLY

    (

    SELECT Item + ''

    FROM dbo.PatternSplitCM(txt,'[^0-9]')

    WHERE MATCHED = 0 FOR XML PATH ('')

    ) x (cleanedstring);

    SET STATISTICS TIME OFF;

    GO

    Results:

    PatExcludeAB:

    SQL Server Execution Times:

    CPU time = 2250 ms, elapsed time = 2271 ms.

    PatternSplitCM:

    SQL Server Execution Times:

    CPU time = 7109 ms, elapsed time = 7274 ms[/code]

    -- Alan Burstein


    Helpful links:Best practices for getting help on SQLServerCentral -- Jeff ModenHow to Post Performance Problems -- Gail ShawNasty fast set-based string manipulation functions:For splitting strings try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL Server 2012+)To split strings based on patterns try PatternSplitCMNeed to clean or transform a string? try NGrams, PatExclude8K, PatReplace8K, DigitsOnlyEE, or Translate8KI cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code.  -- Itzik Ben-Gan 2001

  • Jeff Moden

    SSC Guru

    Points: 993661

    Out of all of these submittals, which post has the fastest code? I'd check for myself but I'm on the run right now but want to check the "best" against an old script that I have for this.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems

  • Alan Burstein

    SSC Guru

    Points: 61006

    Jeff Moden (10/27/2014)


    Out of all of these submittals, which post has the fastest code? I'd check for myself but I'm on the run right now but want to check the "best" against an old script that I have for this.

    Basically we have four:

    The first solution Sean posted, my solution using nGrams, Micky's using paternSplitCM and the one I posted today. I wrapped them all in functions:

    -----------------------------------------------------------------------

    -- (1) Sean's origninal solution

    -----------------------------------------------------------------------

    CREATE FUNCTION dbo.StripNonNumeric_itvf(@OriginalText VARCHAR(8000))

    RETURNS TABLE WITH SCHEMABINDING AS

    return

    WITH

    E1(N) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),

    E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows

    E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max

    Tally(N) AS

    (

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    )

    select STUFF(

    (

    SELECT SUBSTRING(@OriginalText, t.N, 1)

    FROM tally t

    WHERE ASCII(SUBSTRING(@OriginalText, t.N, 1)) BETWEEN 48 AND 57

    FOR XML PATH('')

    ), 1 ,0 , '') as CleanedText;

    -----------------------------------------------------------------------

    -- (2) My Solution using nGrams8K

    -----------------------------------------------------------------------

    ;-- nGrams

    CREATE FUNCTION [dbo].[nGrams8K]

    (

    @string VARCHAR(8000),

    @n TINYINT,

    @pad BIT=0

    )

    RETURNS TABLE

    WITH SCHEMABINDING

    AS

    RETURN

    WITH

    E1(n) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(n)),

    E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),

    E4(n) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b),

    iTally(n) AS

    (

    SELECT TOP (len(@string)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM E4

    ),

    NewString(NewString) AS

    (

    SELECT REPLICATE(CASE @pad WHEN 0 THEN '' ELSE ' ' END,@n-1)+@string+

    REPLICATE(CASE @pad WHEN 0 THEN '' ELSE ' ' END,@n-1)

    )

    SELECT n AS sequence,

    SUBSTRING(NewString,n,@n) AS token

    FROM iTally

    CROSS APPLY NewString

    WHERE n < ((@n)+LEN(@string));

    GO

    -- stripNonNumeric_itvf_ajb

    CREATE FUNCTION dbo.StripNonNumeric_itvf_ajb(@OriginalText VARCHAR(8000))

    RETURNS TABLE WITH SCHEMABINDING AS

    return

    WITH ngrams AS

    (

    SELECT n = [sequence],

    c = token

    FROM dbo.nGrams8K(@OriginalText,1,1)

    ),

    clean_txt(CleanedText) AS

    (

    SELECT c+''

    FROM ngrams

    WHERE ascii(substring(@OriginalText,n,1)) BETWEEN 48 AND 57

    FOR XML PATH('')

    )

    SELECT CleanedText

    FROM clean_txt;

    GO

    -----------------------------------------------------------------------

    -- (3) Solution Using PatExclude8K

    -----------------------------------------------------------------------

    CREATE FUNCTION dbo.PatExclude8K

    ( @string varchar(8000),

    @exclude_pattern varchar(50)

    )

    RETURNS TABLE

    AS

    RETURN

    WITH

    E1(n) AS (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))dt(n)),

    E2(n) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b), --10E+2 or 100 rows

    E4(n) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b), --10E+4 or 10,000 rows max

    iTally(N) AS

    (

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4

    ),

    stringToTable AS

    (

    SELECT TOP (len(@string)) n, s = substring(@string,n,1)

    FROM iTally

    WHERE n <= len(@string)

    ),

    newstring(newstring) AS

    (

    SELECT s+''

    FROM stringToTable

    WHERE s NOT LIKE @exclude_pattern

    ORDER BY n

    FOR XML PATH('')

    )

    SELECT newstring

    FROM newstring;

    GO

    -----------------------------------------------------------------------

    -- (4) Using PatternSplitCM

    -----------------------------------------------------------------------

    CREATE FUNCTION dbo.StripNonNumeric_itvf_pscm(@OriginalText VARCHAR(8000))

    RETURNS TABLE

    --WITH SCHEMABINDING AS

    return

    WITH newstring(newstring)

    AS

    (

    SELECT item+''

    FROM PatternSplitCM(@OriginalText, '[^0-9]')

    WHERE [matched] = 0

    FOR XML PATH('')

    )

    SELECT newstring

    FROM newstring;

    GO

    -----------------------------------------------------------------------

    -- Make sure they work

    -----------------------------------------------------------------------

    SELECT *

    FROM dbo.StripNonNumeric_itvf('abc123!');

    SELECT *

    FROM dbo.StripNonNumeric_itvf_pscm('abc123!');

    SELECT *

    FROM dbo.StripNonNumeric_itvf_ajb('abc123!');

    SELECT *

    FROM dbo.PatExclude8K('abc123!', '[^0-9]');

    then created this test harness...

    -----------------------------------------------------------------------

    -- 100K ROW TEST HARNESS

    -----------------------------------------------------------------------

    IF OBJECT_ID('tempdb..#val') IS NOT NULL DROP TABLE #val;

    CREATE TABLE #val (txt varchar(36) primary key);

    INSERT INTO #val

    SELECT TOP (100000) NEWID()

    FROM sys.all_columns a CROSS JOIN sys.all_columns b;

    GO

    set nocount on;

    dbcc freeproccache;

    dbcc dropcleanbuffers;

    DECLARE @string varchar(50);

    SET STATISTICS TIME ON;

    PRINT 'Original (skipped; runs 30+ seconds for 10K rows):';

    --SELECT CleanedText

    --FROM #val

    --CROSS APPLY dbo.StripNonNumeric_itvf(txt);

    PRINT 'Using PatternSplitCM:';

    SELECT @string = newstring

    FROM #val

    CROSS APPLY dbo.StripNonNumeric_itvf_pscm(txt);

    PRINT 'Using nGrams:';

    SELECT @string = CleanedText

    FROM #val

    CROSS APPLY dbo.StripNonNumeric_itvf_ajb(txt);

    PRINT 'Using PatExclude8K:';

    SELECT @string = newstring

    FROM #val

    CROSS APPLY dbo.PatExclude8K(txt, '[^0-9]');

    SET STATISTICS TIME OFF;

    GO

    the results...

    Original (skipped; runs 30+ seconds for 10K rows):

    Using PatternSplitCM:

    SQL Server Execution Times:

    CPU time = 7187 ms, elapsed time = 7239 ms.

    Using nGrams:

    SQL Server Execution Times:

    CPU time = 2594 ms, elapsed time = 2591 ms.

    PatExclude8K:

    SQL Server Execution Times:

    CPU time = 2266 ms, elapsed time = 2276 ms.

    The results were pretty much the same when tested against hot or cold cache.

    Note: the first 2 functions are the original code from a couple months back.

    Edit: fixed print statement, added note

    -- Alan Burstein


    Helpful links:Best practices for getting help on SQLServerCentral -- Jeff ModenHow to Post Performance Problems -- Gail ShawNasty fast set-based string manipulation functions:For splitting strings try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL Server 2012+)To split strings based on patterns try PatternSplitCMNeed to clean or transform a string? try NGrams, PatExclude8K, PatReplace8K, DigitsOnlyEE, or Translate8KI cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code.  -- Itzik Ben-Gan 2001

  • Jeff Moden

    SSC Guru

    Points: 993661

    I took the two contenders from your good post, Alan, and added on of my own. Here's the code for the function I use for such things. Yeah... you'll be shocked. It's not only a scalar function but it also has a WHILE loop in it.

    SET ANSI_NULLS ON

    GO

    SET QUOTED_IDENTIFIER ON

    GO

    CREATE FUNCTION [dbo].[CleanString]

    /*******************************************************************************

    Purpose:

    Given a string and a pattern of characters to remove, remove the patterned

    characters from the string.

    Usage:

    --===== Basic Syntax Example

    SELECT CleanedString = dbo.CleanString(@pSomeString,@pPattern)

    ;

    --===== Remove all but Alpha characters

    SELECT CleanedString = dbo.CleanString(st.SomeString,'%[^A-Za-z]%');

    FROM dbo.SomeTable st

    ;

    --===== Remove all but Numeric digits

    SELECT CleanedString = dbo.CleanString(st.SomeString,'%[^0-9]%');

    FROM dbo.SomeTable st

    ;

    Programmer Notes:

    1. @pPattern is case sensitive.

    2. The pattern set of characters must be for just one character.

    Revision History:

    Rev 00 - Circa 2007 - George Mastros?

    - Initial find on the web

    Rev 01 - 29 Mar 2007 - Jeff Moden

    - Optimize to remove one instance of PATINDEX from the loop.

    - Add code to use the pattern as a parameter.

    Rev 02 - 26 May 2013 - Jeff Moden

    - Add case sensitivity

    *******************************************************************************/

    (@pString VARCHAR(8000),@pPattern VARCHAR(100))

    RETURNS VARCHAR(8000) AS

    BEGIN

    DECLARE @Pos SMALLINT;

    SELECT @Pos = PATINDEX(@pPattern,@pString COLLATE Latin1_General_BIN);

    WHILE @Pos > 0

    SELECT @pString = STUFF(@pString,@Pos,1,''),

    @Pos = PATINDEX(@pPattern,@pString COLLATE Latin1_General_BIN);

    RETURN @pString;

    END

    ;

    Here's the test harness that I used. It runs each function through the 100K row table 3 times.

    --===== Create the 100K row test table

    IF OBJECT_ID('tempdb..#val') IS NOT NULL DROP TABLE #val

    ;

    SELECT TOP 100000

    txt = ISNULL(CONVERT(VARCHAR(36),NEWID()),'')

    INTO #val

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    ;

    ALTER TABLE #Val

    ADD PRIMARY KEY CLUSTERED (txt)

    ;

    --===== Do the tests. Had to use duration because one

    -- of the tests is on the new scalar function and

    -- SET STATISTICS doesn't report on those correctly.

    GO

    PRINT '========== Using nGrams ==========';

    DECLARE @String VARCHAR(36)

    ,@StartTime DATETIME

    ;

    SELECT @StartTime = GETDATE()

    ;

    SELECT @string = CleanedText

    FROM #val

    CROSS APPLY dbo.StripNonNumeric_itvf_ajb(txt)

    ;

    PRINT DATEDIFF(ms,@StartTime,GETDATE())

    ;

    GO 3

    PRINT '========== Using PatExclude8K ==========';

    DECLARE @String VARCHAR(36)

    ,@StartTime DATETIME

    ;

    SELECT @StartTime = GETDATE()

    ;

    SELECT @string = newstring

    FROM #val

    CROSS APPLY dbo.PatExclude8K(txt, '[^0-9]')

    ;

    PRINT DATEDIFF(ms,@StartTime,GETDATE())

    ;

    GO 3

    PRINT '========== Jeff''s Old Scalar Loop Function ==========';

    DECLARE @String VARCHAR(36)

    ,@StartTime DATETIME

    ;

    SELECT @StartTime = GETDATE()

    ;

    SELECT @String = dbo.CleanString(txt,'%[^0-9]%')

    FROM #val

    ;

    PRINT DATEDIFF(ms,@StartTime,GETDATE())

    ;

    GO 3

    Here are the run results. Again, you'll be shocked. This is the one place where I've not been able to make a Tally Table solution able to beat it. Lord knows I and other good folks have tried.

    (100000 row(s) affected)

    Beginning execution loop

    ========== Using nGrams ==========

    2916

    ========== Using nGrams ==========

    2893

    ========== Using nGrams ==========

    2890

    Batch execution completed 3 times.

    Beginning execution loop

    ========== Using PatExclude8K ==========

    2703

    ========== Using PatExclude8K ==========

    2640

    ========== Using PatExclude8K ==========

    2653

    Batch execution completed 3 times.

    Beginning execution loop

    ========== Jeff's Old Scalar Loop Function ==========

    2413

    ========== Jeff's Old Scalar Loop Function ==========

    2500

    ========== Jeff's Old Scalar Loop Function ==========

    2466

    Batch execution completed 3 times.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
    "If you think its expensive to hire a professional to do the job, wait until you hire an amateur."--Red Adair
    "Change is inevitable... change for the better is not."
    When you put the right degree of spin on it, the number 3|8 is also a glyph that describes the nature of a DBAs job. 😉

    Helpful Links:
    How to post code problems

  • Eirikur Eiriksson

    SSC Guru

    Points: 182321

    Here is my attempt from the train journey to work this morning, looks like it will give the while loop a real run for the money.

    😎

    /********************************************************************

    -- Stripping out any non-numerical characters

    -- EE 2014-10-28

    ********************************************************************/

    CREATE FUNCTION dbo.STRIP_NUM_EE

    (

    @INSTR VARCHAR(8000)

    )

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN

    WITH T(N) AS (SELECT N FROM ( VALUES (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)) AS X(N))

    ,NUMS(N) AS (SELECT TOP(CONVERT(BIGINT,LEN(@INSTR),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N FROM T T1,T T2,T T3,T T4)

    SELECT

    (

    SELECT

    CASE WHEN CONVERT(CHAR(1),(ASCII(SUBSTRING(@INSTR COLLATE Latin1_General_BIN,NM.N,1)) - 48),0)

    = SUBSTRING(@INSTR COLLATE Latin1_General_BIN,NM.N,1) THEN SUBSTRING(@INSTR,NM.N,1) END

    FROM NUMS NM

    FOR XML PATH('')

    ) AS OUT_STR

    ;

    Test results

    Beginning execution loop

    ========== Using nGrams ==========

    1640

    ========== Using nGrams ==========

    1640

    ========== Using nGrams ==========

    1640

    Batch execution completed 3 times.

    Beginning execution loop

    ========== Using PatExclude8K ==========

    2686

    ========== Using PatExclude8K ==========

    2730

    ========== Using PatExclude8K ==========

    2700

    Batch execution completed 3 times.

    Beginning execution loop

    ========== Jeff's Old Scalar Loop Function ==========

    1556

    ========== Jeff's Old Scalar Loop Function ==========

    1563

    ========== Jeff's Old Scalar Loop Function ==========

    1546

    Batch execution completed 3 times.

    Beginning execution loop

    ========== dbo.STRIP_NUM_EE Function ==========

    1490

    ========== dbo.STRIP_NUM_EE Function ==========

    1513

    ========== dbo.STRIP_NUM_EE Function ==========

    1473

    Batch execution completed 3 times.

Viewing 15 posts - 1 through 15 (of 46 total)

You must be logged in to reply to this topic. Login to reply