Function for String Occurences Count

  • Comments posted to this topic are about the item Function for String Occurences Count

  • Nice but can't help but feel it's overcomplicating it a bit. What about:

    DECLARE @ReturnValue AS INT

    SET @ReturnValue = 0

    IF LEN(@SearchString) > 0

    BEGIN

    SET @ReturnValue = (LEN(@InputString) - LEN(REPLACE(@InputString,@SearchString,''))) / LEN(@SearchString)

    END

    RETURN @ReturnValue

  • Great.....Super modification...Thanks for your input........

  • erlend.miller 3298 - I love the simplicity of your approach! What if we add COALESCE(NULLIF()) to the calculation to take care of the potential divide by zero error if the @SearchString is passed in as an empty string? This eliminates the need for the IF statement and further simplifies your solution. In practice we would want to handle NULLs being passed into the parameters. Expanding on your solution, here is how we can pull it all together in a function that I will be adding to my SQL toolbox. Thanks!

    IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[fn_get_string_occurrences_count]') AND type in (N'FN', N'IF', N'TF', N'FS', N'FT'))

    DROP FUNCTION [dbo].[fn_get_string_occurrences_count]

    GO

    CREATE FUNCTION dbo.fn_get_string_occurrences_count

    (

    @InputString VARCHAR(MAX),

    @SearchString VARCHAR(256)

    )

    RETURNS INT

    AS

    BEGIN

    RETURN (LEN(COALESCE(@InputString, ''))

    - LEN(REPLACE(COALESCE(@InputString, ''), COALESCE(@SearchString, ''), '')))

    / LEN(COALESCE(NULLIF(@SearchString,''), 'X'))

    END

    GO

  • Thanks. I think I may have something to actually use this on.

  • You can use NGrams8K for a set-based way of solving this kind of thing.

    The function:

    CREATE FUNCTION dbo.NGrams8k

    (

    @string varchar(8000), -- Input string

    @N int -- requested token size

    )

    /****************************************************************************************

    Purpose:

    A character-level @N-Grams function that outputs a contiguous stream of @N-sized tokens

    based on an input string (@string). Accepts strings up to 8000 varchar characters long.

    For more information about N-Grams see: http://en.wikipedia.org/wiki/N-gram.

    Compatibility:

    SQL Server 2008+ and Azure SQL Database

    Syntax:

    --===== Autonomous

    SELECT position, token FROM dbo.NGrams8k(@string,@N);

    --===== Against a table using APPLY

    SELECT s.SomeID, ng.position, ng.string

    FROM dbo.SomeTable s

    CROSS APPLY dbo.NGrams8K(s.SomeValue,@N) ng;

    Parameters:

    @string = The input string to split into tokens.

    @N = The size of each token returned.

    Returns:

    Position = bigint; the position of the token in the input string

    token = varchar(8000); a @N-sized character-level N-Gram token

    Examples:

    --===== Turn the string, 'abcd' into unigrams, bigrams and trigrams

    SELECT position, token FROM dbo.NGrams8k('abcd',1); -- unigrams (@N=1)

    SELECT position, token FROM dbo.NGrams8k('abcd',2); -- bigrams (@N=2)

    SELECT position, token FROM dbo.NGrams8k('abcd',3); -- trigrams (@N=1)

    --===== How many times the substring "AB" appears in each record

    DECLARE @table TABLE(stringID int identity primary key, string varchar(100));

    INSERT @table(string) VALUES ('AB123AB'),('123ABABAB'),('!AB!AB!'),('AB-AB-AB-AB-AB');

    SELECT string, occurances = COUNT(*)

    FROM @table t

    CROSS APPLY dbo.NGrams8k(t.string,2) ng

    WHERE ng.token = 'AB'

    GROUP BY string;

    Developer Notes:

    1. This function is not case sensitive

    2. Many functions that use NGrams8k will see a huge performance gain when the optimizer

    creates a parallel query plan. One way to get a parallel query plan (if the optimizer

    does not chose one) is to use make_parallel by Adam Machanic which can be found here:

    sqlblog.com/blogs/adam_machanic/archive/2013/07/11/next-level-parallel-plan-porcing.aspx

    3. When @N is less than 1 or greater than the datalength of the input string then no

    tokens (rows) are returned.

    4. This function can also be used as a tally table with the position column being your

    "N" row. To do so use REPLICATE to create an imaginary string, then use NGrams8k to

    split it into unigrams then only return the position column. NGrams8k will get you up

    to 8000 numbers. There will be no performance penalty for sorting by position in

    ascending order but there is for sorting in descending order. To get the numbers in

    descending order without forcing a sort in the query plan use the following formula:

    N = <highest number>-position+1.

    Pseudo Tally Table Examples:

    --===== (1) Get the numbers 1 to 100 in ascending order:

    SELECT N = position FROM dbo.NGrams8k(REPLICATE(0,100),1);

    --===== (2) Get the numbers 1 to 100 in descending order:

    DECLARE @maxN int = 100;

    SELECT N = @maxN-position+1

    FROM dbo.NGrams8k(REPLICATE(0,@maxN),1)

    ORDER BY position;

    -- note: you don't need a variable, I used one to make the example easier to understand.

    ----------------------------------------------------------------------------------------

    Revision History:

    Rev 00 - 20140310 - Initial Development - Alan Burstein

    Rev 01 - 20150522 - Removed DQS N-Grams functionality, improved iTally logic. Also Added

    conversion to bigint in the TOP logic to remove implicit conversion

    to bigint - Alan Burstein

    Rev 03 - 20150909 - Added logic to only return values if @N is greater than 0 and less

    than the length of @string. Updated comment section. - Alan Burstein

    Rev 04 - 20151029 - Added ISNULL logic to the TOP clause for the @string and @N

    parameters to prevent a NULL string or NULL @N from causing "an

    improper value" being passed to the TOP clause. - Alan Burstein

    ****************************************************************************************/

    RETURNS TABLE WITH SCHEMABINDING AS RETURN

    WITH

    L1(N) AS

    (

    SELECT 1

    FROM (VALUES -- 90 NULL values used to create the CTE Tally table

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),

    (NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL)

    ) t(N)

    ),

    iTally(N) AS -- my cte Tally table

    (

    SELECT TOP(ABS(CONVERT(BIGINT,(DATALENGTH(ISNULL(@string,''))-(ISNULL(@N,1)-1)),0)))

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -- Order by a constant to avoid a sort

    FROM L1 a CROSS JOIN L1 b -- cartesian product for 8100 rows (90^2)

    )

    SELECT

    position = N, -- position of the token in the string(s)

    token = SUBSTRING(@string,CAST(N AS int),@N) -- the @N-Sized token

    FROM iTally

    WHERE @N > 0 AND @N <= DATALENGTH(@string); -- force the function to ignore a "bad @N"

    The solution:

    DECLARE @string varchar(1000) = 'The quick brown fox jumps over the lazy dog',

    @searchString varchar(100) = 'the';

    SELECT COUNT(*)

    FROM dbo.NGrams8k(@string,LEN(@searchString))

    WHERE token = @searchString;

    if you need case sensitivity you would change the last line to WHERE token = @searchString COLLATE Latin1_General_BIN;

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply