• Solomon Rutzky (6/26/2016)


    Hey Alan. Interesting request. I do have a suggestion for the code, but first I would like to mention that it is probably best to state at the beginning that the desire for the code is to do performance comparisons as it will help inform / guide responders. For example, it would help when deciding between expediency of writing the code and efficiency of processing the code. Case in point: LINQ is very powerful, flexible, and expressive. However, that does come at a cost at runtime.

    Also, when comparing performance between T-SQL and SQLCLR code, it is important to test both a VARCHAR(4000) and NVARCHAR(4000) versions of the T-SQL code since SQLCLR can only do NVARCHAR. So testing both using NVARCHAR(4000) will get the truest sense of performance difference, and testing with VARCHAR(4000) will provide a pragmatic difference since if one knows that their data will only ever be VARCHAR, then that reality won't be constrained with trying to be fair in testing ;-). Please note that I am using 4000 instead of 8000 for the VARCHAR datatypes since it doesn't help to test 8000 characters against SQLCLR using NVARCHAR(4000) as the extra 4000 characters will be silently truncated, resulting in a different input string given to the SQLCLR object.

    Here is the iTVF with Tally Table I used (I created the VARCHAR(4000) version by copying and pasting it, changing the name, and changing "@String" to be VARCHAR(4000)):

    CREATE FUNCTION dbo.NGramITVF(@String NVARCHAR(4000), @N INT)

    RETURNS TABLE

    --WITH SCHEMABINDING

    AS RETURN

    WITH nums AS

    (

    SELECT TOP (LEN(@string) - (@N - 1))

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS [Num]

    FROM [sys].[all_columns]

    )

    SELECT nums.[Num] AS [Position],

    SUBSTRING(@string, nums.[Num], @N) AS [Token]

    FROM nums;

    And here is the .NET / C# code that I used for the SQLCLR streaming TVF:

    [Microsoft.SqlServer.Server.SqlFunction(FillRowMethodName = "NGramSolomon_FillRow",

    TableDefinition = "Position INT, Token NVARCHAR(4000)")]

    public static IEnumerable NGramSolomon(

    [SqlFacet(MaxSize = 4000)] SqlChars InputString, SqlInt32 TokenSize)

    {

    if (InputString.IsNull || TokenSize.IsNull)

    {

    yield break;

    }

    int _Take = TokenSize.Value;

    int _Limit = ((int)InputString.Length - _Take);

    if (_Take < 1 || _Limit < 0)

    {

    yield break;

    }

    ResultRow _Result = new ResultRow();

    char[] _Chars = InputString.Value;

    char[] _Temp = new char[_Take];

    for (int _Index = 0; _Index <= _Limit; _Index++)

    {

    Array.Copy(_Chars, _Index, _Temp, 0, _Take);

    _Result.Position = (_Index + 1);

    _Result.Token = new SqlChars(_Temp);

    yield return _Result;

    }

    }

    public class ResultRow

    {

    public int Position;

    public SqlChars Token;

    }

    public static void NGramSolomon_FillRow(Object obj,

    out SqlInt32 position, out SqlChars token)

    {

    ResultRow _Item = (ResultRow)obj;

    position = new SqlInt32(_Item.Position);

    token = _Item.Token;

    }

    Test results:

    NVARCHAR(4000) Tests (4000 characters):

    NGLoop : 60 - 80 ms

    SQLCLR_ConAlexis : 375 - 400 ms

    iTVF_TallyTable : 2 - 6 ms

    iTVF_TallyTableVC : 2 - 6 ms

    SQLCLR_Solomon : 2 - 9 ms

    NVARCHAR(MAX) Tests (40,000 characters):

    iTVF_TallyTable : 34 - 50 ms

    SQLCLR_ConAlexis : 39,000 ms

    SQLCLR_Solomon : 34 - 56 ms

    As you can see, the T-SQL version is only a few milliseconds faster than the SQLCLR. Of course, perhaps Alan's version using his "numbersAB" iTVF will be slightly faster, or maybe had I done an inline Tally Table then the T-SQL version might have been slightly faster. But that still wouldn't be much of a difference.

    Still, it seems that there is no compelling reason to use SQLCLR for this particular task. HOWEVER, in so far as the NGram stuff might be used for string searches, it should be noted that depending on what those strings are, there could be a case for using SQLCLR. That case would be if the input strings use combining characters. Combining characters are accents and other marks that can be added to many letters and will display in the same position as the base character, appearing as a single character. And multiple combining characters can be used. When present, the base character and any number of combining characters that follow it need to be treated as a single unit. T-SQL has no facility for handling combining sequences, but .NET does. And no, combining characters are not the same as Supplementary Characters, which can be handled correctly in T-SQL when using a Collation ending in _SC, which were introduced in SQL Server 2012.

    Take care,

    Solomon..

    Great stuff Solomon! Its been a busy couple of weeks. I'm going to compile and play around with this code later today and post back. Thanks again sir.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001