A varchar(MAX) STRING_SPLIT function for SQL 2012 and above

  • fgrodriguez

    Valued Member

    Points: 55

    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO
    CREATE FUNCTION [dbo].[StringToTableStrings]
    (
    @cadena NVARCHAR(MAX),
    @separador NVARCHAR(MAX) = N',',
    @eliminarNulos BIT = 1,
    @trim BIT = 1
    ) RETURNS TABLE WITH SCHEMABINDING
    AS RETURN
    WITH lista(val) AS (
    SELECT @cadena val
    UNION ALL
    SELECT SUBSTRING(val, CHARINDEX(@separador, val) + DATALENGTH(@separador) / 2, DATALENGTH(val)/2) AS val
    FROM lista
    WHERE CHARINDEX(@separador, val) > 0
    ), pares AS (
    SELECT par
    FROM
    (SELECT SUBSTRING(val, 1, IIF(CHARINDEX(@separador, val) = 0, DATALENGTH(val)/2, CHARINDEX(@separador, val) - 1)) AS par
    FROM lista) AS valores
    WHERE ISNULL(@eliminarNulos, 0) != 1 OR (par IS NOT NULL AND RTRIM(LTRIM(par)) <> N'')
    )
    SELECT IIF(ISNULL(@trim, 0) != 1, par, LTRIM(RTRIM(par))) AS cadena1
    FROM pares
    GO

    Indeed, I had not tried it using space as a separator. This is the new corrected version.

    SELECT *
    FROM dbo.[StringToTableStrings](' Lorem ipsum dolor sit amet.',' ',1,0)
    OPTION(MAXRECURSION 0)

    I also know that recursion does not estimate well. And the function you wrote is more efficient.

    https://www.sqlservercentral.com/articles/hidden-rbar-counting-with-recursive-ctes

    Simply, in my developments, I needed to reuse the table returned by the operation and that's why I made an mTVF. And he also needed to eliminate empty chains and a trim of the results.

    I appreciate your efforts to study the function.

     

  • Solomon Rutzky

    SSCoach

    Points: 16256

    Jonathan AC Roberts wrote:

    Thom A wrote:

    having something that has a varchar passed to it that returns an nvarchar could be very bad for SARGability, if the column being compared afterwards is a varchar too (as the Column would be implicitly converted, meaning any indexes are effectively useless).

    I've just run a test and not converting the results of the dbo_STRING_SPLIT function and joining the results to a varchar still does an index seek on the table. ... It does a seek on both queries but the second query (without the convert varchar(5)) is a bit slower to execute

    Hi there.

    Regarding @thom-a's statement about implicit conversions: that is an unfortunately common misunderstanding / over-simplification. The CONVERT_IMPLICIT is not the cause of the performance degradation. It's Microsoft's horrible miscalculation that the pain of switching the default collation for OSs using the US English (i.e. en-US) locale / culture to Latin1_General_CI_AS (and dealing with a customers upgrading and having problems with queries against temp tables doing joins on string columns) would be worse than keeping an increasingly obsolete default collation for new installations (i.e. SQL_Latin1_General_CP1_CI_AS).  SQL Server collations (those starting with "SQL_") have different sort orders for VARCHAR and NVARCHAR data, and that is why the conversion (implicit or explicit) to NVARCHAR causes problems: the index isn't in the correct physical order, so it needs to be scanned or ignored. Please see my post, "Impact on Indexes When Mixing VARCHAR and NVARCHAR Types", for full details.

     

    Regarding @jonathan-ac-roberts 's test: you might want to include a GO between the two test queries. It might not make a difference, but I have found that quite often running multiple queries in a single batch will produce different performance results as compared to running them individually / in separate batches. Typically one of them, if executed in the same batch, ends up being affected for the worse than when executed independently.

     

    Take care,

    Solomon...

     

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Jeff Moden

    SSC Guru

    Points: 996645

    Solomon Rutzky wrote:

    Regarding @jonathan-ac-roberts 's test: you might want to include a GO between the two test queries. It might not make a difference, but I have found that quite often running multiple queries in a single batch will produce different performance results as compared to running them individually / in separate batches. Typically one of them, if executed in the same batch, ends up being affected for the worse than when executed independently.

    Excellent point, Solomon!  To wit, I've found that even the use of GO between batches is sometimes not enough.  Sometimes they need to physically be executed separately.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "If "pre-optimization" is the root of all evil, then what does the resulting no optimization lead to?"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Solomon Rutzky

    SSCoach

    Points: 16256

    Jeff Moden wrote:

    Solomon Rutzky wrote:

    Regarding @jonathan-ac-roberts 's test: you might want to include a GO between the two test queries. It might not make a difference, but I have found that quite often running multiple queries in a single batch will produce different performance results as compared to running them individually / in separate batches.

    Excellent point, Solomon!  To wit, I've found that even the use of GO between batches is sometimes not enough.  Sometimes they need to physically be executed separately.

     

    Hi Jeff. Yes, I have experienced the same, but was just trying to not throw too much into that reply ;-). But yes, nowadays when I do any performance testing of this nature, I nearly always do fully separate executions just to be sure. Thanks for mentioning that as I wasn't sure if I was the only one seeing such things.

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Jonathan AC Roberts

    SSCoach

    Points: 17296

    Jeff Moden wrote:

    Solomon Rutzky wrote:

    Regarding @jonathan-ac-roberts 's test: you might want to include a GO between the two test queries. It might not make a difference, but I have found that quite often running multiple queries in a single batch will produce different performance results as compared to running them individually / in separate batches. Typically one of them, if executed in the same batch, ends up being affected for the worse than when executed independently.

    Excellent point, Solomon!  To wit, I've found that even the use of GO between batches is sometimes not enough.  Sometimes they need to physically be executed separately.

    I've tried the test with a GO between and it does make a difference to the elapsed and CPU times reported in the "set statistics" messages. It brings the query without cast(varchar) down to be only about 10% slower than the query with the cast. Without the GO, the query without the cast(varchar) is about 3 times slower than the one with cast(varchar). I can't say I understand why though?

    As for the IO, there are 10 scan counts reported on #t1 for the query without  cast(varchar) compared to zero scans with it. This is true for whether or not the GO is included.

  • Jeff Moden

    SSC Guru

    Points: 996645

    fgrodriguez wrote:

    The usefulness of this function is not that of rapid execution, but the utility of the result.

    I've heard many people speak such words and they're usually the same ones begging for a performance fix some time later.  Performance  (ie, "Rapid Execution") is second only to correct functionality and it's a very close second.  Don't dismiss it so easily.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "If "pre-optimization" is the root of all evil, then what does the resulting no optimization lead to?"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 6 posts - 16 through 21 (of 21 total)

You must be logged in to reply to this topic. Login to reply