Home Forums Programming XML Help with XML Splitter De-entitization, please. RE: Help with XML Splitter De-entitization, please.

  • To fix the problem with XML special characters in your splitter you can use "select @List for xml path('')" to create entities for your invalid characters.

    The entities will be converted back to the original characters in the value() function.

    CREATE FUNCTION dbo.SplitStrings_XML

    (

    @List NVARCHAR(MAX),

    @Delimiter NVARCHAR(255)

    )

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN (

    SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)')

    FROM (SELECT x = CONVERT(XML,

    '<i>'

    + REPLACE((SELECT @List FOR XML PATH('')), @Delimiter, '</i><i>')

    + '</i>'

    ).query('.')

    ) AS a CROSS APPLY x.nodes('i') AS y(i)

    )

    ;

    Using CDATA to fix this introduces another possible error and that is if the value contains the sequence of characters ]]>. That specific sequence is invalid in a CDATA since it is used to recognize the end of a CDATA section.

    The reason why the rewrite of Eirikur was slower is blogged about by Paul White in Compute Scalars, Expressions and Execution Plan Performance. The cast to XML is deferred until it is actually needed so SQL Server will perform the convert (parse and validate) once for each call to nodes() and values() function. Using .query('.') on result of the cast is a workaround to prevent the deferred compute scalar.

    While I'm at it I might as well throw in a version of a XML splitter of my own. Please feel free to try it against your test data and use it in your blog post if you want to.

    Instead of creating an XML element with one value in each element I instead insert an empty element as a replacement of the delimiter like this aa<i/>bb<i/>cc.

    That XML has three text() elements at the root level and I use nodes() to shred out those text() elements. As a fun twist I also do something that everyone says never to do ever. I put it in a Multi-Statement Table Function. That way you can work with an XML variable instead of a casted string. It is in my testing a bit faster. The drawback is of course the need to invoke the function for each row instead of having the splitting inline.

    create function dbo.SplitStrings_XML

    (

    @List nvarchar(max),

    @Delimiter nvarchar(255)

    )

    returns @returntable table

    (

    Item nvarchar(4000)

    ) with schemabinding

    as

    begin

    declare @X xml = replace((select @List for xml path('')), @Delimiter, '<i/>');

    insert @returntable(Item)

    select T.X.value('.', 'nvarchar(4000)')

    from @X.nodes('text()') as T(X)

    return

    end

    You can of course use the root text() node splitting in an inline TVF as well. I found it to be slightly faster than your version but not as fast as the multi line TVF.

    CREATE FUNCTION dbo.SplitStrings_XML

    (

    @List NVARCHAR(MAX),

    @Delimiter NVARCHAR(255)

    )

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN (

    SELECT Item = y.i.value('.', 'nvarchar(4000)')

    FROM (SELECT x = CONVERT(XML, REPLACE((select @List for xml path('')), @Delimiter, '<i/>')

    ).query('.')

    ) AS a CROSS APPLY x.nodes('text()') AS y(i)

    )

    ;