Help with XML Splitter De-entitization, please.

  • I'm writing an article on performance testing and one of the functions I'm testing is an XML CSV splitter. Here's the code from that splitter.

    CREATE FUNCTION dbo.SplitStrings_XML

    (

    @List NVARCHAR(MAX),

    @Delimiter NVARCHAR(255)

    )

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN (

    SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)')

    FROM (SELECT x = CONVERT(XML,

    '<i>'

    + REPLACE(@List, @Delimiter, '</i><i>')

    + '</i>'

    ).query('.')

    ) AS a CROSS APPLY x.nodes('i') AS y(i)

    )

    ;

    The problem that I'm having with it is that it bombs when you pass it a string with XML reserved characters in it as in the following...

    SELECT *

    FROM dbo.SplitStrings_XML('"T-SQL",&,<XML>', N',')

    ;

    ... which returns the following error...

    [font="Courier New"] Msg 9421, Level 16, State 1, Line 1

    XML parsing: line 1, character 19, illegal name character

    [/font]

    I'm terrible at XML and I've done several searches to find out how to de-entitize the special characters with no joy. How should the function be modified so that it can return XML special characters? What I'd expect for the example string I gave would be a table result that looks like this...

    [font="Courier New"]"T-SQL"

    &

    <XML>[/font]

    The function works fine if no XML special characters are used.

    SELECT *

    FROM dbo.SplitStrings_XML('T-SQL,M,XML', N',')

    ;

    Thanks for any help, folks. I sure do need it when it comes to XML.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

    in this case you will have to use CDATA (Character Data) elements in the construction of the XML.

    😎

    Here is the function with the necessary modification to handle any type of input.

    CREATE FUNCTION dbo.SplitStrings_XML

    (

    @List NVARCHAR(MAX),

    @Delimiter NVARCHAR(255)

    )

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN (

    SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)')

    FROM (SELECT x = CONVERT(XML,

    '<i><![CDATA['

    + REPLACE(@List, @Delimiter, ']]></i><i><![CDATA[')

    + ']]></i>'

    ).query('.')

    ) AS a CROSS APPLY x.nodes('i') AS y(i)

    )

    ;

    Testing it

    SELECT *

    FROM dbo.SplitStrings_XML('"T-SQL",&,<XML>', N',')

    ;

    Output

    Item

    --------

    "T-SQL"

    &

    <XML>

  • The XML construct within that function is sub-optimal, 23 operators in the execution plan and there of 3 XML Reader Table valued functions.

    😎

    This can be significantly improved by changing the XML construct, here is a modified version which reduces the construct to a single constant scan operator and the execution plan down to 12 operators.

    CREATE FUNCTION dbo.SplitStrings_XML_EE

    (

    @List NVARCHAR(MAX),

    @Delimiter NVARCHAR(255)

    )

    RETURNS TABLE WITH SCHEMABINDING

    AS

    RETURN

    (

    SELECT

    X.DATA.value('(./text())[1]', 'nvarchar(4000)') AS Item

    FROM

    (

    SELECT

    CONVERT(XML,X.TXML,1) AS XXML

    FROM (

    SELECT

    '<i><![CDATA['

    + REPLACE(@List, @Delimiter, ']]></i><i><![CDATA[')

    + ']]></i>'

    ) AS X(TXML)

    ) XOUT

    CROSS APPLY XOUT.XXML.nodes('/i') X(DATA)

    )

    ;

    Edit: Testing showed that the modified version was three times slower regardless of the execution plan simplification.

    😎

  • Eirikur Eiriksson (12/18/2016)


    Hi Jeff,

    in this case you will have to use CDATA (Character Data) elements in the construction of the XML.

    😎

    You just can't believe all the examples I've been through in the last 24 hours never mind the ones I've seen over the last decade or so. I've NEVER seen mention of "CDATA" anywhere before. Thanks Eirikur.

    That does bring up another question, though. I'm nearly a total idiot when it comes to XML usage in SQL Server especially when it comes to playing it against a whole column XML in a table. I have to bite the bloody bullet and learn it, especially how it should be/can be used in SQL Server.

    I know that it "only" takes studying it and practicing with it but, as with my early days using T-SQL, I've not actually found a good book or internet article on the subject or at least not one that I like very much. I absolutely do understand hierarchies in XML and elements vs entities but I don't know how to interrogate them in T-SQL. I can read and understand examples that others have written but I'lll be damned if I can write them from scratch myself never mind be creative or innovative.

    I've seen it before... you seem to handle XML in SQL almost by second nature. Can you make a recommendation as to some good places/books to study XML in T-SQL that actually do explain things like why you need a "name space" line in some cases and not others or even what it actually is and does. I need something that starts out with the basics like when and how to use "@" instead of something else (I know it works with entities rather than elements but don't know how to use such things when shredding an entire column or even a single blob).

    And, much like T-SQL, I'd eventually like to get to where you're at with XML where someone asks a question and you go "sure... no problem... all you need to do is add this to your code and you're done".

    And very well done on that. Thanks again for the timely and accurate help, Eirikur.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (12/18/2016)


    You just can't believe all the examples I've been through in the last 24 hours never mind the ones I've seen over the last decade or so. I've NEVER seen mention of "CDATA" anywhere before. Thanks Eirikur.

    You are very welcome Jeff and I'm glad I could help. I don't find this hard to believe, unfortunately there are many myths to be debunked when it comes to the subject of XML in SQL Server and the majority of examples out there come from few sources with relatively little value added content, kind of "if it works then it's good enough".

    😎

    That does bring up another question, though. I'm nearly a total idiot when it comes to XML usage in SQL Server especially when it comes to playing it against a whole column XML in a table. I have to bite the bloody bullet and learn it, especially how it should be/can be used in SQL Server.

    As far as I've seen from your input here on SqlServerCentral, you are above the average which (and don't take any offence here) is apallingly low. There are only about a handful of regulars here which have a good understanding of XML/XQuery, something that is apparent when looking at the responders on the XML thread on the forum.

    I've seen it before... you seem to handle XML in SQL almost by second nature. Can you make a recommendation as to some good places/books to study XML in T-SQL that actually do explain things like why you need a "name space" line in some cases and not others or even what it actually is and does. I need something that starts out with the basics like when and how to use "@" instead of something else (I know it works with entities rather than elements but don't know how to use such things when shredding an entire column or even a single blob).

    And, much like T-SQL, I'd eventually like to get to where you're at with XML where someone asks a question and you go "sure... no problem... all you need to do is add this to your code and you're done".

    And very well done on that. Thanks again for the timely and accurate help, Eirikur.

    Again Jeff, very happy to help, just as you have done countless of times. I've put down few good resources, unfortunately I don't have many, probably time for me to finish some of the articles and the book I've started on the subject, the latter being a subject to finding the right publisher and of course the time;-).

    Resources:

    XQuery Implementation in a Relational Database System

    http://courses.washington.edu/info445/docs/old08/p1175-pal.pdf

    This whitepaper is an essential read as it builds the foundation knowledge of XML XQuery implementation on the SQL Server platform.

    XML Standards Library

    http://schemas.liquid-technologies.com/

    Very good reference to various XML standards

    Pro SQL Server 2008 XML

    https://www.amazon.co.uk/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=Pro+SQL+Server+2008+XML

    One of the very few books on the subject, it is fine but cannot class it as essential.

    O'Reilly's XML.com

    http://www.xml.com

    Many good readings on the subject.

    Dr Dobb's XML & Relational Databases

    http://www.drdobbs.com/database/xml-relational-databases/184405339

    Good article on XML in relational databases.

    Stairway to XML

    http://www.sqlservercentral.com/stairway/92778/

    Collection of 10 good articles on XML in SQL Server, Rob Sheldon does a very good job here but the only downside I see is the absence of the performance perspective.

    XQuery Tutorial

    http://www.w3schools.com/xml/xquery_intro.asp

    Quite extensive collection of XQuery tutorials, most of the stuff here will work fine on SQL Server 2005 and later.

  • Very cool feedback, Eirikur. Thank you for taking the time to post those resources.

    Shifting gears, I wish there were something I could do to help push you into writing articles and finishing the book you started on the subject of XML and SQL Server. If nothing else, I'd be happy to proof read and test your code, provided that I don't have to load the latest and greatest in SQL Server to run it. Work keeps me hopping and I don't have the time to live on the bleeding edge. 😀

    Shifting gears a bit, I can help sweeten the pot with, perhaps, a joint article to determine which column statistics are actually being used by queries so that the rest can be deleted. It's not something new but it isn't a widespread bit of knowledge and it requires some XML interrogation. I'm not sure that the shredder that someone else wrote for this is all that it can be when it comes to performance although it's not absolutely horrible.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • To fix the problem with XML special characters in your splitter you can use "select @List for xml path('')" to create entities for your invalid characters.

    The entities will be converted back to the original characters in the value() function.

    CREATE FUNCTION dbo.SplitStrings_XML

    (

    @List NVARCHAR(MAX),

    @Delimiter NVARCHAR(255)

    )

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN (

    SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)')

    FROM (SELECT x = CONVERT(XML,

    '<i>'

    + REPLACE((SELECT @List FOR XML PATH('')), @Delimiter, '</i><i>')

    + '</i>'

    ).query('.')

    ) AS a CROSS APPLY x.nodes('i') AS y(i)

    )

    ;

    Using CDATA to fix this introduces another possible error and that is if the value contains the sequence of characters ]]>. That specific sequence is invalid in a CDATA since it is used to recognize the end of a CDATA section.

    The reason why the rewrite of Eirikur was slower is blogged about by Paul White in Compute Scalars, Expressions and Execution Plan Performance. The cast to XML is deferred until it is actually needed so SQL Server will perform the convert (parse and validate) once for each call to nodes() and values() function. Using .query('.') on result of the cast is a workaround to prevent the deferred compute scalar.

    While I'm at it I might as well throw in a version of a XML splitter of my own. Please feel free to try it against your test data and use it in your blog post if you want to.

    Instead of creating an XML element with one value in each element I instead insert an empty element as a replacement of the delimiter like this aa<i/>bb<i/>cc.

    That XML has three text() elements at the root level and I use nodes() to shred out those text() elements. As a fun twist I also do something that everyone says never to do ever. I put it in a Multi-Statement Table Function. That way you can work with an XML variable instead of a casted string. It is in my testing a bit faster. The drawback is of course the need to invoke the function for each row instead of having the splitting inline.

    create function dbo.SplitStrings_XML

    (

    @List nvarchar(max),

    @Delimiter nvarchar(255)

    )

    returns @returntable table

    (

    Item nvarchar(4000)

    ) with schemabinding

    as

    begin

    declare @X xml = replace((select @List for xml path('')), @Delimiter, '<i/>');

    insert @returntable(Item)

    select T.X.value('.', 'nvarchar(4000)')

    from @X.nodes('text()') as T(X)

    return

    end

    You can of course use the root text() node splitting in an inline TVF as well. I found it to be slightly faster than your version but not as fast as the multi line TVF.

    CREATE FUNCTION dbo.SplitStrings_XML

    (

    @List NVARCHAR(MAX),

    @Delimiter NVARCHAR(255)

    )

    RETURNS TABLE WITH SCHEMABINDING AS

    RETURN (

    SELECT Item = y.i.value('.', 'nvarchar(4000)')

    FROM (SELECT x = CONVERT(XML, REPLACE((select @List for xml path('')), @Delimiter, '<i/>')

    ).query('.')

    ) AS a CROSS APPLY x.nodes('text()') AS y(i)

    )

    ;

  • Thanks, Mikael. I appreciate the great information and the caveat alert on CDATA.

    Even after two decades in this community, I continue to be amazed at the ancillary information that people are willing to share and demonstrate. Thank you both for the tips and the education.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Eirikur Eiriksson (12/18/2016)


    Jeff Moden (12/18/2016)


    You just can't believe all the examples I've been through in the last 24 hours never mind the ones I've seen over the last decade or so. I've NEVER seen mention of "CDATA" anywhere before. Thanks Eirikur.

    You are very welcome Jeff and I'm glad I could help. I don't find this hard to believe, unfortunately there are many myths to be debunked when it comes to the subject of XML in SQL Server and the majority of examples out there come from few sources with relatively little value added content, kind of "if it works then it's good enough".

    😎

    That does bring up another question, though. I'm nearly a total idiot when it comes to XML usage in SQL Server especially when it comes to playing it against a whole column XML in a table. I have to bite the bloody bullet and learn it, especially how it should be/can be used in SQL Server.

    As far as I've seen from your input here on SqlServerCentral, you are above the average which (and don't take any offence here) is apallingly low. There are only about a handful of regulars here which have a good understanding of XML/XQuery, something that is apparent when looking at the responders on the XML thread on the forum.

    I've seen it before... you seem to handle XML in SQL almost by second nature. Can you make a recommendation as to some good places/books to study XML in T-SQL that actually do explain things like why you need a "name space" line in some cases and not others or even what it actually is and does. I need something that starts out with the basics like when and how to use "@" instead of something else (I know it works with entities rather than elements but don't know how to use such things when shredding an entire column or even a single blob).

    And, much like T-SQL, I'd eventually like to get to where you're at with XML where someone asks a question and you go "sure... no problem... all you need to do is add this to your code and you're done".

    And very well done on that. Thanks again for the timely and accurate help, Eirikur.

    Again Jeff, very happy to help, just as you have done countless of times. I've put down few good resources, unfortunately I don't have many, probably time for me to finish some of the articles and the book I've started on the subject, the latter being a subject to finding the right publisher and of course the time;-).

    Resources:

    XQuery Implementation in a Relational Database System

    http://courses.washington.edu/info445/docs/old08/p1175-pal.pdf

    This whitepaper is an essential read as it builds the foundation knowledge of XML XQuery implementation on the SQL Server platform.

    XML Standards Library

    http://schemas.liquid-technologies.com/

    Very good reference to various XML standards

    Pro SQL Server 2008 XML

    https://www.amazon.co.uk/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=Pro+SQL+Server+2008+XML

    One of the very few books on the subject, it is fine but cannot class it as essential.

    O'Reilly's XML.com

    http://www.xml.com

    Many good readings on the subject.

    Dr Dobb's XML & Relational Databases

    http://www.drdobbs.com/database/xml-relational-databases/184405339

    Good article on XML in relational databases.

    Stairway to XML

    http://www.sqlservercentral.com/stairway/92778/

    Collection of 10 good articles on XML in SQL Server, Rob Sheldon does a very good job here but the only downside I see is the absence of the performance perspective.

    XQuery Tutorial

    http://www.w3schools.com/xml/xquery_intro.asp

    Quite extensive collection of XQuery tutorials, most of the stuff here will work fine on SQL Server 2005 and later.

    Didn't verify the links, but I made them easier to use.

    Resources:

    XQuery Implementation in a Relational Database System

    http://courses.washington.edu/info445/docs/old08/p1175-pal.pdf

    This whitepaper is an essential read as it builds the foundation knowledge of XML XQuery implementation on the SQL Server platform.

    XML Standards Library

    http://schemas.liquid-technologies.com/

    Very good reference to various XML standards

    Pro SQL Server 2008 XML

    https://www.amazon.co.uk/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=Pro+SQL+Server+2008+XML

    One of the very few books on the subject, it is fine but cannot class it as essential.

    O'Reilly's XML.com

    http://www.xml.com

    Many good readings on the subject.

    Dr Dobb's XML & Relational Databases

    http://www.drdobbs.com/database/xml-relational-databases/184405339

    Good article on XML in relational databases.

    Stairway to XML

    http://www.sqlservercentral.com/stairway/92778/

    Collection of 10 good articles on XML in SQL Server, Rob Sheldon does a very good job here but the only downside I see is the absence of the performance perspective.

    XQuery Tutorial

    http://www.w3schools.com/xml/xquery_intro.asp

  • Thanks Lynn

    😎

  • Eirikur Eiriksson (12/19/2016)


    Thanks Lynn

    😎

    You are welcome.

    I am in the process of trying to learn how to shred XML for importing data into our system. It is getting interesting as I learn additional requirements and potential issues resulting I from poor database design choices by non-database developers who think of databases as simple data stores and decisions to use table short names in the XML that result in table short names may also be a column name in the table. Lots of fun when XML may as well be Greek to me.

  • Lynn, feel free to send me any examples and problems, I'm happy to help

    😎

  • @Mikael,

    Perhaps I'm doing something wrong but the mTVF version doesn't return elements that result in empty strings..

    [font="Arial Black"]Setup:[/font]

    --===== Conditionally drop the test table to make reruns easier.

    IF OBJECT_ID('tempdb.dbo.TestData','U') IS NOT NULL

    DROP TABLE tempdb.dbo.TestData

    ;

    GO

    --===== Create and populate the test table on-the-fly

    SELECT d.String, d.TestName

    INTO dbo.TestData

    FROM

    (

    SELECT N'' ,N'Empty String' UNION ALL

    SELECT N' ' ,N'Blank Element' UNION ALL

    SELECT NULL ,N'Null Element' UNION ALL

    SELECT N'Single Element' ,N'Single Element' UNION ALL

    SELECT N',Leading/Trailing,' ,N'Leading/Trailing Delimiter #1' UNION ALL

    SELECT N',Leading,Trailing,' ,N'Leading/Trailing Delimiter #2' UNION ALL

    SELECT N'Adjacent,,Delimiters' ,N'Adjacent Delimiters #1' UNION ALL

    SELECT N'Adjacent,,,Delimiters',N'Adjacent Delimiters #2' UNION ALL

    SELECT N',' ,N'Delimiters Only #1' UNION ALL

    SELECT N',,' ,N'Delimiters Only #2' UNION ALL

    SELECT 'Mr. & Mrs.' ,N'Special Character' UNION ALL

    SELECT N'"Embedded, Delimiter"',N'Embedded Delimiter'

    )d (String,TestName)

    ;

    GO

    --===== Create the function

    create function dbo.SplitStrings_XMLmTVF

    (

    @List nvarchar(max),

    @Delimiter nvarchar(255)

    )

    returns @returntable table

    (

    Item nvarchar(4000)

    ) with schemabinding

    as

    begin

    declare @X xml = replace((select @List for xml path('')), @Delimiter, '<i/>');

    insert @returntable(Item)

    select T.X.value('.', 'nvarchar(4000)')

    from @X.nodes('text()') as T(X)

    return

    end

    GO

    [font="Arial Black"]Run the test:[/font]

    --===== Test the function. It doesn't return empty string elements.

    SELECT d.*,s.*

    FROM dbo.TestData d

    OUTER APPLY dbo.SplitStrings_XMLmTVF (d.String,N',') s

    ;

    GO

    Results... any element that results in an empty string isn't returned.

    String TestName Item

    --------------------- ----------------------------- ----------------

    Empty String NULL

    Blank Element

    NULL Null Element NULL

    Single Element Single Element Single Element

    ,Leading/Trailing, Leading/Trailing Delimiter #1 Leading/Trailing

    ,Leading,Trailing, Leading/Trailing Delimiter #2 Leading

    ,Leading,Trailing, Leading/Trailing Delimiter #2 Trailing

    Adjacent,,Delimiters Adjacent Delimiters #1 Adjacent

    Adjacent,,Delimiters Adjacent Delimiters #1 Delimiters

    Adjacent,,,Delimiters Adjacent Delimiters #2 Adjacent

    Adjacent,,,Delimiters Adjacent Delimiters #2 Delimiters

    , Delimiters Only #1 NULL

    ,, Delimiters Only #2 NULL

    Mr. & Mrs. Special Character Mr. & Mrs.

    "Embedded, Delimiter" Embedded Delimiter "Embedded

    "Embedded, Delimiter" Embedded Delimiter Delimiter"

    (16 row(s) affected)

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff!

    Sorry for the late reply. I guess holidays came along and this thread fell out of focus.

    Your observation are correct about the behaviour of splitting on text(). It will not pickup empty strings.

    You can also forget about what I said about MTVF being faster. It is not and I tested it incorrectly.

    And the explanation of what I got caught in is described here by Paul White.

    http://dba.stackexchange.com/questions/159333/does-sql-server-cache-the-result-of-a-multi-statement-table-valued-function

    I do however have another XML string splitter (fast of course :-)) that you could have a look at if you like. It has one big drawback and that is that the data may not contain the character sequence ?> but seems to handle your test data properly.

    create function [dbo].[SplitStrings_XML_PI]

    (

    @List nvarchar(max),

    @Delimiter nvarchar(255)

    )

    returns table with schemabinding

    as

    return

    (

    select T2.X.value(N'.', N'nvarchar(4000)') as Item

    from (select cast(N'<?X '+replace(@List, @Delimiter, N'?><?X ') + N'?>' as xml).query(N'.')) as T1(X)

    cross apply T1.X.nodes(N'/processing-instruction("X")') as T2(X)

    );

  • Hi Jeff -

    I went into hibernation for a bit, only to come out and find you dabbling in XML..... Small wonders never cease I guess 🙂

    I guess without spilling the beans too much - what was the purpose for trying to use the split string functions against the raw XML? I am just curious if were was a better way to get at what you're looking for.

    Are you trying to infer the structure perhaps? I might have some tidbits to do so if that's the case.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

Viewing 15 posts - 1 through 15 (of 19 total)

You must be logged in to reply to this topic. Login to reply