SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Help with XML Splitter De-entitization, please.


Help with XML Splitter De-entitization, please.

Author
Message
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)

Group: General Forum Members
Points: 542721 Visits: 44633
I'm writing an article on performance testing and one of the functions I'm testing is an XML CSV splitter. Here's the code from that splitter.

CREATE FUNCTION dbo.SplitStrings_XML
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN (
SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)')
FROM (SELECT x = CONVERT(XML,
'<i>'
+ REPLACE(@List, @Delimiter, '</i><i>')
+ '</i>'
).query('.')
) AS a CROSS APPLY x.nodes('i') AS y(i)
)
;



The problem that I'm having with it is that it bombs when you pass it a string with XML reserved characters in it as in the following...

SELECT *
FROM dbo.SplitStrings_XML('"T-SQL",&,<XML>', N',')
;


... which returns the following error...
Msg 9421, Level 16, State 1, Line 1
XML parsing: line 1, character 19, illegal name character


I'm terrible at XML and I've done several searches to find out how to de-entitize the special characters with no joy. How should the function be modified so that it can return XML special characters? What I'd expect for the example string I gave would be a table result that looks like this...

"T-SQL"
&
<XML>


The function works fine if no XML special characters are used.

SELECT *
FROM dbo.SplitStrings_XML('T-SQL,M,XML', N',')
;




Thanks for any help, folks. I sure do need it when it comes to XML.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Eirikur Eiriksson
Eirikur Eiriksson
SSC Guru
SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)

Group: General Forum Members
Points: 100648 Visits: 20921
Hi Jeff,
in this case you will have to use CDATA (Character Data) elements in the construction of the XML.
Cool

Here is the function with the necessary modification to handle any type of input.
CREATE FUNCTION dbo.SplitStrings_XML
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN (
SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)')
FROM (SELECT x = CONVERT(XML,
'<i><![CDATA['
+ REPLACE(@List, @Delimiter, ']]></i><i><![CDATA[')
+ ']]></i>'
).query('.')
) AS a CROSS APPLY x.nodes('i') AS y(i)
)
;



Testing it
SELECT *
FROM dbo.SplitStrings_XML('"T-SQL",&,<XML>', N',')
;



Output
Item
--------
"T-SQL"
&
<XML>

Eirikur Eiriksson
Eirikur Eiriksson
SSC Guru
SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)

Group: General Forum Members
Points: 100648 Visits: 20921
The XML construct within that function is sub-optimal, 23 operators in the execution plan and there of 3 XML Reader Table valued functions.
Cool
This can be significantly improved by changing the XML construct, here is a modified version which reduces the construct to a single constant scan operator and the execution plan down to 12 operators.

CREATE FUNCTION dbo.SplitStrings_XML_EE
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE WITH SCHEMABINDING
AS
RETURN
(

SELECT
X.DATA.value('(./text())[1]', 'nvarchar(4000)') AS Item
FROM
(
SELECT
CONVERT(XML,X.TXML,1) AS XXML
FROM (
SELECT
'<i><![CDATA['
+ REPLACE(@List, @Delimiter, ']]></i><i><![CDATA[')
+ ']]></i>'
) AS X(TXML)
) XOUT
CROSS APPLY XOUT.XXML.nodes('/i') X(DATA)
)
;



Edit: Testing showed that the modified version was three times slower regardless of the execution plan simplification.
Cool
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)

Group: General Forum Members
Points: 542721 Visits: 44633
Eirikur Eiriksson (12/18/2016)
Hi Jeff,
in this case you will have to use CDATA (Character Data) elements in the construction of the XML.
Cool


You just can't believe all the examples I've been through in the last 24 hours never mind the ones I've seen over the last decade or so. I've NEVER seen mention of "CDATA" anywhere before. Thanks Eirikur.

That does bring up another question, though. I'm nearly a total idiot when it comes to XML usage in SQL Server especially when it comes to playing it against a whole column XML in a table. I have to bite the bloody bullet and learn it, especially how it should be/can be used in SQL Server.

I know that it "only" takes studying it and practicing with it but, as with my early days using T-SQL, I've not actually found a good book or internet article on the subject or at least not one that I like very much. I absolutely do understand hierarchies in XML and elements vs entities but I don't know how to interrogate them in T-SQL. I can read and understand examples that others have written but I'lll be damned if I can write them from scratch myself never mind be creative or innovative.

I've seen it before... you seem to handle XML in SQL almost by second nature. Can you make a recommendation as to some good places/books to study XML in T-SQL that actually do explain things like why you need a "name space" line in some cases and not others or even what it actually is and does. I need something that starts out with the basics like when and how to use "@" instead of something else (I know it works with entities rather than elements but don't know how to use such things when shredding an entire column or even a single blob).

And, much like T-SQL, I'd eventually like to get to where you're at with XML where someone asks a question and you go "sure... no problem... all you need to do is add this to your code and you're done".

And very well done on that. Thanks again for the timely and accurate help, Eirikur.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Eirikur Eiriksson
Eirikur Eiriksson
SSC Guru
SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)

Group: General Forum Members
Points: 100648 Visits: 20921
Jeff Moden (12/18/2016)


You just can't believe all the examples I've been through in the last 24 hours never mind the ones I've seen over the last decade or so. I've NEVER seen mention of "CDATA" anywhere before. Thanks Eirikur.

You are very welcome Jeff and I'm glad I could help. I don't find this hard to believe, unfortunately there are many myths to be debunked when it comes to the subject of XML in SQL Server and the majority of examples out there come from few sources with relatively little value added content, kind of "if it works then it's good enough".
Cool

That does bring up another question, though. I'm nearly a total idiot when it comes to XML usage in SQL Server especially when it comes to playing it against a whole column XML in a table. I have to bite the bloody bullet and learn it, especially how it should be/can be used in SQL Server.

As far as I've seen from your input here on SqlServerCentral, you are above the average which (and don't take any offence here) is apallingly low. There are only about a handful of regulars here which have a good understanding of XML/XQuery, something that is apparent when looking at the responders on the XML thread on the forum.

I've seen it before... you seem to handle XML in SQL almost by second nature. Can you make a recommendation as to some good places/books to study XML in T-SQL that actually do explain things like why you need a "name space" line in some cases and not others or even what it actually is and does. I need something that starts out with the basics like when and how to use "@" instead of something else (I know it works with entities rather than elements but don't know how to use such things when shredding an entire column or even a single blob).

And, much like T-SQL, I'd eventually like to get to where you're at with XML where someone asks a question and you go "sure... no problem... all you need to do is add this to your code and you're done".

And very well done on that. Thanks again for the timely and accurate help, Eirikur.

Again Jeff, very happy to help, just as you have done countless of times. I've put down few good resources, unfortunately I don't have many, probably time for me to finish some of the articles and the book I've started on the subject, the latter being a subject to finding the right publisher and of course the time;-).

Resources:
XQuery Implementation in a Relational Database System
http://courses.washington.edu/info445/docs/old08/p1175-pal.pdf
This whitepaper is an essential read as it builds the foundation knowledge of XML XQuery implementation on the SQL Server platform.

XML Standards Library
http://schemas.liquid-technologies.com/
Very good reference to various XML standards

Pro SQL Server 2008 XML
https://www.amazon.co.uk/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=Pro+SQL+Server+2008+XML
One of the very few books on the subject, it is fine but cannot class it as essential.

O'Reilly's XML.com
http://www.xml.com
Many good readings on the subject.

Dr Dobb's XML & Relational Databases
http://www.drdobbs.com/database/xml-relational-databases/184405339
Good article on XML in relational databases.

Stairway to XML
http://www.sqlservercentral.com/stairway/92778/
Collection of 10 good articles on XML in SQL Server, Rob Sheldon does a very good job here but the only downside I see is the absence of the performance perspective.

XQuery Tutorial
http://www.w3schools.com/xml/xquery_intro.asp
Quite extensive collection of XQuery tutorials, most of the stuff here will work fine on SQL Server 2005 and later.
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)

Group: General Forum Members
Points: 542721 Visits: 44633
Very cool feedback, Eirikur. Thank you for taking the time to post those resources.

Shifting gears, I wish there were something I could do to help push you into writing articles and finishing the book you started on the subject of XML and SQL Server. If nothing else, I'd be happy to proof read and test your code, provided that I don't have to load the latest and greatest in SQL Server to run it. Work keeps me hopping and I don't have the time to live on the bleeding edge. :-D

Shifting gears a bit, I can help sweeten the pot with, perhaps, a joint article to determine which column statistics are actually being used by queries so that the rest can be deleted. It's not something new but it isn't a widespread bit of knowledge and it requires some XML interrogation. I'm not sure that the shredder that someone else wrote for this is all that it can be when it comes to performance although it's not absolutely horrible.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Mikael Eriksson SE
Mikael Eriksson SE
Ten Centuries
Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)Ten Centuries (1.2K reputation)

Group: General Forum Members
Points: 1232 Visits: 1011
To fix the problem with XML special characters in your splitter you can use "select @List for xml path('')" to create entities for your invalid characters.
The entities will be converted back to the original characters in the value() function.

CREATE FUNCTION dbo.SplitStrings_XML
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN (
SELECT Item = y.i.value('(./text())[1]', 'nvarchar(4000)')
FROM (SELECT x = CONVERT(XML,
'<i>'
+ REPLACE((SELECT @List FOR XML PATH('')), @Delimiter, '</i><i>')
+ '</i>'
).query('.')
) AS a CROSS APPLY x.nodes('i') AS y(i)
)
;



Using CDATA to fix this introduces another possible error and that is if the value contains the sequence of characters ]]>. That specific sequence is invalid in a CDATA since it is used to recognize the end of a CDATA section.

The reason why the rewrite of Eirikur was slower is blogged about by Paul White in Compute Scalars, Expressions and Execution Plan Performance. The cast to XML is deferred until it is actually needed so SQL Server will perform the convert (parse and validate) once for each call to nodes() and values() function. Using .query('.') on result of the cast is a workaround to prevent the deferred compute scalar.

While I'm at it I might as well throw in a version of a XML splitter of my own. Please feel free to try it against your test data and use it in your blog post if you want to.
Instead of creating an XML element with one value in each element I instead insert an empty element as a replacement of the delimiter like this aa<i/>bb<i/>cc.
That XML has three text() elements at the root level and I use nodes() to shred out those text() elements. As a fun twist I also do something that everyone says never to do ever. I put it in a Multi-Statement Table Function. That way you can work with an XML variable instead of a casted string. It is in my testing a bit faster. The drawback is of course the need to invoke the function for each row instead of having the splitting inline.

create function dbo.SplitStrings_XML
(
@List nvarchar(max),
@Delimiter nvarchar(255)
)
returns @returntable table
(
Item nvarchar(4000)
) with schemabinding
as
begin
declare @X xml = replace((select @List for xml path('')), @Delimiter, '<i/>');

insert @returntable(Item)
select T.X.value('.', 'nvarchar(4000)')
from @X.nodes('text()') as T(X)
return
end




You can of course use the root text() node splitting in an inline TVF as well. I found it to be slightly faster than your version but not as fast as the multi line TVF.

CREATE FUNCTION dbo.SplitStrings_XML
(
@List NVARCHAR(MAX),
@Delimiter NVARCHAR(255)
)
RETURNS TABLE WITH SCHEMABINDING AS
RETURN (
SELECT Item = y.i.value('.', 'nvarchar(4000)')
FROM (SELECT x = CONVERT(XML, REPLACE((select @List for xml path('')), @Delimiter, '<i/>')
).query('.')
) AS a CROSS APPLY x.nodes('text()') AS y(i)
)
;





Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)SSC Guru (542K reputation)

Group: General Forum Members
Points: 542721 Visits: 44633
Thanks, Mikael. I appreciate the great information and the caveat alert on CDATA.

Even after two decades in this community, I continue to be amazed at the ancillary information that people are willing to share and demonstrate. Thank you both for the tips and the education.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Lynn Pettis
Lynn Pettis
SSC Guru
SSC Guru (239K reputation)SSC Guru (239K reputation)SSC Guru (239K reputation)SSC Guru (239K reputation)SSC Guru (239K reputation)SSC Guru (239K reputation)SSC Guru (239K reputation)SSC Guru (239K reputation)

Group: General Forum Members
Points: 239588 Visits: 40664
Eirikur Eiriksson (12/18/2016)
Jeff Moden (12/18/2016)


You just can't believe all the examples I've been through in the last 24 hours never mind the ones I've seen over the last decade or so. I've NEVER seen mention of "CDATA" anywhere before. Thanks Eirikur.

You are very welcome Jeff and I'm glad I could help. I don't find this hard to believe, unfortunately there are many myths to be debunked when it comes to the subject of XML in SQL Server and the majority of examples out there come from few sources with relatively little value added content, kind of "if it works then it's good enough".
Cool

That does bring up another question, though. I'm nearly a total idiot when it comes to XML usage in SQL Server especially when it comes to playing it against a whole column XML in a table. I have to bite the bloody bullet and learn it, especially how it should be/can be used in SQL Server.

As far as I've seen from your input here on SqlServerCentral, you are above the average which (and don't take any offence here) is apallingly low. There are only about a handful of regulars here which have a good understanding of XML/XQuery, something that is apparent when looking at the responders on the XML thread on the forum.

I've seen it before... you seem to handle XML in SQL almost by second nature. Can you make a recommendation as to some good places/books to study XML in T-SQL that actually do explain things like why you need a "name space" line in some cases and not others or even what it actually is and does. I need something that starts out with the basics like when and how to use "@" instead of something else (I know it works with entities rather than elements but don't know how to use such things when shredding an entire column or even a single blob).

And, much like T-SQL, I'd eventually like to get to where you're at with XML where someone asks a question and you go "sure... no problem... all you need to do is add this to your code and you're done".

And very well done on that. Thanks again for the timely and accurate help, Eirikur.

Again Jeff, very happy to help, just as you have done countless of times. I've put down few good resources, unfortunately I don't have many, probably time for me to finish some of the articles and the book I've started on the subject, the latter being a subject to finding the right publisher and of course the time;-).

Resources:
XQuery Implementation in a Relational Database System
http://courses.washington.edu/info445/docs/old08/p1175-pal.pdf
This whitepaper is an essential read as it builds the foundation knowledge of XML XQuery implementation on the SQL Server platform.

XML Standards Library
http://schemas.liquid-technologies.com/
Very good reference to various XML standards

Pro SQL Server 2008 XML
https://www.amazon.co.uk/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=Pro+SQL+Server+2008+XML
One of the very few books on the subject, it is fine but cannot class it as essential.

O'Reilly's XML.com
http://www.xml.com
Many good readings on the subject.

Dr Dobb's XML & Relational Databases
http://www.drdobbs.com/database/xml-relational-databases/184405339
Good article on XML in relational databases.

Stairway to XML
http://www.sqlservercentral.com/stairway/92778/
Collection of 10 good articles on XML in SQL Server, Rob Sheldon does a very good job here but the only downside I see is the absence of the performance perspective.

XQuery Tutorial
http://www.w3schools.com/xml/xquery_intro.asp
Quite extensive collection of XQuery tutorials, most of the stuff here will work fine on SQL Server 2005 and later.


Didn't verify the links, but I made them easier to use.

Resources:
XQuery Implementation in a Relational Database System
http://courses.washington.edu/info445/docs/old08/p1175-pal.pdf
This whitepaper is an essential read as it builds the foundation knowledge of XML XQuery implementation on the SQL Server platform.

XML Standards Library
http://schemas.liquid-technologies.com/
Very good reference to various XML standards

Pro SQL Server 2008 XML
https://www.amazon.co.uk/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=Pro+SQL+Server+2008+XML
One of the very few books on the subject, it is fine but cannot class it as essential.

O'Reilly's XML.com
http://www.xml.com
Many good readings on the subject.

Dr Dobb's XML & Relational Databases
http://www.drdobbs.com/database/xml-relational-databases/184405339
Good article on XML in relational databases.

Stairway to XML
http://www.sqlservercentral.com/stairway/92778/
Collection of 10 good articles on XML in SQL Server, Rob Sheldon does a very good job here but the only downside I see is the absence of the performance perspective.

XQuery Tutorial
http://www.w3schools.com/xml/xquery_intro.asp

Cool
Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
Eirikur Eiriksson
Eirikur Eiriksson
SSC Guru
SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)SSC Guru (100K reputation)

Group: General Forum Members
Points: 100648 Visits: 20921
Thanks Lynn
Cool
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum








































































































































































SQLServerCentral


Search