Extract the titles from XML

  • Mikael Eriksson SE (7/13/2015)


    I guess this is the preferred form.

    SELECT

    col.value('(Title/text())[1]', 'varchar(50)') AS 'Book'

    FROM @x.nodes('/root/Books/Book') a(col)

    Using text() in the values clause removes one call to a table valued function and the UDX operator that is there to take care of mixed content XML.

    This is faster still if you only want the Title node.

    SELECT

    col.value('.', 'varchar(50)') AS 'Book'

    FROM @x.nodes('/root/Books/Book/Title/text()') a(col)

    There is no difference in query plan or performance between using //Book .//Book or /root/Books/Book in the nodes function.

    Thanks. In support of that, here is a little setup to show it.

    DECLARE @x XML =

    N'<root>

    <Books>

    <Book>

    <Title>Personal</Title>

    <Author>Lee Child</Author>

    <Publisher Pubdate="20150428">Dell</Publisher>

    </Book>

    <Book>

    <Title>Never Go Back</Title>

    <Author>Lee Child</Author>

    <Publisher Pubdate="20120312">Dell</Publisher>

    </Book>

    <Book>

    <Title>Ready Player One</Title>

    <Author>Ernest Cline</Author>

    <Publisher Pubdate="20110416">Broadway</Publisher>

    </Book>

    <Book>

    <Title>The Martian</Title>

    <Author>Andy Weir</Author>

    <Publisher Pubdate="20140211">Broadway</Publisher>

    </Book>

    </Books>

    </root>'

    SET STATISTICS TIME ON;

    --cost 135.66

    SELECT col.value('data(Title[1])', 'varchar(50)') AS 'Book'

    FROM @x.nodes('root/Books/Book') a ( col );

    PRINT '///////////////////////////////////////////////';

    --cost 2.23

    SELECT col.value('.', 'varchar(50)') AS 'Book'

    FROM @x.nodes('/root/Books/Book/Title/text()') a ( col );

    PRINT '///////////////////////////////////////////////';

    --cost 2.23

    SELECT col.value('.', 'varchar(50)') AS 'Book'

    FROM @x.nodes('/.//Book/Title/text()') a ( col );

    SET STATISTICS TIME OFF;

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • SQLRNNR (7/13/2015)


    TomThomson (7/13/2015)


    Jeff Moden (7/13/2015)


    The correct answer isn't actually available. The correct answer should be to "hunt down the moroff that used XML to transmit purely flat data and introduce him to high velocity pork chops".

    As it is and without the leading spaces, this file contains 487 characters. If it were converted to a columnar flat file without headers, it would reduce to just 167 characters to be roughly only 1/3rd the size. Adding a single row of headers wouldn't do much to change that.

    People crank on about how slow I/O is. You won't find hardware that loads your data in a third of the time so make sure that it's mostly data you're sending or receiving instead of the tag bloat in XML especially for such flat data.

    +100. Or maybe I mean + a googol.

    The only reason to use XML is that one has some data whose structure can't reasonably and efficiently be represented as a table in an RDBMS but can be represented efficiently in XML. When the XML representation of the data is a tree with data only at the leaf nodes and every leaf node has the same set of attributes it is clearly trivial, not just reasonable and efficient, to represent the data in the leaf nodes as rows in a relational table whose column names match teh attribute names for thos leaf nodes. It is also extremely clear that any XML representation of that data is grossly inefficient both in storage space and in processing cost. So the data used in this QotD is a perfect exanple of data for which XML representation shouldn't even be considered, let alone used.

    Nevertheless, it is a nice easy question. Very pleasant to get two whole points for remembering "start at the ground and work down" (or "start at the ground and work up").

    edit: I see that more than half of the first seven comments are from people who have not fallen for the XML hype. I wonder if that'll stay above 50% as more comments come in.

    Not a fan here either.

    XML was not meant to be used as such but as a base for proprietary derivatives such as BIML. It is not surprising to see how fast was XML-based SOAP replaced by JSON-based Rich Internet Apps. (And now JSON pops up in 2016.)

    However, XML is here and from time to time it is useful.

    Steve, thanks for the question!

  • I tend to agree with Jeff, and others, but XML isn't a bad or good thing. Storing data in XML is great for interchange, and allows for flexibility in format, something that tends to be more problematic with flat files. Not to say that XML is the best way, but it's a tool that works.

    However it does belong in the db, and there are valid cases for it. I might have data like this list of books, perhaps it's old data and not really queried, but I want it around. Perhaps I shred the title out of the XML and store that relationally, but I have a large amount of other data that I keep in XML as a blob. I can query it if I need it, or shred it, but I can also just store it as archival data.

    There are applications where it's not necessarily useful or pragmatic to develop a relational model and move data to/from XML on a regular basis, but it does serve a purpose to keep the XML around, have it backed up and synced with other relational data.

  • SQLRNNR (7/13/2015)


    Mikael Eriksson SE (7/13/2015)


    I guess this is the preferred form.

    SELECT

    col.value('(Title/text())[1]', 'varchar(50)') AS 'Book'

    FROM @x.nodes('/root/Books/Book') a(col)

    Using text() in the values clause removes one call to a table valued function and the UDX operator that is there to take care of mixed content XML.

    This is faster still if you only want the Title node.

    SELECT

    col.value('.', 'varchar(50)') AS 'Book'

    FROM @x.nodes('/root/Books/Book/Title/text()') a(col)

    There is no difference in query plan or performance between using //Book .//Book or /root/Books/Book in the nodes function.

    Thanks. In support of that, here is a little setup to show it.

    DECLARE @x XML =

    N'<root>

    <Books>

    <Book>

    <Title>Personal</Title>

    <Author>Lee Child</Author>

    <Publisher Pubdate="20150428">Dell</Publisher>

    </Book>

    <Book>

    <Title>Never Go Back</Title>

    <Author>Lee Child</Author>

    <Publisher Pubdate="20120312">Dell</Publisher>

    </Book>

    <Book>

    <Title>Ready Player One</Title>

    <Author>Ernest Cline</Author>

    <Publisher Pubdate="20110416">Broadway</Publisher>

    </Book>

    <Book>

    <Title>The Martian</Title>

    <Author>Andy Weir</Author>

    <Publisher Pubdate="20140211">Broadway</Publisher>

    </Book>

    </Books>

    </root>'

    SET STATISTICS TIME ON;

    --cost 135.66

    SELECT col.value('data(Title[1])', 'varchar(50)') AS 'Book'

    FROM @x.nodes('root/Books/Book') a ( col );

    PRINT '///////////////////////////////////////////////';

    --cost 2.23

    SELECT col.value('.', 'varchar(50)') AS 'Book'

    FROM @x.nodes('/root/Books/Book/Title/text()') a ( col );

    PRINT '///////////////////////////////////////////////';

    --cost 2.23

    SELECT col.value('.', 'varchar(50)') AS 'Book'

    FROM @x.nodes('/.//Book/Title/text()') a ( col );

    SET STATISTICS TIME OFF;

    This is a very trivial example, the structure of the XML and the size makes it ill fit for any performance testing.

    😎

  • Steve Jones - SSC Editor (7/13/2015)


    I tend to agree with Jeff, and others, but XML isn't a bad or good thing. Storing data in XML is great for interchange, and allows for flexibility in format, something that tends to be more problematic with flat files. Not to say that XML is the best way, but it's a tool that works.

    However it does belong in the db, and there are valid cases for it. I might have data like this list of books, perhaps it's old data and not really queried, but I want it around. Perhaps I shred the title out of the XML and store that relationally, but I have a large amount of other data that I keep in XML as a blob. I can query it if I need it, or shred it, but I can also just store it as archival data.

    There are applications where it's not necessarily useful or pragmatic to develop a relational model and move data to/from XML on a regular basis, but it does serve a purpose to keep the XML around, have it backed up and synced with other relational data.

    I agree that there are cases where XML makes sense. However there are also cases where it makes sense to convert to a more efficient format, and a factor of 3 in storage utilisation with an even higher factor in IO utilisation is the data is ever accesed selectively and a similar factor on the processor requirements for extracting the data from the XML mess once it is in the computer's main store form a very clear justification for believeing that data in the format used in this QotD should never be, and should never have been, formatted as XML.

    A CSV file with a column name line at the front was a well-established and widely used standard for data like that in this QotD long before XML existed, and people who switched to XML for this sort of data were just responding to all the stupid overhype about the shiny new format and clearly had no understanding at all of the costs they were incurring by switching from a reasonably efficient format to a grossly inefficient one. There is some data for which XML is reasonable, but that doesn't mean that Jeff's statement that moroffs who choose XML for to represent this sort of data (a task for which it is totally unsuitable) need educating (preferably by a fairly painful method) is any the less true.

    Ah well, we will have JSON in SQL Server soon (I trust) and that should give us better performance than XML (just as it's already being used to improve the perormance of formerly XML-based wesites). It's a pity that JSON came a couple of years after XML. Perhaps things happened in that order simply because using/adapting existing technology where appropriate was less prestigious than inventing a shiny new thing?

    Tom

  • Further on the discussion, here is a small test sample

    😎

    USE msdb;

    GO

    SET NOCOUNT ON;

    DECLARE @TXML XML = (

    SELECT --TOP 200

    *

    FROM sys.tables ST

    INNER JOIN sys.columns SC ON ST.object_id = SC.object_id

    INNER JOIN SYS.types STY ON SC.system_type_id = STY.system_type_id

    INNER JOIN sys.indexes SI ON ST.object_id = SI.object_id

    FOR XML PATH('SYSTABS'), ROOT('STUFF_FROM_MSDB'),TYPE

    )

    SET STATISTICS IO,TIME ON;

    SELECT

    NAME.DATA.value('.','NVARCHAR(128)')

    FROM @TXML.nodes('//name') AS NAME(DATA)

    SET STATISTICS TIME,IO OFF;

    SET STATISTICS IO,TIME ON;

    SELECT

    NAME.DATA.value('.','NVARCHAR(128)')

    FROM @TXML.nodes('/STUFF_FROM_MSDB/SYSTABS/name/text()') AS NAME(DATA)

    SET STATISTICS TIME,IO OFF;

    The statistics

    SQL Server Execution Times:

    CPU time = 468 ms, elapsed time = 507 ms.

    SQL Server Execution Times:

    CPU time = 296 ms, elapsed time = 374 ms.

    The execution plans

  • Like it or not (I don't), there are times when as a SQL Developer or SQL DBA you have no choice but to deal with XML (I have).

    I am not sure if I should be proud or sad that I by now have had so much erro^H^H^H^H experience with XQuery that I got this one correct.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Eirikur Eiriksson (7/13/2015)


    This is a very trivial example, the structure of the XML and the size makes it ill fit for any performance testing.

    😎

    Yes, quite so. 😎

    Doing it with a ten million leaf nodes instead of just four might deliver quite different results. :hehe:

    Most people I know who are involved in computing would regard a data extraction performance test using a table with just four rows as a total waste of time (and the few who wouldn't are idiots :crazy: ).

    I'm inclined to think that measuring performance of data extraction from just 4 leaf nodes in a uniform shallow XML tree is about as useful as a performance test on extracting the data from a table with just 4 rows.

    Tom

  • TomThomson (7/13/2015)


    Eirikur Eiriksson (7/13/2015)


    This is a very trivial example, the structure of the XML and the size makes it ill fit for any performance testing.

    😎

    Yes, quite so. 😎

    Doing it with a ten million leaf nodes instead of just four might deliver quite different results. :hehe:

    Most people I know who are involved in computing would regard a data extraction performance test using a table with just four rows as a total waste of time (and the few who wouldn't are idiots :crazy: ).

    I'm inclined to think that measuring performance of data extraction from just 4 leaf nodes in a uniform shallow XML tree is about as useful as a performance test on extracting the data from a table with just 4 rows.

    Seen it too often, if it "works" (POC) doesn't mean that it works. There isn't much of a difference between SQL, XML/XQuery or other technologies when it comes to testing, simple test cases for simple minds.

    😎

  • Eirikur Eiriksson (7/13/2015)


    SELECT

    NAME.DATA.value('.','NVARCHAR(128)')

    FROM @TXML.nodes('//name') AS NAME(DATA)

    SET STATISTICS TIME,IO OFF;

    SET STATISTICS IO,TIME ON;

    SELECT

    NAME.DATA.value('.','NVARCHAR(128)')

    FROM @TXML.nodes('/STUFF_FROM_MSDB/SYSTABS/name/text()') AS NAME(DATA)

    //name and /STUFF_FROM_MSDB/SYSTABS/name/text() is not equivalent. The first specifies the name node and the second specifies the text() node in the name node making the values clause do totally different things in this query. Either use //name/text() or /STUFF_FROM_MSDB/SYSTABS/name.

  • Doing it with a ten million leaf nodes instead of just four might deliver quite different results. :hehe:

    FYI: the plan will be identical regardless of what your XML variable contains. SQL Server does not poke around in it ti figure out the most optimal way to extract data.

    Performance however will of course be different and if bad things happen when adding more rows that could not be foreseen with less rows you have issues in the table valued functions. That is however mostly not the case. The issues you get with performance in XML queries in SQL Server is visible in the plan. Sometimes you can do something to fix it sometimes not.

  • Mikael Eriksson SE (7/13/2015)


    Eirikur Eiriksson (7/13/2015)


    SELECT

    NAME.DATA.value('.','NVARCHAR(128)')

    FROM @TXML.nodes('//name') AS NAME(DATA)

    SET STATISTICS TIME,IO OFF;

    SET STATISTICS IO,TIME ON;

    SELECT

    NAME.DATA.value('.','NVARCHAR(128)')

    FROM @TXML.nodes('/STUFF_FROM_MSDB/SYSTABS/name/text()') AS NAME(DATA)

    //name and /STUFF_FROM_MSDB/SYSTABS/name/text() is not equivalent. The first specifies the name node and the second specifies the text() node in the name node making the values clause do totally different things in this query. Either use //name/text() or /STUFF_FROM_MSDB/SYSTABS/name.

    ...nor did I state they were the same;-)

    It is a demonstration of the fact that not all XML queries are equal

    😎

  • Mikael Eriksson SE (7/13/2015)


    Doing it with a ten million leaf nodes instead of just four might deliver quite different results. :hehe:

    FYI: the plan will be identical regardless of what your XML variable contains. SQL Server does not poke around in it ti figure out the most optimal way to extract data.

    Performance however will of course be different and if bad things happen when adding more rows that could not be foreseen with less rows you have issues in the table valued functions. That is however mostly not the case. The issues you get with performance in XML queries in SQL Server is visible in the plan. Sometimes you can do something to fix it sometimes not.

    The plan may look the same but the amount of work can be different, there is no magic and the server does different amount work (iterations in the XML Reader table value functions), the memory used can be very different, all factors that affects the end results. SQL Server is a cleaver piece of software but it doesn't figure things out, it works them out.

    😎

  • TomThomson (7/13/2015)


    Eirikur Eiriksson (7/13/2015)


    This is a very trivial example, the structure of the XML and the size makes it ill fit for any performance testing.

    😎

    Yes, quite so. 😎

    Doing it with a ten million leaf nodes instead of just four might deliver quite different results. :hehe:

    Most people I know who are involved in computing would regard a data extraction performance test using a table with just four rows as a total waste of time (and the few who wouldn't are idiots :crazy: ).

    I'm inclined to think that measuring performance of data extraction from just 4 leaf nodes in a uniform shallow XML tree is about as useful as a performance test on extracting the data from a table with just 4 rows.

    Maybe so, but look at the XML plans for the larger data set versus the small data set - same plan. By using the text() it eliminates that function call which is deadly in XML.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • SQLRNNR (7/13/2015)


    Maybe so, but look at the XML plans for the larger data set versus the small data set - same plan. By using the text() it eliminates that function call which is deadly in XML.

    The plans in these cases are an indication of how the work is done, not a quantification of the effort, all you'll see are compute scalar and xml reader TVFNs, there are no statistics or other cardinality informations for the server to work on. More complex XML generally implies more complex tabular destinations, things get really complex really quickly when tracking relations, order of appearance, external (outside the XML) relations/references etc.

    😎

Viewing 15 posts - 16 through 30 (of 47 total)

You must be logged in to reply to this topic. Login to reply