Gain Space Using XML data type

  • Very interesting indeed! Thanks for the great write up, Paul. That certainly puts a different slant on "the rumor".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Excellent explanation, Paul. Perhaps we should have you writing XML articles 😉

  • Hi Paul

    I executed the script and it does prove your point. The script is indeed an intelligent and efficient option. It has made me rethink our XML implementation, which has been in place here for quite some time now. Besides, it's given me some very important tips that I can really make use of.

    The XML design, however, still stands out here for the situational reasons mentioned below:

    1. Index: That's the basic difference between the scripts. As mentioned in the article, this design is for an OLTP table into which data is continuously inserted, so the index has to be in place already. Secondly, the data does not necessarily arrive in the order of the columns in the clustered index, or even in date order; it is uploaded as and when different locations send in their data to the storage center. That would lead to fragmentation and ultimately slow performance. Hence we had to take the decision to drop the clustered index and create a non-clustered index instead, even before moving to XML. That way the space requirement for XML was lower. For the same reason, we could never have Hour before Location in the index key order.

    2. When a particular location sends in its data, the application preferably sends the complete data set to the database rather than one datapoint at a time. And what better format to pass the data than XML itself? In the conventional format, the SQL procedure has to parse that XML and then perform the insert, while in the XML implementation we just insert the XML as-is into the table (see the sketch after this list).

    3. One very important point, again relevant only to this situation, is the way the data is retrieved by the application. As I mentioned, most of the time the application requires a complete dataset rather than individual datapoints. In such cases the database returns the dataset as XML, as-is, instead of as a tabular resultset. The application has better tools for mapping the XML to a graph or any other display. The only reason the dataset is not stored in a binary format is for those rare cases where SQL itself has to dig out an individual datapoint, and that is the scenario used in your script.
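    To make point 2 concrete, here is a minimal sketch of the insert path under each design. The procedure name and parameters are illustrative only (they are not from our actual system); the /RT/RN element and @H/@V attribute names follow the format used in the test queries further below.

    CREATE PROCEDURE dbo.usp_AddLocationData  -- hypothetical name
        @LocID integer,
        @Date  datetime,
        @Data  xml
    AS
    BEGIN
        SET NOCOUNT ON;

        -- XML design: store the payload exactly as received
        INSERT dbo.Rainfall_XML (LocID, [Date], Rainfall)
        SELECT @LocID, @Date, @Data;

        -- Conventional design would instead shred the payload, e.g.:
        -- INSERT dbo.Rainfall_Con (LocID, [Date], [Hour], Rainfall)
        -- SELECT @LocID, @Date,
        --        T.Item.value('./@H', 'tinyint'),
        --        T.Item.value('./@V', 'numeric(10,8)')
        -- FROM   @Data.nodes('/RT/RN') AS T(Item);
    END;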

    I would request you to test a data-insert scenario with the clustered index in place and compare the results. Please also look at the results when the database sends the data in XML format directly to the application. I tried the following queries:

    Data entry scenario (on top of 3 years of data):

    DBCC FREESYSTEMCACHE (N'SQL Plans');
    GO
    DBCC DROPCLEANBUFFERS;
    GO

    /* XML */
    DECLARE @data xml
    SELECT @data = ''
    INSERT Rainfall_XML (LocID, Date, Rainfall)
    SELECT 1, '20060125', @data
    GO

    DBCC DROPCLEANBUFFERS;
    GO

    /* SQL */
    DECLARE @data xml
    SELECT @data = ''
    INSERT Rainfall_Con (LocID, Date, [Hour], Rainfall)
    SELECT 1, '20060125',
           T.Item.value('./@H', 'tinyint')       AS [Hour],
           T.Item.value('./@V', 'numeric(10,8)') AS Rainfall
    FROM @data.nodes('/RT/RN') AS T(Item)
    GO

    Data retrieval in XML format:

    IF OBJECT_ID(N'tempdb..#BitBucket', N'U') IS NOT NULL
        DROP TABLE #BitBucket;
    GO
    --
    -- This table is used to capture the results from the test
    -- SELECT statements. This avoids any delays in sending the
    -- rows to the client from skewing the results
    --
    CREATE TABLE #BitBucket
    (
        [LocID]    INTEGER NOT NULL,
        [Date]     DATETIME NOT NULL,
        [Hour]     TINYINT NOT NULL,
        [Rainfall] DECIMAL(10,8) NOT NULL
    );
    GO
    --
    -- TEST START
    --
    RAISERROR('Starting test', 0, 1) WITH NOWAIT;
    --
    -- Remove all ad-hoc SQL query plans from cache
    --
    DBCC FREESYSTEMCACHE (N'SQL Plans');
    GO
    --
    -- Start with a cold data cache
    -- Each implementation is run 10 times
    -- One time with a cold cache, then nine times with a warm cache
    -- This seems more representative to me, feel free to change it
    --
    DBCC DROPCLEANBUFFERS;
    GO
    /* SQL */
    INSERT #BitBucket WITH (TABLOCKX)
    SELECT TP.[LocID],
           TP.[Date],
           TP.[Hour],
           TP.[Rainfall]
    FROM dbo.Rainfall_Con TP WITH (READUNCOMMITTED)
    WHERE TP.LocID IN (1,4,6,8)
    AND TP.Date = '20080106'
    GO 10
    --
    -- For fairness, recreate the dump table for the second run
    -- TRUNCATE TABLE might give the second run an unfair advantage
    -- since mixed and uniform extent allocations would not need
    -- to be re-done
    --
    DROP TABLE #BitBucket;
    GO
    CREATE TABLE #BitBucket
    (
        [LocID]    INTEGER NOT NULL,
        [Date]     DATETIME NOT NULL,
        [Rainfall] XML NOT NULL
    );
    GO
    --
    -- Second run - XML
    --
    DBCC DROPCLEANBUFFERS;
    GO
    /* XML */
    INSERT #BitBucket WITH (TABLOCKX)
    SELECT TP.[LocID],
           TP.[Date],
           TP.Rainfall
    FROM Rainfall_XML TP WITH (READUNCOMMITTED)
    WHERE TP.LocID IN (1,4,6,8)
    AND TP.Date = '20080106'
    -- (4 row(s) affected)
    GO 10

    I do agree that this design cannot be generalized at all and would apply to only a limited number of scenarios. Your feedback would be sincerely appreciated.

    Thanks

    Satish More

  • Hey Satish,

    Many thanks for the comprehensive reply, and for the extra details. I should make it clear that I am not at all anti-XML; I use it frequently wherever appropriate. The point of the two scripts I posted was to challenge some of the assertions made in the article, not to try to claim that a non-XML solution is always to be preferred. There are very few absolutes in the SQL Server world. 🙂

    That said, I do want to respond to some of the new points raised:

    Whenever data is inserted out of sequence, page splits may occur and affected indexes will fragment. This obviously applies to clustered and non-clustered indexes alike. We should remember that this is only a performance issue if large range scans are performed regularly - logical fragmentation has no effect on singleton selects.
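    If you want to check whether fragmentation is actually significant before acting on it, here is a quick sketch using the physical stats DMV (substitute your own table name for dbo.Rainfall_Con):

    SELECT  OBJECT_NAME(ips.[object_id])  AS table_name,
            i.name                        AS index_name,
            ips.index_type_desc,
            ips.avg_fragmentation_in_percent,
            ips.page_count
    FROM    sys.dm_db_index_physical_stats(
                DB_ID(), OBJECT_ID(N'dbo.Rainfall_Con'),
                NULL, NULL, 'LIMITED') AS ips
    JOIN    sys.indexes AS i
        ON  i.[object_id] = ips.[object_id]
        AND i.index_id    = ips.index_id;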

    If page splitting and fragmentation do cause measurable performance problems, there are many ways to mitigate the issue. The affected indexes can be created with a fillfactor which retains just enough free space (at the leaf level, and optionally at the intermediate levels using PAD_INDEX) to ensure that pages do not split before the next index reorganization or rebuild. In Enterprise Edition, online index operations and the ability to rebuild a single partition (offline) make this less problematic on systems which must be continuously available.
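    As a sketch (the index name and the 90% fillfactor are purely illustrative - the right value depends on how quickly the pages fill up between maintenance windows):

    CREATE NONCLUSTERED INDEX IX_Rainfall_Con_LocID_Date
    ON dbo.Rainfall_Con (LocID, [Date], [Hour])
    WITH (FILLFACTOR = 90, PAD_INDEX = ON);

    -- Enterprise Edition: rebuild while the index remains available
    ALTER INDEX IX_Rainfall_Con_LocID_Date
    ON dbo.Rainfall_Con
    REBUILD WITH (ONLINE = ON);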

    As far as the space implications of extra non-clustered indexes are concerned, I would ask you to take a look at the second script I posted, which uses a denormalized structure. This achieves the best space savings of all, and is the fastest at returning results - significantly improving on the previous non-XML implementation.

    We should also remember that the XML data may require several XML indexes (including a primary index which will be a complete copy of the XML data). A final point on XML is that once the data length exceeds 8000 bytes, the data is physically stored as a LOB - exactly as for text and image columns (as well as the MAX types) - with all the potential performance issues that implies.
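    For reference, a sketch of the index definitions involved (the index names are illustrative; a primary XML index also requires a clustered primary key on the base table):

    -- The primary XML index persists a shredded copy of every XML value
    CREATE PRIMARY XML INDEX PXML_Rainfall
    ON dbo.Rainfall_XML (Rainfall);

    -- Secondary XML indexes (PATH, VALUE, PROPERTY) are built on top of it
    CREATE XML INDEX SXML_Rainfall_Path
    ON dbo.Rainfall_XML (Rainfall)
    USING XML INDEX PXML_Rainfall
    FOR PATH;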

    Whether the extra processing on INSERT is worthwhile depends on your requirements. If you are processing millions of new rows each day, it might be an issue; often it does not really matter too much. If the application is sending large blocks of XML, it might be worth breaking it up using the nodes method and storing it as denormalized values simply to avoid the LOB issues mentioned. Equally, it may make sense to store the XML as-is. A lot depends on the queries which will access the data - both for retrieval by the application, and for any reporting you may have. Retrieving data as XML is fairly trivial since we can use FOR XML to efficiently construct XML from the denormalized data in a simple SELECT (see the sketch below).
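    For example, a minimal FOR XML PATH sketch that rebuilds the /RT/RN shape from your test scripts out of the denormalized table (table and column names as in the scripts above):

    SELECT  TP.[Hour]     AS [@H],
            TP.[Rainfall] AS [@V]
    FROM    dbo.Rainfall_Con AS TP
    WHERE   TP.LocID = 1
    AND     TP.[Date] = '20080106'
    FOR XML PATH('RN'), ROOT('RT');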

    So, there are always reasons for and against. I would probably be tempted to stay with XML storage in your situation - it seems a good fit, so long as ad-hoc data access can be made to work efficiently. The key factor for me is that the data is not large enough to warrant heavy optimization on the server, and the application clearly prefers to work with XML.

    My main point is that it is normally possible to produce a non-XML solution which uses less space and performs faster, but that is not often the overriding concern. Thanks again for an interesting discussion!

    Paul

    Thanks a lot again, Paul, for the detailed information. Every post of yours gives me fresh and useful tips. :-)

  • Sunny (8/10/2009)


    Thanks a lot again, Paul, for the detailed information. Every post of yours gives me fresh and useful tips. :-)

    ...and for those of us watching from the sidelines, it's been great too!

    Thanks for all the effort, guys!

  • Yes, this turned out to be another great discussion - as so often happens on SSC, an article prompts conversation which is just as interesting as the article itself, in many ways. Cheers everyone, whether you actively participated or 'watched from the sidelines'!

    Paul

    Just be aware that XML indexes are huge, so if you want to get any performance out of querying the data, you're actually going to end up using a lot more space!
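    One way to see that overhead for yourself: the primary XML index is persisted as an internal node table, so you can sum its pages with a sketch like this (assuming the Rainfall_XML table from the discussion above):

    SELECT  it.name AS internal_table_name,
            SUM(ps.used_page_count) * 8 AS used_kb
    FROM    sys.internal_tables AS it
    JOIN    sys.dm_db_partition_stats AS ps
        ON  ps.[object_id] = it.[object_id]
    WHERE   it.internal_type_desc = N'XML_INDEX_NODES'
    AND     it.parent_id = OBJECT_ID(N'dbo.Rainfall_XML')
    GROUP BY it.name;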


    My blog: SQL Soldier
    SQL Server Best Practices
    Twitter: @SQLSoldier
    My book: Pro SQL Server 2008 Mirroring
    Microsoft Certified Master: SQL Server, Data Platform MVP
    Database Engineer at BlueMountain Capital Management
