Gain Space Using XML data type

  • Steve Jones - Editor (8/4/2009)


    Didn't test it, and perhaps there are none. In any case, I think it's been a nice debate/discussion of the technique.

    Hey Steve!

    CTRL+C, CTRL+V, F5 :laugh:

    (The test results from a run on my machine are also embedded in the comments).

    Seriously though, XML is cool an' all - and I have been known to dabble some with XQuery, XPath and the rest, but it is easy to get carried away with it and apply XML to everything.

    @G2: Good point. Sparse columns have some advantages in 2K8 that make them well worth a look too. The column_set column that gives you formatted XML based on just the columns with values in them is very cool (quick sketch below).

    @jeff: Thanks so much - have you noticed how many of my scripts have at least a passing resemblance to some you have posted in the past? Not entirely co-incidental 😉
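    For anyone who wants to try the column_set idea, here is a minimal sketch of the 2K8 syntax (table and column names invented purely for illustration):

    CREATE TABLE dbo.CandidateContact
    (
        CandidateID INTEGER NOT NULL,
        -- Sparse columns cost no storage when NULL, at the price of extra overhead when populated
        HomePhone   VARCHAR(20) SPARSE NULL,
        WorkPhone   VARCHAR(20) SPARSE NULL,
        -- The column set is computed: it returns just the populated sparse columns as XML
        AllProperties XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
    );

    INSERT dbo.CandidateContact (CandidateID, HomePhone) VALUES (1, '555-0100');

    -- Returns <HomePhone>555-0100</HomePhone> - columns without values simply don't appear
    SELECT CandidateID, AllProperties FROM dbo.CandidateContact;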

  • Paul White (8/4/2009)


    @Jeff: Thanks so much - have you noticed how many of my scripts have at least a passing resemblance to some you have posted in the past? Not entirely co-incidental 😉

    Heh... oboy have I noticed. Thank you for the high compliment. The embedded comments are the berries, too! If someone can't figure out what you've done by looking at your code, it's because they simply can't read. Nicely done.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks to all for the comments and suggestions. Really appreciate them.

    Paul - Thanks for the script. I shall go through it during the weekend and let you know.

  • The application of XML in an OLTP system is not very scalable. The goal of any OLTP system is to increase TPS, and XML is not built for speed. Also consider that most OLTP systems are highly normalized. Good luck joining from XML columns out to other tables and retaining your performance.

    I see XML fields as more applicable in static data scenarios. Also don't forget that SQL Server provides primary and secondary XML index functionality. This can be used to increase query performance over your XML, but great care is needed in designing your XML schema.
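    For illustration, a minimal sketch of the index syntax (table and index names are made up): the primary XML index persists a shredded node table for the whole column, and the secondary indexes are built on top of it to help particular query shapes.

    CREATE TABLE dbo.CandidateTests
    (
        CandidateID INTEGER NOT NULL PRIMARY KEY CLUSTERED,  -- a clustered PK is required first
        TestResults XML NOT NULL
    );

    -- The primary XML index shreds and persists every node in the column
    CREATE PRIMARY XML INDEX PXML_CandidateTests
        ON dbo.CandidateTests (TestResults);

    -- Secondary indexes (PATH, VALUE, PROPERTY) build on the primary one;
    -- a PATH index helps path-based predicates such as exist('/RT/RN[@H=15]')
    CREATE XML INDEX SXML_CandidateTests_Path
        ON dbo.CandidateTests (TestResults)
        USING XML INDEX PXML_CandidateTests
        FOR PATH;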

    Steve

  • Why bother with the database at all then? Just store the XML directly on the disk? 😀

    Seriously though, in this day and age physical space is not usually an issue.

    And, if you are going to be querying the data often, you will get better performance from a relational model.

    The example with Candidates and Tests hit close to home.

    We have a similar scenario and the original decision (made years ago) was to store the data as XML, for some of the reasons mentioned in the article.

    But -- guess what? -- querying the data became slower and slower...

    Eventually we had to write an APP that breaks the XML back into relational tables. :w00t:
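    In case it helps anyone facing the same job, the shredding query itself is simple enough these days - a rough sketch, borrowing the RT/RN document shape from Paul's test script:

    -- Shred one XML document of hourly readings per row back into relational rows
    SELECT T.[LocID],
           T.[Date],
           [Hour]     = N.x.value('@H', 'tinyint'),
           [Rainfall] = N.x.value('@V', 'decimal(10,8)')
    FROM   dbo.Rainfall_XML T
    CROSS APPLY T.Rainfall.nodes('/RT/RN') N(x);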

  • So don't get me wrong: XML support is a useful addition to the SQL Server toolkit - even in OLTP applications. As usual it's a case of using the right tool for the job.

    My main objection to this article is the claim that the equivalent XML structure is either smaller or faster than the 'traditional' equivalent. Unless we deliberately do something dumb, the traditional approach will be smaller and faster - I can't even think of an 'edge' case where this isn't true (provided dumbness is avoided).

    XML has definite advantages, including for semi-structured data and communicating with external components like web services... it's just not a panacea, especially when it comes to storage efficiency or outright speed.

    The article does include some great examples of query syntax with XML though!

    Paul

  • As a side bar... There is a rumor (and I STRESS the word RUMOR!) I heard and I can't confirm it or deny it. The rumor says that when you use any of the XML extensions in SQL Server, that it automatically allocates at least 1/8th of the available memory to the task. Does anyone know of a reference that can confirm or deny that rumor?

    --Jeff Moden



  • Jeff Moden (8/5/2009)


    As a side bar... There is a rumor (and I STRESS the word RUMOR!) I heard and I can't confirm it or deny it. The rumor says that when you use any of the XML extensions in SQL Server, that it automatically allocates at least 1/8th of the available memory to the task. Does anyone know of a reference that can confirm or deny that rumor?

    I think this is the thing that causes the confusion: sp_xml_preparedocument

    So that's just saying that when you use the XML parser to open a document for use with OPENXML, up to one-eighth of SQL Server memory is potentially available to Msxmlsql.dll - should it need it.

    This is the first MSFT answer I found with Google to back that up: MSDN Forums Thread.

    My impression (though the method has so far escaped deprecation) is that XML DML methods/XQuery/XPath/FOR XML and the gang are to be preferred to using sp_xml_preparedocument and OPENXML wherever possible. There are many difficulties associated with OPENXML so personally I avoid it.

    To be clear, all that applies specifically to prepare/remove document and OPENXML. XML variables, columns and so on allocate memory in the usual fashion. That applies to the XML portion of the query optimizer too. One interesting thing about xml variables (like all the MAX datatypes when used as variables) is that they require tempdb workspace.
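    Just to illustrate, the XQuery methods work directly on an xml variable or column with no handle management at all (a trivial sketch with a made-up document):

    DECLARE @x xml;
    SET @x = N'<RT><RN H="15" V="1.25"/></RT>';

    -- value() extracts singleton values; exist() and nodes() cover tests and shredding
    SELECT [Hour] = @x.value('(/RT/RN/@H)[1]', 'tinyint'),
           [Rain] = @x.value('(/RT/RN/@V)[1]', 'decimal(10,8)');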

    Many of the XML features use streaming interfaces to minimize memory consumption - bulk loading XML is a good example of this. See XML Support in Microsoft SQL Server 2005

    Paul

  • Here's an interesting variation:

    This script is the same as the last, except it also includes a test run with a denormalized version.

    Sample results included in the script comments as before.

    (The script also includes commented-out sections to try using vardecimal 'compression' too)

    Paul

    -- *****************************************************************************************
    -- WARNING: This script contains commands to clear the server-wide data and procedure caches
    -- Do not run on anything other than a TEST instance!
    --
    -- Please ensure Actual Execution Plan is OFF for representative results
    -- Total test time is 30s to 1 minute (including test data creation), depending on hardware
    -- *****************************************************************************************

    --
    -- Create a test database
    --
    USE master;
    GO
    RAISERROR('Creating test database...', 0, 1) WITH NOWAIT;
    CREATE DATABASE [E059791C-4F24-40CF-A54A-B21C34F6610D];
    GO

    --
    -- Switch to it
    --
    USE [E059791C-4F24-40CF-A54A-B21C34F6610D];
    GO

    --
    -- Enable vardecimal format
    --
    --EXECUTE sp_db_vardecimal_storage_format
    --    @dbname = N'E059791C-4F24-40CF-A54A-B21C34F6610D',
    --    @vardecimal_storage_format = 'ON';
    GO

    --
    -- Drop the test tables if they exist
    --
    IF OBJECT_ID(N'dbo.Rainfall_Con', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_Con;
    IF OBJECT_ID(N'dbo.Rainfall_XML', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_XML;
    IF OBJECT_ID(N'dbo.Rainfall_Con_DN', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_Con_DN;
    IF OBJECT_ID(N'tempdb..#BitBucket', N'U') IS NOT NULL DROP TABLE #BitBucket;
    GO

    --
    -- Create the conventional table
    --
    CREATE TABLE dbo.Rainfall_Con
    (
        [LocID]    INTEGER NOT NULL,
        [Date]     DATETIME NOT NULL,
        [Hour]     TINYINT NOT NULL,
        [Rainfall] DECIMAL(10,8) NOT NULL
    );
    GO

    --
    -- Populate the conventional table with one year's data for 100 locations
    -- This should complete in less than 30 seconds
    --
    DECLARE @Rows BIGINT;
    SET @Rows = 100 * 365 * 24;    -- 24 hours' data for 365 days for 100 locations

    RAISERROR('Creating test data...', 0, 1) WITH NOWAIT;

    WITH Numbers (N)
    AS  (
        -- Dynamically create rows numbered 0 to 875999
        SELECT TOP (@Rows)
               ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1
        FROM   master.sys.all_columns C1,
               master.sys.all_columns C2,
               master.sys.all_columns C3
        )
    INSERT dbo.Rainfall_Con WITH (TABLOCKX)
    (
        [LocID],
        [Date],
        [Hour],
        [Rainfall]
    )
    SELECT [LocID]    = (N / 365 / 24) + 1,
           [Date]     = DATEADD(DAY, (N / 24) % 366, '2008-01-01 00:00:00.000'),
           [Hour]     = N % 24,
           [Rainfall] = RAND(CHECKSUM(NEWID())) * 10    -- A new random value for each row
    FROM   Numbers;
    GO

    --
    -- Now create the primary key
    -- Doing this *after* inserting the data results in a compact and contiguous index
    -- If the index were to exist during the insert, index pages may split resulting in
    -- a lower data density, and significant fragmentation
    --
    RAISERROR('Creating index...', 0, 1) WITH NOWAIT;
    ALTER TABLE dbo.Rainfall_Con
    ADD CONSTRAINT PK_Rainfall_Con
    PRIMARY KEY CLUSTERED ([Date], [Hour], [LocID])
    WITH (FILLFACTOR = 100);
    GO

    --
    -- Create the XML table
    --
    CREATE TABLE dbo.Rainfall_XML
    (
        [LocID]    INTEGER NOT NULL,
        [Date]     DATETIME NOT NULL,
        [Rainfall] XML NOT NULL
    );
    GO

    --
    -- Populate the XML table from the conventional table
    --
    RAISERROR('Creating XML data...', 0, 1) WITH NOWAIT;
    INSERT dbo.Rainfall_XML WITH (TABLOCKX)
    SELECT [LocID],
           [Date],
           (
           -- The hourly data as XML
           SELECT H = [Hour],
                  V = [Rainfall]
           FROM   dbo.Rainfall_Con A WITH (READUNCOMMITTED)
           WHERE  A.[LocID] = B.[LocID]
           AND    A.[Date] = B.[Date]
           FOR XML RAW('RN'), ROOT('RT')
           )
    FROM   dbo.Rainfall_Con B WITH (READUNCOMMITTED)
    GROUP  BY
           [LocID],
           [Date];
    GO

    --
    -- Add the primary key now, for the same reasons as before
    --
    RAISERROR('Creating index...', 0, 1) WITH NOWAIT;
    ALTER TABLE dbo.Rainfall_XML
    ADD CONSTRAINT PK_Rainfall_XML
    PRIMARY KEY CLUSTERED ([Date], [LocID])
    WITH (FILLFACTOR = 100);
    GO

    --
    -- Create the denormalized table
    --
    CREATE TABLE dbo.Rainfall_Con_DN
    (
        [LocID] INTEGER NOT NULL,
        [Date]  DATETIME NOT NULL,
        [0]  DECIMAL(10,8) NOT NULL, [1]  DECIMAL(10,8) NOT NULL, [2]  DECIMAL(10,8) NOT NULL,
        [3]  DECIMAL(10,8) NOT NULL, [4]  DECIMAL(10,8) NOT NULL, [5]  DECIMAL(10,8) NOT NULL,
        [6]  DECIMAL(10,8) NOT NULL, [7]  DECIMAL(10,8) NOT NULL, [8]  DECIMAL(10,8) NOT NULL,
        [9]  DECIMAL(10,8) NOT NULL, [10] DECIMAL(10,8) NOT NULL, [11] DECIMAL(10,8) NOT NULL,
        [12] DECIMAL(10,8) NOT NULL, [13] DECIMAL(10,8) NOT NULL, [14] DECIMAL(10,8) NOT NULL,
        [15] DECIMAL(10,8) NOT NULL, [16] DECIMAL(10,8) NOT NULL, [17] DECIMAL(10,8) NOT NULL,
        [18] DECIMAL(10,8) NOT NULL, [19] DECIMAL(10,8) NOT NULL, [20] DECIMAL(10,8) NOT NULL,
        [21] DECIMAL(10,8) NOT NULL, [22] DECIMAL(10,8) NOT NULL, [23] DECIMAL(10,8) NOT NULL
    );
    GO

    --EXECUTE sp_tableoption
    --    @TableNamePattern = N'dbo.Rainfall_Con_DN',
    --    @OptionName = 'vardecimal storage format',
    --    @OptionValue = 'ON';
    GO

    --
    -- Populate the denormalized table from the conventional table
    --
    RAISERROR('Creating denormalized data...', 0, 1) WITH NOWAIT;
    INSERT dbo.Rainfall_Con_DN WITH (TABLOCKX)
    (
        [LocID],
        [Date],
        [0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],
        [13],[14],[15],[16],[17],[18],[19],[20],[21],[22],[23]
    )
    SELECT P.[LocID],
           P.[Date],
           P.[0],P.[1],P.[2],P.[3],P.[4],P.[5],P.[6],
           P.[7],P.[8],P.[9],P.[10],P.[11],P.[12],
           P.[13],P.[14],P.[15],P.[16],P.[17],P.[18],
           P.[19],P.[20],P.[21],P.[22],P.[23]
    FROM   (
           SELECT DISTINCT
                  [LocID],
                  [Date]
           FROM   dbo.Rainfall_Con WITH (READUNCOMMITTED)
           ) B
    CROSS
    APPLY  (
           -- The hourly data
           SELECT H = [Hour],
                  V = [Rainfall]
           FROM   dbo.Rainfall_Con A WITH (READUNCOMMITTED)
           WHERE  A.[LocID] = B.[LocID]
           AND    A.[Date] = B.[Date]
           ) A
    PIVOT  (
           MAX(V)
           FOR H IN (
               [0],[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],
               [13],[14],[15],[16],[17],[18],[19],[20],[21],[22],[23]
               )
           ) P;
    GO

    --
    -- Add the primary key now, for the same reasons as before
    --
    RAISERROR('Creating index...', 0, 1) WITH NOWAIT;
    ALTER TABLE dbo.Rainfall_Con_DN
    ADD CONSTRAINT PK_Rainfall_Con_DN
    PRIMARY KEY CLUSTERED ([Date], [LocID])
    WITH (FILLFACTOR = 100);
    GO

    --
    -- Show the space used by each table
    --
    -- Results on my system:
    --
    -- Conventional : 876,000 rows, 27016KB reserved, 26856KB data,  96KB index, 64KB unused
    -- XML          :  36,500 rows, 36504KB reserved, 36504KB data, 112KB index, 64KB unused
    -- Denormalized :  36,500 rows,  8968KB reserved,  8856KB data,  40KB index, 72KB unused
    -- Denormalized :  36,500 rows,  7240KB reserved,  7128KB data,  40KB index, 72KB unused (with vardecimal storage)
    --
    -- ### XML representation is 35% *larger* ###
    --
    EXECUTE sp_spaceused Rainfall_Con;
    EXECUTE sp_spaceused Rainfall_XML;
    EXECUTE sp_spaceused Rainfall_Con_DN;
    GO

    --
    -- This table is used to capture the results from the test
    -- SELECT statements. This avoids any delays in sending the
    -- rows to the client from skewing the results
    --
    CREATE TABLE #BitBucket
    (
        [LocID]    INTEGER NOT NULL,
        [Date]     DATETIME NOT NULL,
        [Hour]     TINYINT NOT NULL,
        [Rainfall] DECIMAL(10,8) NOT NULL
    );
    GO

    --
    -- TEST START
    --
    RAISERROR('Starting test', 0, 1) WITH NOWAIT;

    --
    -- Remove all ad-hoc SQL query plans from cache
    --
    DBCC FREESYSTEMCACHE (N'SQL Plans');
    GO

    --
    -- Start with a cold data cache
    -- Each implementation is run 10 times
    -- One time with a cold cache, then nine times with a warm cache
    -- This seems more representative to me, feel free to change it
    --
    DBCC DROPCLEANBUFFERS;
    GO

    /* SQL */
    INSERT #BitBucket WITH (TABLOCKX)
    SELECT TP.[LocID],
           TP.[Date],
           TP.[Hour],
           TP.[Rainfall]
    FROM   dbo.Rainfall_Con TP WITH (READUNCOMMITTED)
    WHERE  TP.[Date] = '20080106'
    AND    TP.[Hour] = 15
    ORDER  BY
           LocID ASC;
    GO 10

    --
    -- For fairness, recreate the dump table for the second run
    -- TRUNCATE TABLE might give the second run an unfair advantage
    -- since mixed and uniform extent allocations would not need
    -- to be re-done
    --
    DROP TABLE #BitBucket;
    GO
    CREATE TABLE #BitBucket
    (
        [LocID]    INTEGER NOT NULL,
        [Date]     DATETIME NOT NULL,
        [Hour]     TINYINT NOT NULL,
        [Rainfall] DECIMAL(10,8) NOT NULL
    );
    GO

    --
    -- Second run - XML
    --
    DBCC DROPCLEANBUFFERS;
    GO

    /* XML */
    INSERT #BitBucket WITH (TABLOCKX)
    SELECT TP.[LocID],
           TP.[Date],
           [Hour]       = Rainfall.value('(/RT/RN[@H=15]/@H)[1]', 'tinyint'),
           Rainfall_Con = Rainfall.value('(/RT/RN[@H=15]/@V)[1]', 'numeric(10,8)')
    FROM   dbo.Rainfall_XML TP WITH (READUNCOMMITTED)
    WHERE  TP.[Date] = '20080106'
    ORDER  BY
           LocID;
    GO 10

    --
    -- Recreate the dump table for the third run
    --
    DROP TABLE #BitBucket;
    GO
    CREATE TABLE #BitBucket
    (
        [LocID]    INTEGER NOT NULL,
        [Date]     DATETIME NOT NULL,
        [Hour]     TINYINT NOT NULL,
        [Rainfall] DECIMAL(10,8) NOT NULL
    );
    GO

    --
    -- Third run - denormalized
    --
    DBCC DROPCLEANBUFFERS;
    GO

    /* DN */
    INSERT #BitBucket WITH (TABLOCKX)
    SELECT TP.[LocID],
           TP.[Date],
           [Hour]   = CONVERT(TINYINT, 15),
           Rainfall = [15]
    FROM   dbo.Rainfall_Con_DN TP WITH (READUNCOMMITTED)
    WHERE  TP.[Date] = '20080106'
    ORDER  BY
           LocID;
    GO 10

    --
    -- Show the total execution resource usage for all ten runs
    --
    SELECT query                 = LEFT(QT.[text], 9),
           execution_plan        = QP.query_plan,
           run_count             = QS.execution_count,
           total_cpu_time_µs     = QS.total_worker_time,
           total_logical_reads   = QS.total_logical_reads,
           total_elapsed_time_µs = QS.total_elapsed_time,
           avg_cpu_time_µs       = QS.total_worker_time / QS.execution_count,
           avg_logical_reads     = QS.total_logical_reads / QS.execution_count,
           avg_elapsed_time_µs   = QS.total_elapsed_time / QS.execution_count
    FROM   sys.dm_exec_query_stats QS
    CROSS
    APPLY  sys.dm_exec_sql_text (QS.[sql_handle]) QT
    CROSS
    APPLY  sys.dm_exec_query_plan (QS.[plan_handle]) QP
    WHERE  QT.[text] LIKE '%Rainfall%'
    AND    QT.[text] NOT LIKE '%dm_exec_query_stats%'
    ORDER  BY
           QS.last_execution_time ASC;
    GO

    --
    -- Sample results:
    --
    -- SQL Averages: CPU   390µs; Elapsed time:  4785µs; Logical reads: 113
    -- DN  Averages: CPU   488µs; Elapsed time:  1660µs; Logical reads: 117 (without vardecimal)
    -- XML Averages: CPU 76171µs; Elapsed time: 79492µs; Logical reads: 126
    --
    -- Tidy up
    --
    GO
    IF OBJECT_ID(N'dbo.Rainfall_Con', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_Con;
    IF OBJECT_ID(N'dbo.Rainfall_XML', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_XML;
    IF OBJECT_ID(N'dbo.Rainfall_Con_DN', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_Con_DN;
    IF OBJECT_ID(N'tempdb..#BitBucket', N'U') IS NOT NULL DROP TABLE #BitBucket;
    GO
    USE master;
    GO
    DROP DATABASE [E059791C-4F24-40CF-A54A-B21C34F6610D];
    GO
    RAISERROR('Test run complete', 0, 1) WITH NOWAIT;
    --
    -- END SCRIPT
    --

  • Paul,

    When I copy / paste your script into SSMS or a text editor, it all appears on one line.

    Any ideas what I am doing wrong?

  • ....it's ok, now.

    ...appears to be javascript related!

  • David McKinney (8/6/2009)


    ....it's ok, now.

    ...appears to be javascript related!

    Welcome to the users-frustrated-with-the-site-code-windows club! 🙂

  • Don't forget to include the sp_xml_removedocument call, or you will have huge memory problems with your SQL Server instance, eventually forcing a reboot of the database server.
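    For reference, the pattern pairs the two calls (a minimal sketch with a made-up document - if the handle is not released, the parsed document stays in MSXML's memory for the life of the session):

    DECLARE @hDoc INTEGER;
    DECLARE @xml NVARCHAR(MAX);
    SET @xml = N'<RT><RN H="15" V="1.25"/></RT>';

    -- Parse the document; this is the call that hands memory to the MSXML parser
    EXECUTE sp_xml_preparedocument @hDoc OUTPUT, @xml;

    SELECT [Hour] = H, [Rain] = V
    FROM   OPENXML(@hDoc, '/RT/RN', 1)    -- flags = 1: attribute-centric mapping
    WITH   (H TINYINT '@H', V DECIMAL(10,8) '@V');

    -- Always release the handle when done
    EXECUTE sp_xml_removedocument @hDoc;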

  • Paul White (8/5/2009)


    Jeff Moden (8/5/2009)


    As a side bar... There is a rumor (and I STRESS the word RUMOR!) I heard and I can't confirm it or deny it. The rumor says that when you use any of the XML extensions in SQL Server, that it automatically allocates at least 1/8th of the available memory to the task. Does anyone know of a reference that can confirm or deny that rumor?

    I think this is the thing that causes the confusion: sp_xml_preparedocument

    So that's just saying that when you use the XML parser to open a document for use with OPENXML, up to one-eighth of SQL Server memory is potentially available to Msxmlsql.dll - should it need it.

    This is the first MSFT answer I found with Google to back that up: MSDN Forums Thread.

    My impression (though the method has so far escaped deprecation) is that XML DML methods/XQuery/XPath/FOR XML and the gang are to be preferred to using sp_xml_preparedocument and OPENXML wherever possible. There are many difficulties associated with OPENXML so personally I avoid it.

    To be clear, all that applies specifically to prepare/remove document and OPENXML. XML variables, columns and so on allocate memory in the usual fashion. That applies to the XML portion of the query optimizer too. One interesting thing about xml variables (like all the MAX datatypes when used as variables) is that they require tempdb workspace.

    Many of the XML features use streaming interfaces to minimize memory consumption - bulk loading XML is a good example of this. See XML Support in Microsoft SQL Server 2005

    Paul

    Ah... got it. My only confusion now would be the same as on the MSDN thread because the note in BOL simply states "The MSXML parser uses one-eighth the total memory available for SQL Server." It doesn't say "up to"...

    --Jeff Moden



  • Jeff Moden (8/6/2009)


    Ah... got it. My only confusion now would be the same as on the MSDN thread because the note in BOL simply states "The MSXML parser uses one-eighth the total memory available for SQL Server." It doesn't say "up to"...

    Yes the wording could be better. I refreshed my memory on the detail last night (one of Ken Henderson's excellent books) so I just want to add:

    SQL Server uses MSXML and the DOM to process documents you load via sp_xml_preparedocument. It restricts the virtual memory MSXML can use for DOM processing to one-eighth of the physical memory on the machine or 500MB, whichever is less.

    It's unlikely that MSXML would be able to get 500MB of virtual memory since SQL Server reserves most of the user mode address space for use by the buffer cache. MSXML allocates memory via the multi-page allocator (the so-called MemToLeave region) which has a default maximum size of 256MB. Since that region is used for all sorts of stuff (by SQL Server as well) it's likely to be much less than that.

    MSXML is limited to 500MB of virtual memory regardless of the amount of memory on the machine because SQL Server calls the GlobalMemoryStatus Win32 API function to determine the amount of available physical memory. On machines with more than 4GB of physical memory, GlobalMemoryStatus can return incorrect information, so Windows returns a -1 to indicate an overflow. The Win32 API function GlobalMemoryStatusEx exists to address this shortcoming, but SQLXML does not call it. :laugh:

    One last thing that I had forgotten - MSXML had an ordinal-only DLL entry point added specifically for SQL Server to allow it to set the maximum virtual memory to use. Interesting stuff!

    Paul
