Gain Space Using XML data type

  • Comments posted to this topic are about the item Gain Space Using XML data type

  • Hi,

    I notice you're using very short tag names in your xml presumably to reduce space used.

    If you use a typed xml column i.e. associate it with an xsd, the length of the tags shouldn't matter, and you may get other associated benefits. When I've some time I'll do a comparison.

    Regards,

    David McKinney.

  • Hi,

    You can reduce the space similarly without XML. You can use the table instead of the XML column.

    CREATE TABLE Rainfall_LocDate

    (LocDateId int IDENTITY(1, 1)

    ,LocID int

    ,Date datetime

    ,CONSTRAINT PK_Rainfall_LocDate PRIMARY KEY NONCLUSTERED(LocID,Date)

    )

    CREATE TABLE Rainfall_Data

    (LocDateId int

    ,Hour tinyint

    ,Rainfall numeric(10,8)

    ,CONSTRAINT PK_Rainfall_Data PRIMARY KEY NONCLUSTERED(LocDId,Hour)

    )

    Regards,

    Pavel Krticka

  • The space savings are impressive and it does lend itself to a number of real-world situations but you also need to highlight the shortcomings of the XML data type as shown in SQL books online

    http://msdn.microsoft.com/en-us/library/ms189887.aspx

    Limitations of the xml Data Type

    Note the following general limitations that apply to the xml data type:

    The stored representation of xml data type instances cannot exceed 2 GB.

    It cannot be used as a subtype of a sql_variant instance.

    It does not support casting or converting to either text or ntext. Use varchar(max) or nvarchar(max) instead.

    It cannot be compared or sorted. This means an xml data type cannot be used in a GROUP BY statement.

    It cannot be used as a parameter to any scalar, built-in functions other than ISNULL, COALESCE, and DATALENGTH.

    It cannot be used as a key column in an index. However, it can be included as data in a clustered index or explicitly added to a nonclustered index by using the INCLUDE keyword when the nonclustered index is created

  • A very interesting article indeed!

    However, I think that you shouldn't compare query performance without any index in the conventional table.

    If you add an index (code below) and rerun your query, your query performance will be about 1000 times better. (Yet, the space requirement will also increase remarkably.)

    Rgds,

    Pexi

    CREATE NONCLUSTERED INDEX [IX_Rainfall_Con_Date_Hour] ON [Rainfall_Con]

    (

    [Date] ASC,

    [Hour] ASC,

    [LocID] ASC,

    [Rainfall] ASC

    )WITH (

    PAD_INDEX = OFF,

    STATISTICS_NORECOMPUTE = OFF,

    SORT_IN_TEMPDB = OFF,

    IGNORE_DUP_KEY = OFF,

    DROP_EXISTING = OFF,

    ONLINE = OFF,

    ALLOW_ROW_LOCKS = ON,

    ALLOW_PAGE_LOCKS = ON)

    ON [PRIMARY]

  • This script is closely based on that in the article.

    I would strongly encourage the author of this article to try it out 🙂

    Nice idea for an article, but size and speed are not XML strengths! 😉

    Paul

    -- *****************************************************************************************

    -- WARNING: This script contains commands to clear the server-wide data and procedure caches

    -- Do not run on anything other than a TEST instance!

    --

    -- Please ensure Actual Execution Plan is OFF for representative results

    -- Total test time is 10 to 60 seconds (including test data creation), depending on hardware

    -- *****************************************************************************************

    USE tempdb;

    GO

    --

    -- Drop the test tables if they exist

    --

    IF OBJECT_ID(N'dbo.Rainfall_Con', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_Con

    IF OBJECT_ID(N'dbo.Rainfall_XML', N'U') IS NOT NULL DROP TABLE Rainfall_XML;

    IF OBJECT_ID(N'tempdb..#BitBucket', N'U') IS NOT NULL DROP TABLE #BitBucket;

    GO

    --

    -- Create the conventional table

    --

    CREATE TABLE dbo.Rainfall_Con

    (

    [LocID] INTEGER NOT NULL,

    [Date] DATETIME NOT NULL,

    [Hour] TINYINT NOT NULL,

    [Rainfall] DECIMAL(10,8) NOT NULL,

    );

    GO

    --

    -- Populate the conventional table with one year's data for 100 locations

    -- This should complete in less than 30 seconds

    --

    DECLARE @Rows BIGINT;

    SET @Rows = 100 * 365 * 24; -- 24 hours' data for 365 days for 100 locations

    RAISERROR('Creating test data...', 0, 1) WITH NOWAIT;

    ;WITH Numbers (N)

    AS (

    -- Dynamically create rows numbered 0 to 875999

    SELECT TOP (@Rows)

    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1

    FROM master.sys.all_columns C1,

    master.sys.all_columns C2,

    master.sys.all_columns C3

    )

    INSERT dbo.Rainfall_Con WITH (TABLOCKX)

    (

    [LocID],

    [Date],

    [Hour],

    [Rainfall]

    )

    SELECT LocID = (N / 365 / 24) + 1,

    [Date] = DATEADD(DAY, (N / 24) % 366 , '2008-01-01 00:00:00.000'),

    [Hour] = N % 24,

    [Rainfall] = RAND(CHECKSUM(NEWID())) * 10 -- A new random value for each row

    FROM Numbers;

    GO

    --

    -- Now create the primary key

    -- Doing this *after* inserting the data results in a compact and contiguous index

    -- If the index were to exist during the insert, index pages may split resulting in

    -- a lower data density, and significant fragmentation

    --

    RAISERROR('Creating index...', 0, 1) WITH NOWAIT;

    ALTER TABLE dbo.Rainfall_Con

    ADD CONSTRAINT PK_Rainfall_Con

    PRIMARY KEY CLUSTERED([Date], [Hour], [LocID])

    WITH (FILLFACTOR = 100);

    GO

    --

    -- Create the XML table

    --

    CREATE TABLE dbo.Rainfall_XML

    (

    [LocID] INTEGER NOT NULL,

    [Date] DATETIME NOT NULL,

    [Rainfall] XML NOT NULL,

    );

    GO

    --

    -- Populate the XML table from the conventional table

    --

    RAISERROR('Creating XML data...', 0, 1) WITH NOWAIT;

    INSERT Rainfall_XML WITH (TABLOCKX)

    SELECT [LocID],

    [Date],

    (

    -- The hourly data as XML

    SELECT H = [Hour],

    V = [Rainfall]

    FROM dbo.Rainfall_Con A WITH (READUNCOMMITTED)

    WHERE A.[LocID] = B.[LocID]

    AND A.[Date] = B.[Date]

    FOR XML RAW('RN'), ROOT('RT')

    )

    FROM Rainfall_Con B WITH (READUNCOMMITTED)

    GROUP BY

    [LocID],

    [Date];

    GO

    --

    -- Add the primary key now, for the same reasons as before

    --

    RAISERROR('Creating index...', 0, 1) WITH NOWAIT;

    ALTER TABLE dbo.Rainfall_XML

    ADD CONSTRAINT PK_Rainfall_XML

    PRIMARY KEY CLUSTERED([Date], [LocID])

    WITH (FILLFACTOR = 100);

    GO

    --

    -- Show the space used by each table

    --

    -- Results on my system:

    --

    -- Conventional : 876,000 rows, 27016KB reserved, 26856KB data, 96KB index, 64KB unused

    -- XML : 36,500 rows, 36504KB reserved, 36504KB data, 112KB index, 64KB unused

    --

    -- ### XML representation is 35% *larger* ###

    --

    EXECUTE sp_spaceused Rainfall_Con;

    EXECUTE sp_spaceused Rainfall_XML;

    GO

    --

    -- This table is used to capture the results from the test

    -- SELECT statements. This avoids any delays in sending the

    -- rows to the client from skewing the results

    --

    CREATE TABLE #BitBucket

    (

    [LocID] INTEGER NOT NULL,

    [Date] DATETIME NOT NULL,

    [Hour] TINYINT NOT NULL,

    [Rainfall] DECIMAL(10,8) NOT NULL,

    );

    GO

    --

    -- TEST START

    --

    RAISERROR('Starting test', 0, 1) WITH NOWAIT;

    --

    -- Remove all ad-hoc SQL query plans from cache

    --

    DBCC FREESYSTEMCACHE (N'SQL Plans');

    GO

    --

    -- Start with a cold data cache

    -- Each implementation is run 10 times

    -- One time with a cold cache, then nine times with a warm cache

    -- This seems more representative to me, feel free to change it

    --

    DBCC DROPCLEANBUFFERS;

    GO

    /* SQL */

    INSERT #BitBucket WITH (TABLOCKX)

    SELECT TP.[LocID],

    TP.[Date],

    TP.[Hour],

    TP.[Rainfall]

    FROM dbo.Rainfall_Con TP WITH (READUNCOMMITTED)

    WHERE TP.[Date] = '20080106'

    AND TP.[Hour] = 15

    ORDER BY

    LocID ASC;

    GO 10

    --

    -- For fairness, recreate the dump table for the second run

    -- TRUNCATE TABLE might give the second run an unfair advantage

    -- since mixed and uniform extent allocations would not need

    -- to be re-done

    --

    DROP TABLE #BitBucket;

    GO

    CREATE TABLE #BitBucket

    (

    [LocID] INTEGER NOT NULL,

    [Date] DATETIME NOT NULL,

    [Hour] TINYINT NOT NULL,

    [Rainfall] DECIMAL(10,8) NOT NULL,

    );

    GO

    --

    -- Second run - XML

    --

    DBCC DROPCLEANBUFFERS;

    GO

    /* XML */

    INSERT #BitBucket WITH (TABLOCKX)

    SELECT TP.LocID,

    TP.[Date],

    [Hour] = Rainfall.value('(/RT/RN[@H=15]/@H)[1]','tinyint'),

    Rainfall_Con = Rainfall.value('(/RT/RN[@H=15]/@V)[1]','numeric(10,8)')

    FROM dbo.Rainfall_XML TP WITH (READUNCOMMITTED)

    WHERE TP.[Date] = '20080106'

    ORDER BY

    LocID;

    GO 10

    --

    -- Show the total execution resource usage for all ten runs

    --

    SELECT query = LEFT(QT.[text], 9),

    execution_plan = QP.query_plan,

    run_count = QS.execution_count,

    total_cpu_time_µs = QS.total_worker_time,

    total_logical_reads = QS.total_logical_reads,

    total_elapsed_time_µs = QS.total_elapsed_time,

    avg_cpu_time_µs = QS.total_worker_time / QS.execution_count,

    avg_logical_reads = QS.total_logical_reads / QS.execution_count,

    avg_elapsed_time_µs = QS.total_elapsed_time / QS.execution_count

    FROM sys.dm_exec_query_stats QS

    CROSS

    APPLY sys.dm_exec_sql_text (QS.[sql_handle]) QT

    CROSS

    APPLY sys.dm_exec_query_plan (QS.[plan_handle]) QP

    WHERE QT.[text] LIKE '%Rainfall%'

    AND QT.[text] NOT LIKE '%dm_exec_query_stats%'

    ORDER BY

    QS.last_execution_time ASC;

    GO

    --

    -- Sample results:

    --

    -- SQL Averages: CPU 683µs; Elapsed time: 683µs; Logical reads: 114

    -- XML Averages: CPU 80078µs; Elapsed time: 83691µs; Logical reads: 126

    --

    -- Tidy up

    --

    GO

    IF OBJECT_ID(N'dbo.Rainfall_Con', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_Con

    IF OBJECT_ID(N'dbo.Rainfall_XML', N'U') IS NOT NULL DROP TABLE Rainfall_XML;

    IF OBJECT_ID(N'tempdb..#BitBucket', N'U') IS NOT NULL DROP TABLE #BitBucket;

    GO

    RAISERROR('Test run complete', 0, 1) WITH NOWAIT;

    --

    -- END SCRIPT

    --

  • Nice article, I learned something about XML in SQL Server today.

    As Paul pointed out, it's impossible that XML storage needs less space than conventional storage.

    With appropriate indexing, there's no way that XML works faster than a relational table.

    I suggest that you add some more comments to your article to avoid misleading somebody on this topic.

    Nice idea, anyway.

    Gianluca

    -- Gianluca Sartori

  • If you make your primary key in both tables CLUSTERED (which would be a reasonable choice in this case given that some of the queries are ORDERing/GROUPing by locid), the storage requirements are somewhat different... XML now uses more storage.

    name rows reserved data index_size unused

    Rainfall_XML 36600 36744 KB 36600 KB 112 KB 32 KB

    name rows reserved data index_size unused

    Rainfall_Con 878400 27080 KB 26928 KB 96 KB 56 KB

  • Where I've found the XML data type useful has been in situations where you can't control schema changes to a relation (table) and want to record all data from it even if columns are added, modified or removed.

    Logging triggers, or archiving processes in these cases can take advantage of "select * from table for xml" and insert into an XML column in a log/archive/audit table.

    For semi-structured data that has variable schema elements, XML is pretty ideal. Beyond that, relational data structures are better.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Another thing to point out about XML's use of space is that, barring compression, the data is being stored as characters wrapped in tags, making storage of integer, numeric, or floating point data LESS efficient.

    Datatype SQL Length XML XML Length

    --------- ----------- ---- ------------

    TinyInt 1 Byte 255 10

    SmallInt 2 Bytes 32767 12

    Int 4 Bytes 2147483647 17

    FLOAT 8 Bytes -2.23.40E-308 20

    In the author's example, he chose numeric(10,8) to store the rainfall amount, which would require only 9 bytes for SQL to store, but 18 for XML to store (90.12345678).

    The author also chose to impose an [hour] column on the conventional table, which was unnecessary, because it could have been included as part of the [date] column.

    Finally, he could have chosen to have a separate column for each hour in his schema for the conventional table, rather than making it one row per hour. (That wouldn't have been my first choice, because I feel it lacks flexibility in manipulating the data. But it is essentially the approach that was taken with the table that used XML.)

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • It's an interesting look from a developer point of view. Compared to the power and speed of the relational engine, I think space savings are a minimal thing to worry about. Disk prices fall at a rate far beyond that of the improvements in CPU and memory when you compare price/performance.

    I agree with many of the comments, that a well designed relational system is probably better, but this is an interesting solution and idea. If disk space matters a lot, perhaps it's worth considering, but I would recommend that you investigate some of the other suggestions first.

  • I think the merit in this article lies more in the techniques that are covered, rather than in the premise that it ultimately failed to prove.

    The article does show a couple of good xquery examples in context - how to select a value from a node and incorporate it in a resultset, and how to modify the value of a node.

    SELECT Rainfall.value('(/RT/RN[@H=15]/@H)[1]','tinyint') AS Hour

    ,Rainfall.value('(/RT/RN[@H=15]/@V)[1]','numeric(10,8)') AS Rainfall

    UPDATE TP SET Rainfall.modify('replace value of (/RT/RN[@H="20"]/@V)[1] with "8.12345678" ')

    All in all, not a bad article, well written and clear explanations, even if in some respects fundamentally flawed!

  • Paul White (8/4/2009)


    This script is closely based on that in the article.

    I would strongly encourage the author of this article to try it out 🙂

    Nice idea for an article, but size and speed are not XML strengths! 😉

    Paul

    MythBuster extraordinaire... that script goes into my "Pork Chop Hall of Fame". Nice job, Paul.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Steve Jones - Editor (8/4/2009)


    It's an interesting look from a developer point of view. Compared to the power and speed of the relational engine, I think space savings are a minimal thing to worry about.

    Space savings are certainly important to me especially when it comes to backups and restores but according to Paul's script... what space savings?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "Dear Lord... I'm a DBA so please give me patience because, if you give me strength, I'm going to need bail money too!"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Didn't test it, and perhaps there are none. In any case, I think it's been a nice debate/discussion of the technique.

Viewing 15 posts - 1 through 15 (of 38 total)

You must be logged in to reply to this topic. Login to reply