Gain Space Using XML data type

Question

Post reply

Gain Space Using XML data type

Sunny-138471

Ten Centuries

Points: 1162
More actions
July 23, 2009 at 12:14 am

#385037

Comments posted to this topic are about the item Gain Space Using XML data type

Viewing 15 posts - 1 through 15 (of 38 total)

You must be logged in to reply to this topic. Login to reply

David McKinney SSChampion Points: 10472 More actions · Answer 1

Hi,

I notice you're using very short tag names in your xml presumably to reduce space used.

If you use a typed xml column i.e. associate it with an xsd, the length of the tags shouldn't matter, and you may get other associated benefits. When I've some time I'll do a comparison.

Regards,

David McKinney.

hek Valued Member Points: 65 More actions · Answer 2

Hi,

You can reduce the space similarly without XML. You can use the table instead of the XML column.

CREATE TABLE Rainfall_LocDate

(LocDateId int IDENTITY(1, 1)

,LocID int

,Date datetime

,CONSTRAINT PK_Rainfall_LocDate PRIMARY KEY NONCLUSTERED(LocID,Date)

)

CREATE TABLE Rainfall_Data

(LocDateId int

,Hour tinyint

,Rainfall numeric(10,8)

,CONSTRAINT PK_Rainfall_Data PRIMARY KEY NONCLUSTERED(LocDId,Hour)

)

Regards,

Pavel Krticka

gkrs Old Hand Points: 320 More actions · Answer 3

The space savings are impressive and it does lend itself to a number of real-world situations but you also need to highlight the shortcomings of the XML data type as shown in SQL books online

http://msdn.microsoft.com/en-us/library/ms189887.aspx

Limitations of the xml Data Type

Note the following general limitations that apply to the xml data type:

The stored representation of xml data type instances cannot exceed 2 GB.

It cannot be used as a subtype of a sql_variant instance.

It does not support casting or converting to either text or ntext. Use varchar(max) or nvarchar(max) instead.

It cannot be compared or sorted. This means an xml data type cannot be used in a GROUP BY statement.

It cannot be used as a parameter to any scalar, built-in functions other than ISNULL, COALESCE, and DATALENGTH.

It cannot be used as a key column in an index. However, it can be included as data in a clustered index or explicitly added to a nonclustered index by using the INCLUDE keyword when the nonclustered index is created

Pekka Heimonen SSC Journeyman Points: 84 More actions · Answer 4

A very interesting article indeed!

However, I think that you shouldn't compare query performance without any index in the conventional table.

If you add an index (code below) and rerun your query, your query performance will be about 1000 times better. (Yet, the space requirement will also increase remarkably.)

Rgds,

Pexi

CREATE NONCLUSTERED INDEX [IX_Rainfall_Con_Date_Hour] ON [Rainfall_Con]

(

[Date] ASC,

[Hour] ASC,

[LocID] ASC,

[Rainfall] ASC

)WITH (

PAD_INDEX = OFF,

STATISTICS_NORECOMPUTE = OFF,

SORT_IN_TEMPDB = OFF,

IGNORE_DUP_KEY = OFF,

DROP_EXISTING = OFF,

ONLINE = OFF,

ALLOW_ROW_LOCKS = ON,

ALLOW_PAGE_LOCKS = ON)

ON [PRIMARY]

Paul White SSC Guru Points: 150467 More actions · Answer 5

This script is closely based on that in the article.

I would strongly encourage the author of this article to try it out 🙂

Nice idea for an article, but size and speed are not XML strengths! 😉

Paul

-- *****************************************************************************************

-- WARNING: This script contains commands to clear the server-wide data and procedure caches

-- Do not run on anything other than a TEST instance!

--

-- Please ensure Actual Execution Plan is OFF for representative results

-- Total test time is 10 to 60 seconds (including test data creation), depending on hardware

-- *****************************************************************************************

USE tempdb;

GO

--

-- Drop the test tables if they exist

--

IF OBJECT_ID(N'dbo.Rainfall_Con', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_Con

IF OBJECT_ID(N'dbo.Rainfall_XML', N'U') IS NOT NULL DROP TABLE Rainfall_XML;

IF OBJECT_ID(N'tempdb..#BitBucket', N'U') IS NOT NULL DROP TABLE #BitBucket;

GO

--

-- Create the conventional table

--

CREATE TABLE dbo.Rainfall_Con

(

[LocID] INTEGER NOT NULL,

[Date] DATETIME NOT NULL,

[Hour] TINYINT NOT NULL,

[Rainfall] DECIMAL(10,8) NOT NULL,

);

GO

--

-- Populate the conventional table with one year's data for 100 locations

-- This should complete in less than 30 seconds

--

DECLARE @Rows BIGINT;

SET @Rows = 100 * 365 * 24; -- 24 hours' data for 365 days for 100 locations

RAISERROR('Creating test data...', 0, 1) WITH NOWAIT;

;WITH Numbers (N)

AS (

-- Dynamically create rows numbered 0 to 875999

SELECT TOP (@Rows)

ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1

FROM master.sys.all_columns C1,

master.sys.all_columns C2,

master.sys.all_columns C3

)

INSERT dbo.Rainfall_Con WITH (TABLOCKX)

(

[LocID],

[Date],

[Hour],

[Rainfall]

)

SELECT LocID = (N / 365 / 24) + 1,

[Date] = DATEADD(DAY, (N / 24) % 366 , '2008-01-01 00:00:00.000'),

[Hour] = N % 24,

[Rainfall] = RAND(CHECKSUM(NEWID())) * 10 -- A new random value for each row

FROM Numbers;

GO

--

-- Now create the primary key

-- Doing this *after* inserting the data results in a compact and contiguous index

-- If the index were to exist during the insert, index pages may split resulting in

-- a lower data density, and significant fragmentation

--

RAISERROR('Creating index...', 0, 1) WITH NOWAIT;

ALTER TABLE dbo.Rainfall_Con

ADD CONSTRAINT PK_Rainfall_Con

PRIMARY KEY CLUSTERED([Date], [Hour], [LocID])

WITH (FILLFACTOR = 100);

GO

--

-- Create the XML table

--

CREATE TABLE dbo.Rainfall_XML

(

[LocID] INTEGER NOT NULL,

[Date] DATETIME NOT NULL,

[Rainfall] XML NOT NULL,

);

GO

--

-- Populate the XML table from the conventional table

--

RAISERROR('Creating XML data...', 0, 1) WITH NOWAIT;

INSERT Rainfall_XML WITH (TABLOCKX)

SELECT [LocID],

[Date],

(

-- The hourly data as XML

SELECT H = [Hour],

V = [Rainfall]

FROM dbo.Rainfall_Con A WITH (READUNCOMMITTED)

WHERE A.[LocID] = B.[LocID]

AND A.[Date] = B.[Date]

FOR XML RAW('RN'), ROOT('RT')

)

FROM Rainfall_Con B WITH (READUNCOMMITTED)

GROUP BY

[LocID],

[Date];

GO

--

-- Add the primary key now, for the same reasons as before

--

RAISERROR('Creating index...', 0, 1) WITH NOWAIT;

ALTER TABLE dbo.Rainfall_XML

ADD CONSTRAINT PK_Rainfall_XML

PRIMARY KEY CLUSTERED([Date], [LocID])

WITH (FILLFACTOR = 100);

GO

--

-- Show the space used by each table

--

-- Results on my system:

--

-- Conventional : 876,000 rows, 27016KB reserved, 26856KB data, 96KB index, 64KB unused

-- XML : 36,500 rows, 36504KB reserved, 36504KB data, 112KB index, 64KB unused

--

-- ### XML representation is 35% *larger* ###

--

EXECUTE sp_spaceused Rainfall_Con;

EXECUTE sp_spaceused Rainfall_XML;

GO

--

-- This table is used to capture the results from the test

-- SELECT statements. This avoids any delays in sending the

-- rows to the client from skewing the results

--

CREATE TABLE #BitBucket

(

[LocID] INTEGER NOT NULL,

[Date] DATETIME NOT NULL,

[Hour] TINYINT NOT NULL,

[Rainfall] DECIMAL(10,8) NOT NULL,

);

GO

--

-- TEST START

--

RAISERROR('Starting test', 0, 1) WITH NOWAIT;

--

-- Remove all ad-hoc SQL query plans from cache

--

DBCC FREESYSTEMCACHE (N'SQL Plans');

GO

--

-- Start with a cold data cache

-- Each implementation is run 10 times

-- One time with a cold cache, then nine times with a warm cache

-- This seems more representative to me, feel free to change it

--

DBCC DROPCLEANBUFFERS;

GO

/* SQL */

INSERT #BitBucket WITH (TABLOCKX)

SELECT TP.[LocID],

TP.[Date],

TP.[Hour],

TP.[Rainfall]

FROM dbo.Rainfall_Con TP WITH (READUNCOMMITTED)

WHERE TP.[Date] = '20080106'

AND TP.[Hour] = 15

ORDER BY

LocID ASC;

GO 10

--

-- For fairness, recreate the dump table for the second run

-- TRUNCATE TABLE might give the second run an unfair advantage

-- since mixed and uniform extent allocations would not need

-- to be re-done

--

DROP TABLE #BitBucket;

GO

CREATE TABLE #BitBucket

(

[LocID] INTEGER NOT NULL,

[Date] DATETIME NOT NULL,

[Hour] TINYINT NOT NULL,

[Rainfall] DECIMAL(10,8) NOT NULL,

);

GO

--

-- Second run - XML

--

DBCC DROPCLEANBUFFERS;

GO

/* XML */

INSERT #BitBucket WITH (TABLOCKX)

SELECT TP.LocID,

TP.[Date],

[Hour] = Rainfall.value('(/RT/RN[@H=15]/@H)[1]','tinyint'),

Rainfall_Con = Rainfall.value('(/RT/RN[@H=15]/@V)[1]','numeric(10,8)')

FROM dbo.Rainfall_XML TP WITH (READUNCOMMITTED)

WHERE TP.[Date] = '20080106'

ORDER BY

LocID;

GO 10

--

-- Show the total execution resource usage for all ten runs

--

SELECT query = LEFT(QT.[text], 9),

execution_plan = QP.query_plan,

run_count = QS.execution_count,

total_cpu_time_µs = QS.total_worker_time,

total_logical_reads = QS.total_logical_reads,

total_elapsed_time_µs = QS.total_elapsed_time,

avg_cpu_time_µs = QS.total_worker_time / QS.execution_count,

avg_logical_reads = QS.total_logical_reads / QS.execution_count,

avg_elapsed_time_µs = QS.total_elapsed_time / QS.execution_count

FROM sys.dm_exec_query_stats QS

CROSS

APPLY sys.dm_exec_sql_text (QS.[sql_handle]) QT

CROSS

APPLY sys.dm_exec_query_plan (QS.[plan_handle]) QP

WHERE QT.[text] LIKE '%Rainfall%'

AND QT.[text] NOT LIKE '%dm_exec_query_stats%'

ORDER BY

QS.last_execution_time ASC;

GO

--

-- Sample results:

--

-- SQL Averages: CPU 683µs; Elapsed time: 683µs; Logical reads: 114

-- XML Averages: CPU 80078µs; Elapsed time: 83691µs; Logical reads: 126

--

-- Tidy up

--

GO

IF OBJECT_ID(N'dbo.Rainfall_Con', N'U') IS NOT NULL DROP TABLE dbo.Rainfall_Con

IF OBJECT_ID(N'dbo.Rainfall_XML', N'U') IS NOT NULL DROP TABLE Rainfall_XML;

IF OBJECT_ID(N'tempdb..#BitBucket', N'U') IS NOT NULL DROP TABLE #BitBucket;

GO

RAISERROR('Test run complete', 0, 1) WITH NOWAIT;

--

-- END SCRIPT

--

Paul White
All articles available on SQL.kiwi
@SQL_Kiwi

spaghettidba SSC Guru Points: 105731 More actions · Answer 6

Nice article, I learned something about XML in SQL Server today.

As Paul pointed out, it's impossible that XML storage needs less space than conventional storage.

With appropriate indexing, there's no way that XML works faster than a relational table.

I suggest that you add some more comments to your article to avoid misleading somebody on this topic.

Nice idea, anyway.

Gianluca

-- Gianluca Sartori

Ian Scarlett SSC-Insane Points: 23195 More actions · Answer 7

If you make your primary key in both tables CLUSTERED (which would be a reasonable choice in this case given that some of the queries are ORDERing/GROUPing by locid), the storage requirements are somewhat different... XML now uses more storage.

name rows reserved data index_sizeunused

Rainfall_XML 36600 36744 KB 36600 KB 112 KB32 KB

name rows reserved data index_sizeunused

Rainfall_Con 878400 27080 KB 26928 KB 96 KB56 KB

GSquared SSC Guru Points: 260824 More actions · Answer 8

Where I've found the XML data type useful has been in situations where you can't control schema changes to a relation (table) and want to record all data from it even if columns are added, modified or removed.

Logging triggers, or archiving processes in these cases can take advantage of "select * from table for xml" and insert into an XML column in a log/archive/audit table.

For semi-structured data that has variable schema elements, XML is pretty ideal. Beyond that, relational data structures are better.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

The Dixie Flatline SSC Guru Points: 53353 More actions · Answer 9

Another thing to point out about XML's use of space is that, barring compression, the data is being stored as characters wrapped in tags, making storage of integer, numeric, or floating point data LESS efficient.

Datatype SQL Length XML XML Length

--------- ----------- ---- ------------

TinyInt 1 Byte 255 10

SmallInt 2 Bytes 32767 12

Int 4 Bytes 2147483647 17

FLOAT 8 Bytes -2.23.40E-308 20

In the author's example, he chose numeric(10,8) to store the rainfall amount, which would require only 9 bytes for SQL to store, but 18 for XML to store (90.12345678).

The author also chose to impose an [hour] column on the conventional table, which was unnecessary, because it could have been included as part of the [date] column.

Finally, he could have chosen to have a separate column for each hour in his schema for the conventional table, rather than making it one row per hour. (That wouldn't have been my first choice, because I feel it lacks flexibility in manipulating the data. But it is essentially the approach that was taken with the table that used XML.)

__________________________________________________

Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Steve Jones - SSC Editor SSC Guru Points: 742834 More actions · Answer 10

It's an interesting look from a developer point of view. Compared to the power and speed of the relational engine, I think space savings are a minimal thing to worry about. Disk prices fall at a rate far beyond that of the improvements in CPU and memory when you compare price/performance.

I agree with many of the comments, that a well designed relational system is probably better, but this is an interesting solution and idea. If disk space matters a lot, perhaps it's worth considering, but I would recommend that you investigate some of the other suggestions first.

David McKinney SSChampion Points: 10472 More actions · Answer 11

I think the merit in this article lies more in the techniques that are covered, rather than in the premise that it ultimately failed to prove.

The article does show a couple of good xquery examples in context - how to select a value from a node and incorporate it in a resultset, and how to modify the value of a node.

SELECT Rainfall.value('(/RT/RN[@H=15]/@H)[1]','tinyint') AS Hour

,Rainfall.value('(/RT/RN[@H=15]/@V)[1]','numeric(10,8)') AS Rainfall

UPDATE TP SET Rainfall.modify('replace value of (/RT/RN[@H="20"]/@V)[1] with "8.12345678" ')

All in all, not a bad article, well written and clear explanations, even if in some respects fundamentally flawed!

Jeff Moden SSC Guru Points: 1004749 More actions · Answer 12

Paul White (8/4/2009)
This script is closely based on that in the article.
I would strongly encourage the author of this article to try it out 🙂
Nice idea for an article, but size and speed are not XML strengths! 😉
Paul

MythBuster extraordinaire... that script goes into my "Pork Chop Hall of Fame". Nice job, Paul.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004749 More actions · Answer 13

Steve Jones - Editor (8/4/2009)
It's an interesting look from a developer point of view. Compared to the power and speed of the relational engine, I think space savings are a minimal thing to worry about.

Space savings are certainly important to me especially when it comes to backups and restores but according to Paul's script... what space savings?

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Steve Jones - SSC Editor SSC Guru Points: 742834 More actions · Answer 14

Didn't test it, and perhaps there are none. In any case, I think it's been a nice debate/discussion of the technique.