The use of individual columns (type bit) to indicate the sub-type (Person or Organisation) is really awful. It does not cater for sub-sub-types (e.g. different types of Person). It is not extensible without adding a new column (type bit) for each new sub-type. It requires table constraints to guarantee that each row has exactly one type flag set.
I suggest that the type column be a computed column using an expression involving the schema of the xml (ideally) or the elements of the xml (if no schema). The standard help pages for xml have examples of doing this.
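To sketch what that might look like (table and column names here are hypothetical, and note that SQL Server won't allow xml data type methods directly in a computed-column definition, so a scalar UDF has to wrap the .exist() call):

```sql
-- Hypothetical dbo.Party table with an xml column PartyDoc.
-- SQL Server requires a scalar UDF wrapper to use xml methods
-- in a computed column.
CREATE FUNCTION dbo.GetPartyType (@doc xml)
RETURNS varchar(20)
WITH SCHEMABINDING
AS
BEGIN
    RETURN CASE
        WHEN @doc.exist('/Person') = 1       THEN 'Person'
        WHEN @doc.exist('/Organisation') = 1 THEN 'Organisation'
    END;
END;
GO

-- The type is now derived from the document itself, so no bit
-- flags and no cross-column constraints are needed.
ALTER TABLE dbo.Party
    ADD PartyType AS dbo.GetPartyType(PartyDoc);
```

New sub-types then only require extending the CASE expression, not adding columns.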
Simon, interesting article.
Our client has used XML in certain areas of designs, with varying degrees of success. For static data it can be relatively effective but, as you mention, as soon as data starts getting updated, performance goes to a dark, smelly place. I would note that you mention XML usage seems alright for smaller tables/areas - this does assume that the DB design has decent, accurate volumetrics, or else a robust archival strategy that ensures that what is today a small/medium sized table will still be one in a year/2 years etc. Failing that, the 'acceptable performance with XML' of today could become tomorrow's performance pain.
In playing around with XML, have you done any work using XML schemas to enforce some structure on the XML? I heard a concerning argument for XML from a DB developer, of all people. The gist of it was: 'by using XML to describe the entity, a middle-tier programmer can easily enhance the structure without having to battle to get it changed in the database schema, then get changes in the data access layer. This will mean we can develop a lot faster, not be constrained by getting difficult-to-resource DB developers, and be more agile'. Yeah - and also mean that suddenly we find an enterprising middle-tier developer has felt the need to add OrderNumber to the Client XML column so he doesn't need to use the joins to the Order table structure - and before we know it, XML bloat and completely 'hidden' data that no one understands, because the schema defining it is 'somewhere in the middle tier'.
So, if I have to have XML in the database, I'd like to try and use schemas to make sure that the data in the XML is still properly designed, defined and understood. Your thoughts/experiences?
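For what it's worth, SQL Server can enforce exactly this with a typed xml column bound to a schema collection. A minimal sketch (the Client table, column and element names are made up for illustration):

```sql
-- Hypothetical schema for the Client document. Once the column is
-- typed, the server rejects any document that doesn't conform -
-- including an enterprising extra <OrderNumber> element.
CREATE XML SCHEMA COLLECTION dbo.ClientSchema AS N'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Client">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name"   type="xs:string"/>
        <xs:element name="Region" type="xs:string" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>';
GO

-- Bind the existing column to the schema collection.
ALTER TABLE dbo.Client
    ALTER COLUMN ClientDoc xml(dbo.ClientSchema);
```

The trade-off is that schema changes now go through the DBA again, which is rather the point.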
By and large I've stayed away from XML in the database for all the reasons that most people here can list. I have found that XML can play a part in feeding data to the system. We've found that, depending on the number of columns and their size, using XML to send in larger amounts of data works better than multiple stored procedure calls. So, for example, we want to insert into table X. We can write a traditional insert proc and then have the developers call it. For one row this works flawlessly. It also works great for 3, 5, 10 and even 30 rows. But past a certain point - and that point varies, so testing is required - the number of calls gets nasty. So instead, we pump the data into XML and then use OPENXML (with all its memory bloat) to shred the data into a table with a single insert statement, and it performs better than the multiple calls from the middle tier or client.
The only thing I'd say about implementing this elsewhere is testing, testing, testing. Verify the results in your environment before you bet the farm on any given solution.
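The pattern described above looks roughly like this (table, proc and column names are illustrative):

```sql
-- One round trip carries many rows; OPENXML shreds them into a set
-- for a single INSERT...SELECT.
CREATE PROCEDURE dbo.InsertRowsFromXml
    @doc nvarchar(max)
AS
BEGIN
    DECLARE @h int;
    EXEC sp_xml_preparedocument @h OUTPUT, @doc;

    INSERT INTO dbo.X (Col1, Col2)
    SELECT Col1, Col2
    FROM OPENXML(@h, '/Rows/Row', 1)        -- 1 = attribute-centric mapping
         WITH (Col1 int         '@Col1',
               Col2 varchar(50) '@Col2');

    -- Release the parsed document; this is where the memory bloat
    -- lives if you forget it.
    EXEC sp_xml_removedocument @h;
END;
```

So `<Rows><Row Col1="1" Col2="a"/><Row Col1="2" Col2="b"/></Rows>` becomes two rows in one statement.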
Interested to know how you handle accessing the data after you've stored it. Fetching the entire XML = far more IO, memory and network usage than might be required, if only a small section of data, like who completed the document and when, needs to be extracted. If you are querying the XML data, how are you seeing performance? It's been pretty poor in our high-volume examples, especially if you need to use nodes and elements as part of your criteria, so I'm interested to know how you've handled that - I'm sure I can learn from it...
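The sort of partial extraction I mean, sketched with made-up table and element names, is pulling just a couple of values out with .value() rather than shipping the whole document back:

```sql
-- Pull only the two values needed instead of the full document.
-- Using .exist() in the WHERE clause is exactly the kind of
-- node-based criterion that has performed poorly for us at volume.
SELECT DocId,
       DocXml.value('(/Document/CompletedBy)[1]', 'varchar(50)') AS CompletedBy,
       DocXml.value('(/Document/CompletedOn)[1]', 'datetime')    AS CompletedOn
FROM dbo.Documents
WHERE DocXml.exist('/Document[@Status = "Complete"]') = 1;
```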
If that meets your user requirements, then that's what matters. How many users, and how many documents, do you have, if I might ask?
Our XML columns, where used - and especially where we had issues (we 'normalized out' the temporal data, which made a huge difference) - were on a table that had 2 million rows, and were used in real time by an online retail site that takes, at peak, around 5000 orders an hour - or just under 1.5 orders per second. On average, the basket (which was the XML) was being accessed about 20 times per order, so you can see that during peak trade our XML column was being used, and potentially changed, in the order of 100,000 times an hour, so the best possible performance is kind of critical :-)