Automatic XML

  • On the topic of performance on big xml files I can add some more tips.

- Never use "back-references" (i.e. the parent axis ".." and the like); these will bog everything down to a complete standstill once more than a "few" elements are involved. Instead, use "cross apply r.r.nodes('subnode') s(s)" to traverse the levels of your document while still being able to address values from both levels through their declared aliases (in my example, r.r and s.s).
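For contrast, here is a minimal sketch of the pattern to avoid (document and element names are made up for illustration): shredding at the child level and then climbing back up with the parent axis for every row.

    declare @xml xml;
    set @xml = N'<root><el1>a</el1><subnode><el3>b</el3></subnode></root>';

    -- Anti-pattern: shred at the child level, then reach back up with ".."
    -- for each row. This parent-axis lookup is what kills performance on
    -- large documents.
    select s.s.value('(../el1)[1]', 'varchar(30)') as el1,
           s.s.value('(el3)[1]',    'varchar(30)') as el3
    from @xml.nodes('root/subnode') s(s);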

- Use an additional cross apply to retrieve values from the elements with the value() function. In other words, unless you're dealing with the simplest kind of xml (a flat list of elements), always apply this pattern:

    declare @xml xml;

    select @xml = N'
    <root>
      <el1>el1</el1>
      <el2>el2</el2>
      <subnode>
        <el3>el3</el3>
        <el4>el4</el4>
      </subnode>
    </root>';

    select xr.el1,
           xr.el2,
           xs.el3,
           xs.el4
    from @xml.nodes('root') r(r)
    cross apply (
        select r.r.value('el1[1]', 'varchar(30)') as el1,
               r.r.value('el2[1]', 'varchar(30)') as el2
    ) xr
    cross apply r.r.nodes('subnode') s(s)
    cross apply (
        select s.s.value('el3[1]', 'varchar(30)') as el3,
               s.s.value('el4[1]', 'varchar(30)') as el4
    ) xs;

    This is from experience only; I haven't found any reference explaining it yet. My guess is that it has something to do with the optimizer being able to determine a type for the columns early, so that it can create a more optimal plan.

    - When using openrowset to read an xml document: read it into an xml variable first, then query the contents of that variable. Don't put openrowset in the query itself; its output is far too dynamic for the optimizer to build any useful plan from it.
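A hedged sketch of that pattern (the file path and element names are illustrative):

    declare @xml xml;

    -- Load the whole file into a variable first...
    select @xml = x.BulkColumn
    from openrowset(bulk 'C:\feeds\somefile.xml', single_blob) as x;

    -- ...then query the variable, so the optimizer sees a plain xml value.
    select r.r.value('(el1)[1]', 'varchar(30)') as el1
    from @xml.nodes('root') r(r);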

    - When constructing xml, use the ", type" directive on the "for xml" clause, or the engine will generate text output that has to be re-parsed into xml for the rest of the query.
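For instance, a nested "for xml" built against the catalog views (a sketch, not from the thread); the ", type" on the inner subquery keeps its result as xml instead of text:

    select o.name as "@name",
           (select c.name as "@name"
            from sys.columns c
            where c.object_id = o.object_id
            for xml path('column'), type)  -- ", type": stays xml, no re-parse
    from sys.objects o
    where o.type = 'U'
    for xml path('table'), root('tables');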

    - When possible, use "typed xml", i.e. give the engine the format of the documents it can expect. It will make good use of this extra knowledge to further optimize your queries. However, development and roll-out become very annoying when the xsd changes (I have never yet had one that was right from the start). It is like having to develop all of your views, procedures, functions and triggers with option schemabinding and then finding you need to change your tables, but worse, because now even tables and user-defined types can reference the xsd, so you need to drop virtually your entire database before you can apply any change to it...
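A minimal sketch of typed xml (the collection and element names are made up for illustration):

    -- Register the xsd once as a schema collection...
    create xml schema collection dbo.DemoSchema as N'
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="root">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="el1" type="xs:string"/>
            <xs:element name="el2" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>';
    go

    -- ...then the engine validates the document and knows its shape up front.
    declare @typed xml(dbo.DemoSchema);
    set @typed = N'<root><el1>a</el1><el2>b</el2></root>';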



    Posting Data Etiquette - Jeff Moden
    Posting Performance Based Questions - Gail Shaw
    Hidden RBAR - Jeff Moden
    Cross Tabs and Pivots - Jeff Moden
    Catch-all queries - Gail Shaw


    If you don't have time to do it right, when will you have time to do it over?

  • Ah... lucky me! The requirements for my task have eased up a bit. It turns out that the "data vendors" CAN provide XSD files so I'm definitely "out of the woods" there.

    I'd still like to know if there's a way to take ANY XML file and automatically return it as a denormalized table without any prior knowledge of the XML schema. I'll try to cough up a couple of examples of what I mean when I get home tonight.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Maybe one of Jacob Sebastian's blogs can help.

    However, I have no idea what the performance counters will display...

    But I have not seen any equivalent XQuery solution yet.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question
    How to post performance related questions
    Links for Tally Table, Cross Tabs, Dynamic Cross Tabs and Delimited Split Function

Also see here for a remedy to the encoding issue: http://weblogs.sqlteam.com/peterl/archive/2009/03/05/Extract-XML-structure-automatically.aspx

    And here is how you can filter out certain nodes

    http://weblogs.sqlteam.com/peterl/archive/2009/06/04/Extract-XML-structure-automatically-part-2.aspx


    N 56°04'39.16"
    E 12°55'05.25"

  • LutzM (8/24/2011)


    Maybe one of Jacob Sebastian's blogs can help.

    However, I have no idea what the performance counters will display...

    But I have not seen any equivalent XQuery solution yet.

    I'd already worked out something similar and was headed to the "next step". I do appreciate the link, though.

    --Jeff Moden



  • Thanks, Peter. I'll have a look.

    --Jeff Moden



  • As a side bar... I'm truly disappointed in XML in general. I thought the whole purpose of transmitting tags was so that you'd automatically "know" what's in the file. I realize I'm just getting into it but I currently see no way to identify columnar datatypes nor easily do what I asked. Worse yet, no real space is saved over a denormalized delimited file and the hierarchical nature of XML data is a real PITA compared to denormalized delimited files if you want to do something outside the proverbial box.

    Amen to that, Jeff. Much of what I see implemented as XML data is really just a flat file. A spec must still be developed and agreed upon by the humans trying to set up each communication. There's got to be a better way!

    --SJT--

  • SJTerrill (8/24/2011)


    As a side bar... I'm truly disappointed in XML in general. I thought the whole purpose of transmitting tags was so that you'd automatically "know" what's in the file. I realize I'm just getting into it but I currently see no way to identify columnar datatypes nor easily do what I asked. Worse yet, no real space is saved over a denormalized delimited file and the hierarchical nature of XML data is a real PITA compared to denormalized delimited files if you want to do something outside the proverbial box.

    Amen to that, Jeff. Much of what I see implemented as XML data is really just a flat file. A spec must still be developed and agreed upon by the humans trying to set up each communication. There's got to be a better way!

    --SJT--

    Schema files ARE the way to set up such a contract: they provide the means to describe types, lengths, allowed values, required elements, cardinality, even function and dev notes. XML tends to get bloated very fast, because we "think" we can get away with not passing or establishing what the contract is.

    You're also looking at contradictory requirements: you can't have something self-describing, with perfect understanding of what it is representing, and still be unhappy that the extra metadata takes up space. At best you can do what Jeff is doing now, i.e. discovering new data elements that you didn't know about before. That said - what exactly can you do with the new elements once you FIND them? I'd be very nervous about a self-discovery system which then also automagically claims to understand HOW to use the data it just found, having had to spend months untangling problems around "self-describing schemas" gone awry.

    With a solid schema - I could do away with the ludicrous element names or descriptions, do away with on-the-fly data attributes informing me of what data type I might find in the element being passed, etc... and make my files a lot smaller and more efficient. In short - all of those bad habits you see, which we all fight against, don't get any better when you try to use XML. An EAV model represented in XML is still something that will suck the life out of your system sooner or later, unless you keep it for storing new elements on a VERY temporary basis.

    Finally - SQL Server's implementation of the XML standard is very small, compared to some of the really nice features out there. Most of the advanced FLWOR features are just flat missing. Some of the self-discovery mechanisms you need would be best done outside of SQL (the old right tool for the right purpose thing).

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt, I absolutely agree with what you just said about setting up the "contract". You really DO need to know exactly what the data and columns are for and what the data type for each is.

    That being said, that's precisely why I think that XML isn't all that it's cracked up to be. Just as if you were sending, say, a CHAR(31) delimited file, you'd need to know what the name of each column was and what its data type was. That would probably require a separate file, just as XML needs an XSD file to really do it right.

    The advantage of using a delimited file or even a fixed field (binary preferred, of course) is that there are huge savings to be had in the number of bytes in the file which affects transmission speed, storage, processing time, etc, etc.

    I agree with what you said but, at this point in my discovery of XML, I still don't see it (or anything else) as an advantage that XML has over anything except, maybe, Cuneiform Tablets. 😀

    Finally - SQL Server's implementation of the XML standard is very small, compared to some of the really nice features out there.

    Like I said... being a definite newbie on the subject, I've not actually worked with the XML standard. Could you briefly describe some of those "really nice features" you speak of because, right now, I wish I never heard of XML. 😛

    --Jeff Moden



  • Well - some of the nice features of interest to a data guy like yourself:

    - strong-typing facilities using schemas and namespacing. Not only do I get the data types with length, etc., but also the concept of enumerations (i.e. "lookup table"-type constraints), cardinality (I should have at least n... of this kind attached to this other kind), and enforced relations (look at the ID...IDREF constructs). These are baked into SQL's rendition at least to some degree, but what is not is advanced pattern-matching. As an example - we do a vehicle feed, and pre-validate VIN, address, phone, SSN using standalone schema tools without having to piss off our rather busy DB servers.
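As a sketch of what such a schema can express (all names hypothetical), here is an XSD fragment with an enumeration, a pattern for a 17-character VIN, and a cardinality constraint:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <!-- enumeration: a "lookup table" style constraint -->
      <xs:simpleType name="StateCode">
        <xs:restriction base="xs:string">
          <xs:enumeration value="MI"/>
          <xs:enumeration value="OH"/>
        </xs:restriction>
      </xs:simpleType>
      <!-- pattern: a 17-character VIN (no I, O, or Q) -->
      <xs:simpleType name="Vin">
        <xs:restriction base="xs:string">
          <xs:pattern value="[A-HJ-NPR-Z0-9]{17}"/>
        </xs:restriction>
      </xs:simpleType>
      <xs:element name="Vehicles">
        <xs:complexType>
          <xs:sequence>
            <!-- cardinality: at least one Vehicle required -->
            <xs:element name="Vehicle" minOccurs="1" maxOccurs="unbounded">
              <xs:complexType>
                <xs:sequence>
                  <xs:element name="Vin" type="Vin"/>
                  <xs:element name="State" type="StateCode"/>
                </xs:sequence>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>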

    - the XML toolset we use allows for easy reverse AND forward engineering of the data model. So - I could use a simple XML file like the one you were discussing to a. create a candidate schema, and b. create a table structure that accommodates that data. No custom scripting, just point and click (or, if I need to automate, open the SDK and call a few functions).

    - the ability to use XML as BOTH a "set" (orderless) and a "sequence" (ordered). This allows optimized transform engines to do things I can easily do in SQL (like aggregate functions, concatenation, etc...) but also other things like "previous" and "next" searches. position() (I am the nth node of this kind in the file) is also handy when the physical order in the file is significant.

    - portable transform logic, which is modular and performance-optimized. Given the types of data I play with, I can clean up, reformat and validate the data before I hand it to SQL, while reusing a lot of the constructs I already have enforced in my data model. I can also hand the SAME transforms to the Java and .NET teams, so everyone plays from the same sheet of music. Also - hooking that logic up to the right engine can be screamingly fast.

    In short - while the format itself might not be all that superior to anything else, the instrumentation around it is, especially when I get to use industry-standard rich schemas (where a fair amount of the heavy lifting has already been done for me and we only need to put our own "spin" on the standard). The set of standards allows us to keep a lot of our SSIS jobs simple and clean (i.e. screaming), since a lot of that ugly handling is already done long before it ever hits, while still keeping to the old "trust but validate" approach: check my inputs, ensure the data is valid, and THEN let loose and fire it in as fast as you can.


  • Very cool. Thanks for taking the time to write that up, Matt. It keeps me interested in learning more about XML.

    --Jeff Moden


