Need direction with DB table structure, denormalize?, and using SSIS and XML

  • Not really a "newbie" here as I have been a programmer for 20+ years, have used SSAS, SSIS and created normalized DB and tables. But here is my new challenge and project. We have an outside agency that captures clients, their assessments, and that assessment data. I have successfully created a simple asp.net app that gets this data through a web service and returns an XML response file. Now I want to use this XML file in SSIS to transfer that data to either a denormalized table or a normalized database table structure.

    So I have the client (ID, name, age, etc); the assessment (ID, type, date, and some other associated data); the assessments subscale data (assessmentID, subscale name, score); another type of subscale data set with similar information; and then a few other nodes in this XML that are associated with that assessment.

    In the end I want to build a cube to analyze the data. There will not be millions of rows, but for each year maybe thousands of clients and thousands of assessments with many more thousands of data for each assessment. Should I normalize the data and create a table for client, assessments, assessment data (joined by assesmentID and clientID foreign keys)? Or should I create a denormalized data warehouse? Or both?

    When it comes to the SSIS I see that I might need to use Merge Joins with Sorts, or if I do use normalized tables, I can just dump the data into each table, however some of the assessment detail tables do not have access to the foreign keys of clientID and assessmentID. Or I could create an xls file to flatten the original XML file and dump it right into a denormalized table.

    Thanks for any advice and steps in the right direction for this. Again, ultimately I will be using SSAS data cubes to analyze this data. I can create a cube off of normalized data, right?

  • Should I normalize the data and create a table for client, assessments, assessment data (joined by assesmentID and clientID foreign keys)? Or should I create a denormalized data warehouse? Or both?

    If the point is to analyze this data I think the way to go would be to denormalize the data then create indexes and constraints to support your queries. How you get the data there depends on what the XML looks like. You may need to build some temporary staging tables pull the XML data into those then use them to build your denormalized schema.

    Could I create an xls file to flatten the original XML file and dump it right into a denormalized table.

    If by "xls" file you are talking about an Excel spreadsheet I don't know what to tell you. I'm going to assume you are talking about an XSL or XSLT (XML Transform) file and If I'm wrong you can ignore the rest of my comment because it's not relevant. With respect to ingesting XML data using SSIS you have a several options, here's a few I've used:

    1. Create a new XML task and choose XSLT as the option type then write and XML transform to "flatten" your XML data

    2. Choose another method for applying XSLT to your XML such as doing it through a scripting task or by using mdq.XMLTransform.

    ...Once the XML is flattened you could then insert it into a denormalized schema

    3. You can flatten you data with XSLT using an SSIS XML task and push the data into staging tables or directly into your denormalized schema.

    4. You could create an SSIS Execute SQL task and pull the data in using OPENXML and the T-SQL XQuery methods (e.g. nodes()).

    It just depends on how complex the XML is and what your level of XML skills are.

    I can create a cube off of normalized data, right?

    Sure but fact and dimension tables are the way to go.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Thanks Alan for the reply and yes, I did not mean XLS file (not an excel file), but an XML transformation. Simplistically, here is the schema:

    <response>

    <client ID="" name="" age="">

    <assessment id="" date="" clientID="" evaluator="" ...>

    <subscale1 name="">

    <score="">

    <notes="">

    <subscale2>

    .

    .

    .

    <assemssentSummary>

    <aFewOtherNodesHere>

    The only relation is clientID in the client node and then again in the assessmentid node. I figured if I could denormalize the data and make a single record then it would work. The subscale scores are named by a classification and there is a set amount of them. Before I left work (as I am home right now), I was trying to use a Merge Join and I successfully did merge client and assessments. Just cannot figure out how to tie in the subscales, or does the XML know that they are related? I do have a lack of XML experience, but I know enough to get around it.

    So it sounds as if I should try to flatten the XML file then move it into a denormalized database fact table. And yes, I do know some xQuery. Now I have a task for tomorrow

  • Quick thoughts, do you have an XSD for the XML? If that exists then it's normally the best starting point for modelling the database schema.

    😎

    Bulk loading the XML into a table and XQuery the content into staging tables is probably the fastest way of loading the data, it can then be merged into the destination schema.

  • I do have the XSD (schema) for the XML file and that is indeed what I based the tables on. It is here.

    <?xml version="1.0"?>

    <xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">

    <xs:element name="response">

    <xs:complexType>

    <xs:sequence>

    <xs:element minOccurs="0" name="clientAssessments">

    <xs:complexType>

    <xs:sequence>

    <xs:element minOccurs="0" maxOccurs="unbounded" name="client">

    <xs:complexType>

    <xs:sequence>

    <xs:element minOccurs="0" name="assessment">

    <xs:complexType>

    <xs:sequence>

    <xs:element minOccurs="0" name="youthSubscaleScores">

    <xs:complexType>

    <xs:sequence>

    <xs:element minOccurs="0" maxOccurs="unbounded" name="Subscale">

    <xs:complexType>

    <xs:sequence>

    <xs:element minOccurs="0" name="explanation" type="xs:string" />

    <xs:element minOccurs="0" name="planText" />

    </xs:sequence>

    <xs:attribute name="name" type="xs:string" use="optional" />

    <xs:attribute name="score" type="xs:string" use="optional" />

    <xs:attribute name="isCouldNotScore" type="xs:string" use="optional" />

    </xs:complexType>

    </xs:element>

    </xs:sequence>

    </xs:complexType>

    </xs:element>

    <xs:element minOccurs="0" name="caregiverSubscaleScores">

    <xs:complexType>

    <xs:sequence>

    <xs:element minOccurs="0" maxOccurs="unbounded" name="Subscale">

    <xs:complexType>

    <xs:sequence>

    <xs:element minOccurs="0" maxOccurs="unbounded" name="explanation" type="xs:string" />

    </xs:sequence>

    <xs:attribute name="name" type="xs:string" use="optional" />

    <xs:attribute name="materialNeedsScore" type="xs:string" use="optional" />

    <xs:attribute name="socialSupportScore" type="xs:string" use="optional" />

    <xs:attribute name="isCouldNotScore" type="xs:string" use="optional" />

    <xs:attribute name="caregiverName" type="xs:string" use="optional" />

    <xs:attribute name="caregiverRelation" type="xs:string" use="optional" />

    </xs:complexType>

    </xs:element>

    </xs:sequence>

    </xs:complexType>

    </xs:element>

    <xs:element minOccurs="0" name="riskBehaviors">

    <xs:complexType>

    <xs:sequence>

    <xs:element minOccurs="0" maxOccurs="unbounded" name="behavior" type="xs:string" />

    </xs:sequence>

    </xs:complexType>

    </xs:element>

    <xs:element minOccurs="0" name="CAFASTier" type="xs:string" />

    <xs:element minOccurs="0" name="childMgmt">

    <xs:complexType>

    <xs:attribute name="consider" type="xs:string" use="optional" />

    </xs:complexType>

    </xs:element>

    <xs:element minOccurs="0" name="severeImpairments">

    <xs:complexType>

    <xs:attribute name="total" type="xs:string" use="optional" />

    </xs:complexType>

    </xs:element>

    <xs:element minOccurs="0" name="pervasiveBehImp">

    <xs:complexType>

    <xs:attribute name="result" type="xs:string" use="optional" />

    </xs:complexType>

    </xs:element>

    <xs:element minOccurs="0" name="change">

    <xs:complexType>

    <xs:sequence>

    <xs:element minOccurs="0" name="totalScore">

    <xs:complexType>

    <xs:attribute name="change" type="xs:string" use="optional" />

    </xs:complexType>

    </xs:element>

    <xs:element minOccurs="0" name="severeImpairments">

    <xs:complexType>

    <xs:attribute name="change" type="xs:string" use="optional" />

    </xs:complexType>

    </xs:element>

    <xs:element minOccurs="0" name="pervasiveBehImp">

    <xs:complexType>

    <xs:attribute name="change" type="xs:string" use="optional" />

    </xs:complexType>

    </xs:element>

    <xs:element minOccurs="0" name="meaningfulReliableDiff">

    <xs:complexType>

    <xs:attribute name="change" type="xs:string" use="optional" />

    </xs:complexType>

    </xs:element>

    <xs:element minOccurs="0" name="improvement1orMore">

    <xs:complexType>

    <xs:attribute name="change" type="xs:string" use="optional" />

    </xs:complexType>

    </xs:element>

    </xs:sequence>

    </xs:complexType>

    </xs:element>

    </xs:sequence>

    <xs:attribute name="assessmentDate" type="xs:string" use="optional" />

    <xs:attribute name="assessmentID" type="xs:string" use="optional" />

    <xs:attribute name="isDeleted" type="xs:string" use="optional" />

    <xs:attribute name="administrationDescription" type="xs:string" use="optional" />

    <xs:attribute name="isLocked" type="xs:string" use="optional" />

    <xs:attribute name="assessmentStatus" type="xs:string" use="optional" />

    <xs:attribute name="employeeID" type="xs:string" use="optional" />

    <xs:attribute name="serviceAreaCode" type="xs:string" use="optional" />

    <xs:attribute name="serviceAreaName" type="xs:string" use="optional" />

    <xs:attribute name="programCode" type="xs:string" use="optional" />

    <xs:attribute name="programName" type="xs:string" use="optional" />

    <xs:attribute name="amendedAssessmentID" type="xs:string" use="optional" />

    <xs:attribute name="rater" type="xs:string" use="optional" />

    <xs:attribute name="episodeNumber" type="xs:unsignedByte" use="optional" />

    <xs:attribute name="totalScore" type="xs:string" use="optional" />

    </xs:complexType>

    </xs:element>

    </xs:sequence>

    <xs:attribute name="primaryId" type="xs:unsignedInt" use="optional" />

    <xs:attribute name="name" type="xs:string" use="optional" />

    <xs:attribute name="age" type="xs:unsignedByte" use="optional" />

    </xs:complexType>

    </xs:element>

    </xs:sequence>

    </xs:complexType>

    </xs:element>

    </xs:sequence>

    <xs:attribute name="exception" type="xs:unsignedByte" use="optional" />

    <xs:attribute name="error" type="xs:unsignedByte" use="optional" />

    <xs:attribute name="type" type="xs:string" use="optional" />

    </xs:complexType>

    </xs:element>

    </xs:schema>

  • Also, something I thought about that did "flatten" this file is opening the XML with Excel. I can then use the Excel file in SSIS. Wondering now if there is a way to automate that part of the process?

  • I'll have a look, can you get couple of data samples to work with, makes things easier as otherwise I would have to start guessing.

    😎

    I've done systems that to high volume XML imports, had to use the SSIS/C# route few times but those do not perform well enough in this context to be considered viable.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply