Data flow concept - Thoughts wanted

  • Tom_Hogan - Wednesday, August 23, 2017 7:11 AM

    In the meantime, we have an intern manually loading the text files to staging tables.

    BWAAA-HAAA!!!! So... "artificial intelligence". 😀😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Tom_Hogan - Wednesday, August 23, 2017 1:59 PM

    Orlando Colamatteo - Wednesday, August 23, 2017 1:12 PM

    Tom, Not sure if you saw my post but I provided a concise solution involving PowerShell for consideration. I think it might save you from having to maintain 200+ Packages and creating n more over time as you onboard more data partners.

    I started playing with PowerShell to put the data into a common format based on your suggestion.  I might go pipe delimited rather than XML for SSIS use.  Pretty cool that we can do that type of conversion with only a couple of lines of code.

    Thanks.
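    For reference, that conversion really can be just a couple of lines of PowerShell; a minimal sketch, assuming a headered CSV (the file paths are hypothetical):

        # Minimal sketch: re-save a headered CSV as pipe-delimited.
        # Import-Csv parses the header row; Export-Csv rewrites the same
        # records with '|' as the delimiter. Paths are hypothetical.
        Import-Csv -Path 'C:\inbound\customer_a.csv' |
            Export-Csv -Path 'C:\staging\customer_a.txt' -Delimiter '|' -NoTypeInformation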

    I've got to ask, what's the difference for you between comma and pipe delimited?  Might be a wasted effort to make that change.

    Although the XML method is easy, I sure don't like it from a resource usage point of view.  A 63 byte csv file suddenly exploded to a 586 byte xml file (9.3 times the size of the CSV) and then you get to import that and shred it.

    --Jeff Moden



  • Jeff Moden - Wednesday, August 23, 2017 4:13 PM

    I've got to ask, what's the difference for you between comma and pipe delimited?  Might be a wasted effort to make that change.

    Although the XML method is easy, I sure don't like it from a resource usage point of view.  A 63 byte csv file suddenly exploded to a 586 byte xml file (9.3 times the size of the CSV) and then you get to import that and shred it.

    I was mostly talking pipes vs XML as the "end" text file format.  We receive files in CSV, pipe-delimited, tab-delimited, and fixed length formats (not a whiff of XML in the bunch).  I'm not sure what benefit putting the data into XML vs., say, CSV would be from an SSIS perspective.

  • Jeff Moden - Wednesday, August 23, 2017 3:52 PM

    Tom_Hogan - Wednesday, August 23, 2017 7:11 AM

    In the meantime, we have an intern manually loading the text files to staging tables.

    BWAAA-HAAA!!!! So... "artificial intelligence". 😀😛

    I call myself Alexa when I have to load the files if there's no one else available.

  • Tom_Hogan - Wednesday, August 23, 2017 4:56 PM

    ...  I'm not sure what benefit putting the data into XML vs., say, CSV would be from an SSIS perspective.

    Me neither. I agree with Jeff's comments, at least when it comes to XML!

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Phil Parkin - Wednesday, August 23, 2017 5:42 PM

    Tom_Hogan - Wednesday, August 23, 2017 4:56 PM

    ...  I'm not sure what benefit putting the data into XML vs., say, CSV would be from an SSIS perspective.

    Me neither. I agree with Jeff's comments, at least when it comes to XML!

    The benefit could be in the ability to make the file-transformation process generic such that you never have to modify it, even when new fields are added to a data partner file format. The downside is the XML tag overhead; however, in this case I think it might be a good trade-off. I usually prefer a design that allows my data partners the flexibility to add new fields and have my ingestion process require as few changes as possible.

    If we look at two solutions side by side and walk through a scenario where a data partner wants to add a field that I eventually want in my database, we can see a clear difference in the level of effort and coordination required to respond to the change. Say Data Partner ABC wants to start sending me date_of_birth as a new field in a file they already send me.

    Solution 1 converts all incoming flat-files into a common flat-file format (e.g. pipe-delimited) which is fed to a common SSIS Package. For this solution to keep up with ABC's changes I would need to modify the file-transformation process on the same day ABC introduces the date_of_birth field in their file, to ensure it is passed through in the common flat-file format the SSIS Package receives. The first part of this is tightly coupled to the data partner's changes and requires me to deliver changes to my process in order to begin processing the new data field, a type of arrangement I like to avoid with data partners.

    Solution 2 converts all incoming flat-files into a generic XML format which is fed to a common SSIS Package. For this to keep up with ABC's changes I will not need to modify the file-transformation process, and ABC can roll out, or roll back, their change to introduce date_of_birth without affecting anything on my side, because date_of_birth will automatically show up in the XML file per the generic transform process. Because my SSIS Package already knows about date_of_birth (other data partners already send it to me), I have no changes to make, and now I am getting date_of_birth from ABC.
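    As a rough illustration, the generic transform Solution 2 relies on could be as small as this sketch (the paths, and the assumption that every incoming file is a headered CSV, are illustrative):

        # Sketch of a generic transform: any headered CSV becomes XML with
        # one <Object> element per row and one <Property Name="..."> element
        # per column, so a new column flows through without changing this script.
        $inFile  = 'C:\inbound\abc_20170823.csv'   # hypothetical paths
        $outFile = 'C:\staging\abc_20170823.xml'

        Import-Csv -Path $inFile |
            ConvertTo-Xml -NoTypeInformation -As String |
            Set-Content -Path $outFile

    Pipe- and tab-delimited sources work with the same script via Import-Csv -Delimiter; fixed-length files would need a parsing step of their own first.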

    This is one scenario where some changes were required in one solution and no changes were required in the other. There are other scenarios where some changes would need to happen in either solution, but for the ones coming to mind the XML-based solution still requires less change.

    Will byte counts be larger with XML? Without a doubt. In most cases I am willing to trade bytes for flexibility, resilience and the ability to insulate my systems from external changes.

    Hopefully all my assumptions about your situation are close to your real situation. Sorry for the winding and weaving post. Hopefully it makes sense. Let me know how you're thinking about the solution.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Orlando Colamatteo - Wednesday, August 23, 2017 8:44 PM

    The benefit could be in the ability to make the file-transformation process generic such that you never have to modify it, even when new fields are added to a data partner file format. The downside is the XML tag overhead; however, in this case I think it might be a good trade-off. I usually prefer a design that allows my data partners the flexibility to add new fields and have my ingestion process require as few changes as possible.

    If we look at two solutions side by side and walk through a scenario where a data partner wants to add a field that I eventually want in my database, we can see a clear difference in the level of effort and coordination required to respond to the change. Say Data Partner ABC wants to start sending me date_of_birth as a new field in a file they already send me.

    Solution 1 converts all incoming flat-files into a common flat-file format (e.g. pipe-delimited) which is fed to a common SSIS Package. For this solution to keep up with ABC's changes I would need to modify the file-transformation process on the same day ABC introduces the date_of_birth field in their file, to ensure it is passed through in the common flat-file format the SSIS Package receives. The first part of this is tightly coupled to the data partner's changes and requires me to deliver changes to my process in order to begin processing the new data field, a type of arrangement I like to avoid with data partners.

    Solution 2 converts all incoming flat-files into a generic XML format which is fed to a common SSIS Package. For this to keep up with ABC's changes I will not need to modify the file-transformation process, and ABC can roll out, or roll back, their change to introduce date_of_birth without affecting anything on my side, because date_of_birth will automatically show up in the XML file per the generic transform process. Because my SSIS Package already knows about date_of_birth (other data partners already send it to me), I have no changes to make, and now I am getting date_of_birth from ABC.

    This is one scenario where some changes were required in one solution and no changes were required in the other. There are other scenarios where some changes would need to happen in either solution, but for the ones coming to mind the XML-based solution still requires less change.

    Will byte counts be larger with XML? Without a doubt. In most cases I am willing to trade bytes for flexibility, resilience and the ability to insulate my systems from external changes.

    Hopefully all my assumptions about your situation are close to your real situation. Sorry for the winding and weaving post. Hopefully it makes sense. Let me know how you're thinking about the solution.

    I think we're close.  I'm going to try to apply my understanding of what's been proposed to how the data is actually coming in.

    Here's what I'm thinking the data flow will be:

    1. Data is delivered in a variety of flat file formats.
    2. We use PowerShell scripts to transform the data into a common format.
    3. The SSIS package then picks up the common-format file and processes it.

    Using Date of Birth as the example.  What we're getting is that Customer_A sends it as DOB, Customer_B doesn't send it at all, and Customer_C sends it as BirthDate.  I'd have three PoSh scripts to copy the file data to the common format where the field is called date_of_birth (we can pretend Customer_D sends it in with that exact name).

    Scenario 1: Existing column.   In either solution, I expect I'd be passing an empty string to the date_of_birth field for Customer_B.  We need to have that column defined in the common format for it to be processed.  I'd then have to modify the PoSh script for Customer_B and update it to pass the DateOfBirth column they added to the common format date_of_birth field.  Until I modify the script, the DateOfBirth field is ignored.

    Scenario 2: New column.  No customers are sending me Date of Birth.  Now Customer_B wants to send us DateOfBirth.  In Solution 1, I'd have to update all my PoSh scripts to pass an empty value to the date_of_birth column, with the script for Customer_B passing an actual value to date_of_birth.  Again, because we need to have that column defined in the common format for it to be processed.  I'd then have to modify my SSIS to handle the new column.  In Solution 2, I'd have to modify just the PoSh script for Customer_B to add date_of_birth to the XML as an element.  The other scripts don't need to be touched as I only care about the element once I get to parsing.  I'd then have to modify my SSIS to handle the new column (parse the data if it exists, ignore it if it doesn't).

    Does that make sense?  Am I missing anything?

    From what I can tell, XML allows me to add elements without having to modify a bunch of customer-specific PoSh scripts to add a new field.  A flat file saves me the step of parsing the XML (I'll admit I haven't done much XML with SSIS), saves drive space, and lets me perform transformations / conversions as part of the Data Flow Task.

    As I mentioned previously, we currently have to take the data in the format the customers send.  Hopefully, we can eventually dictate a couple of acceptable formats.

    Thanks.

  • I think we're close. I'm going to try to apply my understanding of what's been proposed to how the data is actually coming in.

    Here's what I'm thinking the data flow will be:

    1. Data is delivered in a variety of flat file formats.
    2. We use PowerShell scripts to transform the data into a common format.
    3. The SSIS package then picks up the common-format file and processes it.

    We're on the same line of thinking, but for clarity, in the case of XML (Solution 2), I'd propose we split step 2.

    1. Data is delivered in a variety of flat file formats.
    2. We use a customer-specific PowerShell script to adjust the column header to match the field names used in the common format. This script would only be needed for customers whose files have column headers with alternate field names, e.g. DateOfBirth instead of date_of_birth. For the customers that need one, it is a one-line PowerShell script.
    3. We use one generic PowerShell script to transform any incoming customer data file into XML. This script would be called for every customer file.
    4. The SSIS package then picks up the common XML file and processes it.

    Using Date of Birth as the example. What we're getting is that Customer_A sends it as DOB, Customer_B doesn't send it at all, and Customer_C sends it as BirthDate. I'd have three PoSh scripts to copy the file data to the common format where the field is called date_of_birth (we can pretend Customer_D sends it in with that exact name).

    In the XML approach you'd only need to create two PoSh scripts: one for Customer_A that changes the column header from DOB to date_of_birth, and one for Customer_C that changes the column header from BirthDate to date_of_birth. Since Customer_B does not send the data, the XML produced from their file will not have that attribute, and SSIS will treat it as NULL when it consumes the XML.
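    A sketch of what that customer-specific header rewrite might look like for Customer_A (file names are hypothetical):

        # Sketch: rewrite only the header line, mapping Customer_A's DOB
        # column name to the common date_of_birth name. Paths are hypothetical.
        $lines = Get-Content -Path 'C:\inbound\customer_a.csv'
        $lines[0] = $lines[0] -replace '\bDOB\b', 'date_of_birth'
        $lines | Set-Content -Path 'C:\staging\customer_a.csv'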

    Scenario 1: Existing column. 

    In either solution, I expect I'd be passing an empty string to the date_of_birth field for Customer_B.

    With XML you simply would not see the attribute for date_of_birth in the XML file. In XML the absence of the element or attribute implies NULL, whereas in a flat file the field still needs to be there, yet be empty; structure is mixed with data, and that makes it rigid compared to XML. In the SSIS Package, even though it is set up to map date_of_birth, if the field is not there it is treated as NULL. This is an important advantage over flat files: in a flat-file structure, unless you agree on a sentinel value for NULL with your data provider, you have no way to differentiate NULL from an empty string.
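    To make the absence-implies-NULL point concrete, a small sketch with invented rows; the generic transform emits a Property element only for columns that actually exist in the source:

        # Two tiny in-memory 'files': one customer sends date_of_birth, one doesn't.
        $withDob    = "name,date_of_birth`nAnn,1980-01-02" | ConvertFrom-Csv
        $withoutDob = "name`nBob"                          | ConvertFrom-Csv

        $withDob | ConvertTo-Xml -NoTypeInformation -As String
        # ...<Property Name="name">Ann</Property>
        #    <Property Name="date_of_birth">1980-01-02</Property>...

        $withoutDob | ConvertTo-Xml -NoTypeInformation -As String
        # ...<Property Name="name">Bob</Property>...  (no date_of_birth at all)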

    Scenario 2: New column. No customers are sending me Date of Birth. Now Customer_B wants to send us DateOfBirth. In Solution 1, I'd have to update all my PoSh scripts to pass an empty value to the date_of_birth column, with the script for Customer_B passing an actual value to date_of_birth. Again, because we need to have that column defined in the common format for it to be processed. I'd then have to modify my SSIS to handle the new column.

    Agreed. So when you are using a flat file as your common format, all PoSh scripts would need to change to introduce a new data point for any one customer. This is where the XML solution really starts to look more attractive from my standpoint.

    In Solution 2, I'd have to modify just the PoSh script for Customer_B to add date_of_birth to the XML as an element. The other scripts don't need to be touched as I only care about the element once I get to parsing. I'd then have to modify my SSIS to handle the new column (parse the data if it exists, ignore it if it doesn't).

    With Solution 2, if you decided to simply adopt DateOfBirth as the column name, you would have no changes to make to any PoSh scripts. If you wanted it to be date_of_birth, then yes, you'd have to change Customer_B's PoSh script to flip the column header from DateOfBirth to date_of_birth before converting the file to XML format.

    Does that make sense? Am I missing anything?

    It seemed to and I think I was able to follow you all the way through. Please let me know if I went off track.

    From what I can tell, XML allows me to add elements without having to modify a bunch of customer-specific PoSh scripts to add a new field.

    Yes, a primary benefit in my opinion, especially as you broaden your customer base or your existing customers broaden the set of data points they want to provide.

    A flat file saves me the step of parsing the XML (I'll admit I haven't done much XML with SSIS), saves drive space, and lets me perform transformations / conversions as part of the Data Flow Task.

    Not sure if I am reading you correctly, but when SSIS maps the data coming out of the XML files you get data in a pipeline the same way you would from a flat file, and that pipeline can be run through any Transformation Component in your Data Flow Task that you want. Once it's in the pipeline SSIS has no clue whether it came from an XML file or a flat file; they are treated the same. The main difference is that one XML Source can generate many pipelines (i.e. Outputs) since the structure is hierarchical. I would suggest you run through one or two demos with SSIS consuming an XML file. I think once you do one or two, my proposed solution may make a lot more sense and you'll see how the data enters the Data Flow and is usable with Transforms and Destinations. You may find XML is not for you, but it seems you could get some material benefits from using it in your situation.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Seems like it would use a whole lot less resources if you...

    1. Bulk insert just the first row as a blob.
    2. Split the first row to get the header names.
    3. Use those header names to do whatever mapping you need to do and create the staging table.
    4. Use dynamic SQL to build the Bulk Insert.
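    A hedged sketch of those four steps in T-SQL (table names and the file path are hypothetical, and Jeff's own demo may differ; note that STRING_SPLIT needs SQL Server 2016+ and does not guarantee output order, so a splitter that returns an item number, such as DelimitedSplit8K, is safer for building the column list):

        -- 1. Bulk insert just the first row as a single blob column.
        --    A '\0' field terminator keeps the whole line in one field.
        CREATE TABLE #Header (HeaderRow VARCHAR(8000));

        BULK INSERT #Header
        FROM 'C:\inbound\customer_a.csv'
        WITH (LASTROW = 1, FIELDTERMINATOR = '\0', ROWTERMINATOR = '\n');

        -- 2 & 3. Split the header row into column names and use them to
        --        build and create the staging table (drop/exists handling omitted).
        DECLARE @Sql NVARCHAR(MAX);

        SELECT @Sql = N'CREATE TABLE dbo.Staging ('
            + STUFF((SELECT N', ' + QUOTENAME(LTRIM(RTRIM(s.value))) + N' VARCHAR(8000)'
                     FROM #Header h
                     CROSS APPLY STRING_SPLIT(h.HeaderRow, ',') s
                     FOR XML PATH('')), 1, 2, N'')
            + N');';
        EXEC sys.sp_executesql @Sql;

        -- 4. Use dynamic SQL to build the real BULK INSERT, skipping the header row.
        SET @Sql = N'BULK INSERT dbo.Staging FROM ''C:\inbound\customer_a.csv'' '
                 + N'WITH (FIRSTROW = 2, FIELDTERMINATOR = '','', ROWTERMINATOR = ''\n'');';
        EXEC sys.sp_executesql @Sql;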

    --Jeff Moden



  • Orlando Colamatteo - Thursday, August 24, 2017 9:50 PM

    I think we're close. I'm going to try to apply my understanding of what's been proposed to how the data is actually coming in.

    Here's what I'm thinking the data flow will be:

    1. Data is delivered in a variety of flat file formats.
    2. We use PowerShell scripts to transform the data into a common format.
    3. The SSIS package then picks up the common-format file and processes it.

    We're on the same line of thinking, but for clarity, in the case of XML (Solution 2), I'd propose we split step 2.

    1. Data is delivered in a variety of flat file formats.
    2. We use a customer-specific PowerShell script to adjust the column header to match the field names used in the common format. This script would only be needed for customers whose files have column headers with alternate field names, e.g. DateOfBirth instead of date_of_birth. For the customers that need one, it is a one-line PowerShell script.
    3. We use one generic PowerShell script to transform any incoming customer data file into XML. This script would be called for every customer file.
    4. The SSIS package then picks up the common XML file and processes it.

    Using Date of Birth as the example. What we're getting is that Customer_A sends it as DOB, Customer_B doesn't send it at all, and Customer_C sends it as BirthDate. I'd have three PoSh scripts to copy the file data to the common format where the field is called date_of_birth (we can pretend Customer_D sends it in with that exact name).

    In the XML approach you'd only need to create two PoSh scripts: one for Customer_A that changes the column header from DOB to date_of_birth, and one for Customer_C that changes the column header from BirthDate to date_of_birth. Since Customer_B does not send the data, the XML produced from their file will not have that attribute, and SSIS will treat it as NULL when it consumes the XML.

    Scenario 1: Existing column. 

    In either solution, I expect I'd be passing an empty string to the date_of_birth field for Customer_B.

    With XML you simply would not see the attribute for date_of_birth in the XML file. In XML the absence of the element or attribute implies NULL, whereas in a flat file the field still needs to be there, yet be empty; structure is mixed with data, and that makes it rigid compared to XML. In the SSIS Package, even though it is set up to map date_of_birth, if the field is not there it is treated as NULL. This is an important advantage over flat files: in a flat-file structure, unless you agree on a sentinel value for NULL with your data provider, you have no way to differentiate NULL from an empty string.

    Scenario 2: New column. No customers are sending me Date of Birth. Now Customer_B wants to send us DateOfBirth. In Solution 1, I'd have to update all my PoSh scripts to pass an empty value to the date_of_birth column, with the script for Customer_B passing an actual value to date_of_birth. Again, because we need to have that column defined in the common format for it to be processed. I'd then have to modify my SSIS to handle the new column.

    Agreed. So when you are using a flat file as your common format, all PoSh scripts would need to change to introduce a new data point for any one customer. This is where the XML solution really starts to look more attractive from my standpoint.

    In Solution 2, I'd have to modify just the PoSh script for Customer_B to add date_of_birth to the XML as an element. The other scripts don't need to be touched as I only care about the element once I get to parsing. I'd then have to modify my SSIS to handle the new column (parse the data if it exists, ignore it if it doesn't).

    With Solution 2, if you decided to simply adopt DateOfBirth as the column name, you would have no changes to make to any PoSh scripts. If you wanted it to be date_of_birth, then yes, you'd have to change Customer_B's PoSh script to flip the column header from DateOfBirth to date_of_birth before converting the file to XML format.

    Does that make sense? Am I missing anything?

    It seemed to and I think I was able to follow you all the way through. Please let me know if I went off track.

    From what I can tell, XML allows me to add elements without having to modify a bunch of customer-specific PoSh scripts to add a new field.

    Yes, a primary benefit in my opinion, especially as you broaden your customer base or your existing customers broaden the set of data points they want to provide.

    A flat file saves me the step of parsing the XML (I'll admit I haven't done much XML with SSIS), saves drive space, and lets me perform transformations / conversions as part of the Data Flow Task.

    Not sure if I am reading you correctly, but when SSIS maps the data coming out of the XML files you get data in a pipeline the same way you would from a flat file, and that pipeline can be run through any Transformation Component in your Data Flow Task that you want. Once it's in the pipeline SSIS has no clue whether it came from an XML file or a flat file; they are treated the same. The main difference is that one XML Source can generate many pipelines (i.e. Outputs) since the structure is hierarchical. I would suggest you run through one or two demos with SSIS consuming an XML file. I think once you do one or two, my proposed solution may make a lot more sense and you'll see how the data enters the Data Flow and is usable with Transforms and Destinations. You may find XML is not for you, but it seems you could get some material benefits from using it in your situation.

    Thanks Orlando, this helps a lot.  I'm going to try to put together a proof of concept using this method in the next month or two.

  • Jeff Moden - Thursday, August 24, 2017 11:47 PM

    Seems like it would use a whole lot less resources if you...

    1. Bulk insert just the first row as a blob.
    2. Split the first row to get the header names.
    3. Use those header names to do whatever mapping you need to do and create the staging table.
    4. Use dynamic SQL to build the Bulk Insert.

    Hey Jeff,

    When you say Bulk Insert do you mean BCP or via SSIS?  Also, can you expand on how you'd do step 3 (procedure, SSIS expression, other)?

    Thanks.

  • Tom_Hogan - Friday, August 25, 2017 8:07 AM

    Jeff Moden - Thursday, August 24, 2017 11:47 PM

    Seems like it would use a whole lot less resources if you...

    1. Bulk insert just the first row as a blob.
    2. Split the first row to get the header names.
    3. Use those header names to do whatever mapping you need to do and create the staging table.
    4. Use dynamic SQL to build the Bulk Insert.

    Hey Jeff,

    When you say Bulk Insert do you mean BCP or via SSIS?  Also, can you expand on how you'd do step 3 (procedure, SSIS expression, other)?

    Thanks.

    No.  There's a BULK INSERT statement in T-SQL and this would be done without SSIS.  I'll see what I can do about a demo after work late tonight or tomorrow.

    --Jeff Moden



  • Hi Tom, One other thing to consider...has your company talked about moving any of its infrastructure to the Microsoft Cloud (Azure)? If so, the solution Jeff is proposing could not be ported to an Azure SQL Database service due to the file system access it relies on. In addition to choosing solutions that insulate me from changes unpredictable data partners make, I also look to choose solutions that insulate me from specific hardware or infrastructure choices. Leveraging PowerShell & SSIS from outside the database engine as opposed to using xp_cmdshell and BULK INSERT from inside the database engine will allow you a ton more flexibility long-term in this area.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Orlando Colamatteo - Sunday, August 27, 2017 9:34 PM

    Hi Tom, One other thing to consider...has your company talked about moving any of its infrastructure to the Microsoft Cloud (Azure)? If so, the solution Jeff is proposing could not be ported to an Azure SQL Database service due to the file system access it relies on. In addition to choosing solutions that insulate me from changes unpredictable data partners make, I also look to choose solutions that insulate me from specific hardware or infrastructure choices. Leveraging PowerShell & SSIS from outside the database engine as opposed to using xp_cmdshell and BULK INSERT from inside the database engine will allow you a ton more flexibility long-term in this area.

    There are no plans or discussions to move anything to Azure or any cloud-based database solution.  I'd be surprised if there were any discussions in the next couple of years.

    I've been researching options to see what would meet our needs while being maintainable with the skill sets we have in-house.  The suggestions have been very helpful.  I've been lucky that at my last couple of employers we've been able to limit our file formats to a couple of options for each database application.  That made implementing SSIS solutions a whole lot easier (when you have a hammer...).

  • Orlando Colamatteo - Sunday, August 27, 2017 9:34 PM

    Hi Tom, One other thing to consider...has your company talked about moving any of its infrastructure to the Microsoft Cloud (Azure)? If so, the solution Jeff is proposing could not be ported to an Azure SQL Database service due to the file system access it relies on. In addition to choosing solutions that insulate me from changes unpredictable data partners make, I also look to choose solutions that insulate me from specific hardware or infrastructure choices. Leveraging PowerShell & SSIS from outside the database engine as opposed to using xp_cmdshell and BULK INSERT from inside the database engine will allow you a ton more flexibility long-term in this area.

    Agreed.  Azure doesn't lend itself well to such things even if the use of xp_CmdShell probably isn't necessary here.

    Since I'm no wizard when it comes to Azure/etc, have you found any shortcomings for either PowerShell or SSIS in the cloud compared to on-site systems?

    --Jeff Moden


