SSIS Design Pattern - Staging Fixed Width Flat Files

  • Comments posted to this topic are about the item SSIS Design Pattern - Staging Fixed Width Flat Files

  • In almost all cases, parsing in the data flow pefroms (sic) better than parsing at the source, if not at the same performance level.

    Sam, I'm sorry, this sentence doesn't make any sense to me. If something performs better, say so. But adding the qualifier "if not at the same performance level" negates any meaning.

    I'm not saying your methodology is wrong - in fact, in terms of human effort, it's vastly superior IMO, but I can't work out whether you've added the qualifier because the total processing time is increased (is it?) or for some other reason.

  • Moreover, if the value is unknown or unavailable you’ll see a blank string, such as ‘ ‘. It is often better to convert these blanks to NULL. You can of course load the blank value into the database column without any transformations, but again, my preference is to convert it to NULL.

    Just to give a different view: I would convert it to '' if the value was meant to indicate that there is no value, and to NULL if it was meant to indicate that the value is not known or is in some way uncertain. If that can't be determined, it's a judgement call.
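
    If you stage the raw value first, either convention is a one-line cleanup in T-SQL. A minimal sketch - the table and column names are made up for illustration:

        -- Convert all-blank strings to NULL ("value not known").
        -- NULLIF returns NULL when its two arguments are equal.
        UPDATE dbo.StgCustomer
        SET MiddleInitial = NULLIF(LTRIM(RTRIM(MiddleInitial)), '');

        -- The opposite convention ("no value" = empty string):
        -- UPDATE dbo.StgCustomer SET MiddleInitial = ISNULL(MiddleInitial, '');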

  • Thanks for the article. We do a lot of these sorts of imports and have many of the same issues, such as dealing with blanks. But we have already determined what to do with them when we created our destination tables, by setting whether or not to allow NULLs.

  • My suggestion: instead of using two Derived Column transformations and one Data Conversion between the source and the destination, which costs performance, use Fixed Width in the Flat File Connection Manager when you know the length of each column. In the Advanced tab, add a column for each field in the order they appear in the file, set each column's data type and length, and map the columns directly to the destination table.

    This reduces the number of transformations used and gives high performance when loading billions of records from a source file.

  • This is a good start; however, I do not see any error handling. Do you just let the package fail? If so, how would you troubleshoot a failure?

    Gerald Britton, Pluralsight courses

  • We do a lot of file imports and exports. One thing that really helps is to add a last-updated column to your staging table. We've had many cases where the package did not send an email error notice when it failed. By looking at the staging table's update column we can quickly see whether the data was processed or not.
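
    For instance, a staging table along these lines - the names are illustrative - defaults the column on insert, so the package doesn't have to populate it:

        CREATE TABLE dbo.StgCustomer
        (
            CustomerId   VARCHAR(10) NULL,
            CustomerName VARCHAR(50) NULL,
            -- Populated automatically on insert; checking MAX(LoadDate)
            -- tells you whether the last run actually touched the table.
            LoadDate     DATETIME NOT NULL DEFAULT (GETDATE())
        );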

  • t.pinder (10/6/2015)


    In almost all cases, parsing in the data flow pefroms (sic) better than parsing at the source, if not at the same performance level.

    Sam, I'm sorry, this sentence doesn't make any sense to me. If something performs better, say so. But adding the qualifier "if not at the same performance level" negates any meaning.

    I'm not saying your methodology is wrong - in fact, in terms of human effort, it's vastly superior IMO, but I can't work out whether you've added the qualifier because the total processing time is increased (is it?) or for some other reason.

    t.pinder - Thanks for this comment, I really like what you said. Definitely something for me to learn as I write more!

  • Honny (10/6/2015)


    My suggestion: instead of using two Derived Column transformations and one Data Conversion between the source and the destination, which costs performance, use Fixed Width in the Flat File Connection Manager when you know the length of each column. In the Advanced tab, add a column for each field in the order they appear in the file, set each column's data type and length, and map the columns directly to the destination table.

    This reduces the number of transformations used and gives high performance when loading billions of records from a source file.

    Both Derived Column and Data Conversion are synchronous transformations - they modify rows in place in the existing buffer - so they shouldn't be a drag on performance. Having said that, I do plan on doing a performance test and blogging about it. It could take more than a couple of weeks, though.

  • g.britton (10/6/2015)


    This is a good start; however, I do not see any error handling. Do you just let the package fail? If so, how would you troubleshoot a failure?

    As I stated in the article, error handling is out of scope - I'm not suggesting it isn't required. I wanted to focus on the actual pattern.

  • This is a very good point. To me it is unforgivable to allow process aborts. Better to have a table for error rows and/or groups of rows, then send notifications to the data owner, and even return the invalid data that was not processed. You must never create processes that require IT personnel to intervene: bad data must be rejected and returned to its creator. Depending on the nature of the data, you may need to reject a whole file batch, or complete documents, or whatever the application requires, but be sure YOU don't have to handle the problems.
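
    One simple shape for such a table - the names are illustrative, and in SSIS the error output of a transformation can feed a destination pointing at it:

        CREATE TABLE dbo.StgCustomerError
        (
            ErrorId     INT IDENTITY(1,1) PRIMARY KEY,
            SourceFile  VARCHAR(260) NOT NULL,  -- which file the row came from
            RawLine     VARCHAR(500) NOT NULL,  -- the unparsed fixed-width record
            ErrorReason VARCHAR(200) NOT NULL,  -- why it was rejected
            LoggedAt    DATETIME NOT NULL DEFAULT (GETDATE())
        );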

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • If a field is a 6-character number, you’d need 8 bytes to store it as a string using the VARCHAR(6) data type (one byte per character plus 2 bytes of overhead, according to Books Online). Instead, if you convert it to an INT data type, you’d need only 4 bytes

    This tends to be something to approach with caution. If the value isn't going to have math done on it, then it probably isn't a number. A US ZIP code is technically a numeric value, but it has leading zeros, so converting it to a numeric data type will make the data incorrect.

    I'm not saying to never convert numeric-type strings to numeric data types. I'm saying you need to understand your data and plan your fields accordingly.
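
    A quick T-SQL illustration of the ZIP code point:

        -- A Massachusetts ZIP code with a leading zero:
        SELECT CAST('02116' AS INT) AS ZipAsInt;    -- returns 2116; the zero is gone

        -- Converting back does not restore it:
        SELECT CAST(2116 AS VARCHAR(5)) AS ZipBack; -- '2116', now only 4 characters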

  • This tends to be something to approach with caution. If the value isn't going to have math done on it, then it probably isn't a number. A US ZIP code is technically a numeric value, but it has leading zeros, so converting it to a numeric data type will make the data incorrect.

    I'm not saying to never convert numeric-type strings to numeric data types. I'm saying you need to understand your data and plan your fields accordingly.

    Agree completely. Some consultants not from the US wanted us to use an INT field for SSNs. They didn't know that many SSNs begin with zero. The math rule is a good one, especially for numbers that originate outside the local enterprise. I have used numeric types for internal numbers, such as SKUs, where I'm assured they will always start with a non-zero digit.

  • I'm guilty of using INT for SSNs a long time ago (and I'm not from the US). You can imagine that was short-lived!

  • I read this a few days ago and it looked very similar to my workflow but the one thing that popped out at me was how you read your file in as a single column.

    "Really?!" I thought, "That sounds kind of fiddly to me."

    This morning, I get an email from the business line: "If our vendor adds a column to the end of that production file, will it break your import?"

    Yes, it will. Wouldn't it be great if there were some way to avoid that .... oh, yeah ....
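
    For anyone who hasn't read the article: because the file is read as a single column and parsed downstream, only the positions you name are ever touched. The article does this with a Derived Column in the data flow; the same idea expressed in T-SQL, with made-up positions and names:

        -- Parse the fixed-width record after staging it as one RawLine column.
        -- A new column the vendor appends to the end is simply ignored,
        -- instead of breaking the import.
        SELECT
            SUBSTRING(RawLine,  1, 10) AS CustomerId,
            SUBSTRING(RawLine, 11, 30) AS CustomerName,
            SUBSTRING(RawLine, 41,  5) AS ZipCode
        FROM dbo.StgRawLines;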
