SSIS Design Pattern - Staging Delimited Flat Files

  • I agree with bringing data in as pure text (varchar) to what I call a "raw data" staging table, before doing the data conversion in another step. One thing missing from this design pattern is an "Exceptions" table, where rows that fail data conversion -- or (in another step) data validation -- can be copied. The "Exceptions" table should contain a column for error type, and a limited number of descriptive, pre-defined exception types can be used, which simplifies generating exception reports later.
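    A sketch of that idea (table and column names here are invented for illustration, not from the article): the raw staging table lands every column as text, and the exceptions table carries a constrained error-type column so exception reports stay simple.

    ```sql
    -- Hypothetical raw staging table: every column lands as varchar.
    CREATE TABLE dbo.StageCustomerRaw (
        RowID      int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        CustomerID varchar(50)  NULL,
        OrderDate  varchar(50)  NULL,
        Amount     varchar(50)  NULL,
        SourceFile varchar(260) NULL
    );

    -- Exceptions table with a limited set of pre-defined error types,
    -- which keeps exception reporting straightforward.
    CREATE TABLE dbo.StageCustomerExceptions (
        RowID      int          NOT NULL,
        CustomerID varchar(50)  NULL,
        OrderDate  varchar(50)  NULL,
        Amount     varchar(50)  NULL,
        SourceFile varchar(260) NULL,
        ErrorType  varchar(30)  NOT NULL
            CONSTRAINT CK_StageCustomerExceptions_ErrorType CHECK (ErrorType IN
                ('CONVERSION_FAILED', 'VALIDATION_FAILED', 'DUPLICATE_ROW')),
        LoggedAt   datetime2    NOT NULL DEFAULT SYSUTCDATETIME()
    );
    ```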

  • I agree with everyone that data conversions should happen in SQL; import everything as varchar. I worked with a programmer who tried determining data types in SSIS and NCOA'd thousands of records without an apartment number.

    I usually use a for each container and record the import file name in the staging table.

    I usually add an identity column to the staging table

    For added flexibility I usually set the processing path in a variable, which makes it easier to move from staging environments to production.

    I usually archive the files I process too; if the name is not unique, add a date stamp to the file name.

    If this process is going to be run regularly, logging is also a good idea.

    Last but not least, if there is data processing that needs to occur afterwards, I make sure that stored procedures are used (as opposed to SQL Tasks in SSIS).

    They are easier to modify outside of SSIS, and if the package gets moved to production, that makes maintenance easier.

    While there are cases where you wouldn't want to do some of the steps I mentioned, this is how I do most of my ETL.
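    A minimal sketch of the stored-procedure approach described above (procedure, table, and column names are hypothetical): copy rows that fail conversion into an exceptions table, then insert the rest with real data types. `TRY_CONVERT` requires SQL Server 2012 or later.

    ```sql
    CREATE PROCEDURE dbo.LoadCustomerFromStage
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Rows whose text fields will not convert go to the exceptions table.
        INSERT INTO dbo.StageCustomerExceptions
            (RowID, CustomerID, OrderDate, Amount, SourceFile, ErrorType)
        SELECT RowID, CustomerID, OrderDate, Amount, SourceFile, 'CONVERSION_FAILED'
        FROM dbo.StageCustomerRaw
        WHERE TRY_CONVERT(int, CustomerID) IS NULL
           OR TRY_CONVERT(date, OrderDate) IS NULL
           OR TRY_CONVERT(decimal(10,2), Amount) IS NULL;

        -- Remaining rows convert cleanly into the typed target table.
        INSERT INTO dbo.Customer (CustomerID, OrderDate, Amount, SourceFile)
        SELECT TRY_CONVERT(int, CustomerID),
               TRY_CONVERT(date, OrderDate),
               TRY_CONVERT(decimal(10,2), Amount),
               SourceFile
        FROM dbo.StageCustomerRaw AS r
        WHERE NOT EXISTS (SELECT 1
                          FROM dbo.StageCustomerExceptions AS e
                          WHERE e.RowID = r.RowID);
    END;
    ```

    Keeping this logic in a stored procedure rather than an SSIS SQL Task means the conversion rules can be changed and redeployed without touching the package.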

  • For added flexibility I usually set the processing path in a variable, which makes it easier to move from staging environments to production.

    Chris, I hope you mean something like a package configuration or package/project parameters (SSIS 2012 and above). Other than that, most of these are good points and frequently needed depending on the situation. I couldn't possibly cover all of them in one article. The pattern shows the bare minimum - getting that flat file to the staging table. Depending on what else is needed (looping, logging, restartability, etc.), you'd need to build them on top of the actual data flow. Thanks for the comment!

  • Alan.B (4/20/2015)


    Great article Sam. Simple, to the point, easy read - well done sir!

    Thanks for the kind words! Glad you liked it.

  • I also like to keep track of what row a record is in the dataset so I can report back where the problem is, which field, and the nature of the error.

    The business generally wants to know all the errors in the file, not just the first one, so treating the input fields as character is a good idea. If a row passes the edits, its fields can be converted to the correct format.
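    One way to report every failing field rather than stopping at the first error (a sketch only; the table and column names are hypothetical) is to test each field with `TRY_CONVERT` and emit one exception row per failure, keyed by the staging row number:

    ```sql
    -- Report every conversion failure in the file, one row per bad field,
    -- with the staging row number so the source line can be identified.
    SELECT RowID, 'OrderDate' AS FieldName, OrderDate AS BadValue
    FROM dbo.StageCustomerRaw
    WHERE OrderDate IS NOT NULL
      AND TRY_CONVERT(date, OrderDate) IS NULL
    UNION ALL
    SELECT RowID, 'Amount', Amount
    FROM dbo.StageCustomerRaw
    WHERE Amount IS NOT NULL
      AND TRY_CONVERT(decimal(10,2), Amount) IS NULL
    ORDER BY RowID;
    ```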

  • So, it’s better to convert non-string data and use appropriate data types in the staging table.

    Completely disagree. Your first job in extraction is to own the data. If I'm importing from Excel, then all columns are nvarchar(255). From text files, everything is a string. Make changes from the landing table to the staging table. It will be much easier to manage exceptions if the data is in a landing table.

    I agree with keeping the same column names for the same reason. Your first job is to own the data. Make it look as close to the original data source as possible.

  • Please, PLEASE stop calling stuff like this a 'Design Pattern' - this is maybe a template, maybe a 'tip' - but it's not a 'Design Pattern'.

    Go look up the GoF (Gang of Four) book; calling this a design pattern is a joke.
