Should SSIS be used for the following?

  • Hi,

    Can SSIS be used to implement the following:

    1. Each company will supply a monthly text file.

    2. A company's file format will be the same each month.

    3. No two companies will use the same file format; the column order and number of columns differ from company to company.

    4. I want to poll a Windows folder for the text files and import them into a staging table, ready for validation.

    I can do this using a stored procedure by creating a format file for each client and storing its name in a table. The client name is included in the file name, so it can be used to look up the appropriate format file, which is then added dynamically to the BULK INSERT command.
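
    Roughly, the stored proc approach I have in mind looks like the sketch below (the table and column names here are just illustrative):

        -- Illustrative sketch only - object names are made up.
        DECLARE @ClientName  varchar(100),
                @SourceFile  varchar(260),
                @FormatFile  varchar(260),
                @sql         nvarchar(4000);

        -- The client name is parsed out of the incoming file name elsewhere.
        SET @SourceFile = 'C:\Imports\AcmeCorp_200701.txt';
        SET @ClientName = 'AcmeCorp';

        -- Look up the format file registered for this client.
        SELECT @FormatFile = FormatFileName
        FROM   dbo.ClientFormatFile
        WHERE  ClientName = @ClientName;

        -- Build and run the BULK INSERT dynamically.
        SET @sql = N'BULK INSERT dbo.Staging_ClientImport
                     FROM ''' + @SourceFile + N'''
                     WITH (FORMATFILE = ''' + @FormatFile + N''')';
        EXEC (@sql);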

    Is it practical to create this solution in SSIS, or am I better off sticking with stored procs?

    Thanks

    Jon

  • I would try to create one package in SSIS and see how long it takes to write one component. If the time to develop an SSIS package is excessive and performance isn't a factor, then you'll probably be better off sticking with the Stored Procs for the short-term. I'd probably go with SSIS overall because it gives you a lot of other controls and workflows that would be tedious to code in TSQL.

    -Pete

    I'd be interested to see how you solve point number 4 in SSIS, or T-SQL for that matter, if you mean that the package will run continuously, polling the location for the file. If you mean that on invocation of the package, you check for the file's existence, then that's a simpler process.

    I've just finished developing (moving into test now, so things may change 🙂 ) something similar, so let me share my experiences. I should add that these are my first ever 'real' SSIS packages, and that I've only been using SSIS for about 3 weeks.

    First, to describe what I need to do: I have 3 Excel files, with completely different formats, that need to be imported into 3 separate tables, where another (external) process will use the data at a later point in time.

    I created 4 packages. 3 of them are very similar, and have the following structure (the reason I've detailed it is that I suspect you'll have several of the same steps):

    --------- Excel file import process ---------

    1. .NET script to check for the existence of the file.

    If found, set the output variable 'FileFound' = 'Found' (Dts.Variables("FileFound").Value = "Found").

    If not found, set Dts.TaskResult = Dts.Results.Failure, and set FileFound to 'NotFound' (Dts.Variables("FileFound").Value = "NotFound").

    2. On success, evaluate FileFound; if FileFound == 'Found', invoke a SQL script to clean out the staging table (truncate it). The reason I do a 'double check' here, by setting the precedence constraint to Expression and Constraint, is that I can, if required, change the Dts.TaskResult assignment if the business decides in future that the non-existence of the file does not warrant a package failure.

    3. On success, I invoke a Data Flow task that manages the import of the specific Excel file. The data flow components are:

    3.1 Excel file source

    3.2 Conditional split to evaluate the data from the Excel file - I send problem data (incomplete, wrong data types etc.) to an Error table.

    3.3 Data conversion - convert the 'good data' to the specific data types of the target columns in the target table.

    3.4 OLE DB destination - connected to the specific table. I am using fast load to load the data. (For comparison, a rough T-SQL equivalent of steps 3.2 - 3.4 is sketched just after this outline.)

    4. On success of the data flow, I archive the imported excel file to an archive location.

    ---------------
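
    Since the original question is SSIS vs. stored procs, here is roughly what the SP-style equivalent of steps 3.2 - 3.4 might look like (all object names are made up, the staging columns are assumed to be varchar, and the validation rules are just placeholders):

        -- Hypothetical SP-style equivalent of steps 3.2 - 3.4.
        -- 3.2: send problem rows (incomplete, wrong data types etc.) to an error table.
        INSERT INTO dbo.ImportError (Account, TradeDate, Amount)
        SELECT Account, TradeDate, Amount
        FROM   dbo.Staging_Excel
        WHERE  Account IS NULL
           OR  ISDATE(TradeDate) = 0
           OR  ISNUMERIC(Amount) = 0;

        -- 3.3 / 3.4: convert the good rows and load them into the target table.
        INSERT INTO dbo.TargetTable (Account, TradeDate, Amount)
        SELECT Account,
               CAST(TradeDate AS datetime),
               CAST(Amount AS decimal(18, 2))
        FROM   dbo.Staging_Excel
        WHERE  Account IS NOT NULL
          AND  ISDATE(TradeDate) = 1
          AND  ISNUMERIC(Amount) = 1;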

    Thereafter, I have a 4th package. This is the 'neat' stuff that is pretty simple with SSIS. I created an 'orchestration package' that manages all of the packages. All it does is invoke, using the Execute Package task, the 3 packages for the 3 files I want (i.e. I have 3 separate Execute Package tasks, with NO precedence constraints). The neat bit is that, since I don't set up any precedence constraints here, the 3 files are imported in parallel, while I can report on the success of the 3, and handle the invocation of them, as a single unit.

    The usage is: I have a SQL Server Agent job that runs on a schedule (hourly during business hours). Each hour it invokes the orchestration package, and then completes. That orchestration package invokes each of its child packages (3 at the moment, but I could quite easily add more later) in parallel. When they complete, they bubble up their success/failure to the orchestration package, which completes and reports its own success/failure as: all 3 succeed = success, any 1 fails = failure.

    Now, since each of the imports is also an SSIS package, the support team can investigate the specific reason for a package failing (for example, the file not being there), and invoke the specific package to process that file - a single call.

    The other things I've used are configuration files, which simplify changing the details for the packages between the different environments - no need to hardcode the SQL Servers being used, or the file locations or archive locations. Finally, I've set up logging, so that the SSIS packages log their activity to a SQL Server table.

    --- final thoughts ---

    All the above can be done in the SP approach. However, I would have to do some of the tasks, like checking file existence, in T-SQL (which would involve, afaik, use of the undocumented xp_fileexist proc, or even worse, xp_cmdshell). Doing the same in the .NET script is a single line: If File.Exists(CStr(Dts.Variables("User::ExcelFileLocation").Value)) Then
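
    For reference, the T-SQL version of that check would be something along these lines (xp_fileexist is undocumented, so treat this as a sketch; the file path is made up):

        -- Check for a file from T-SQL using the undocumented xp_fileexist proc.
        DECLARE @FileExists int;
        EXEC master.dbo.xp_fileexist 'C:\Imports\MyFile.xls', @FileExists OUTPUT;

        IF @FileExists = 1
            PRINT 'Found';
        ELSE
            PRINT 'NotFound';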

    The logging is pretty straightforward, and powerful, in SSIS packages. You have 5 log providers to choose from, and you can use more than 1 of them. They are, fyi: Text File, SQL Server Profiler, SQL Server, Windows Event Log, and XML File. Again, you could probably build all of that yourself, but not without quite a bit of time spent, whereas the setup in SSIS is probably 5 minutes' work, or less.
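
    If you pick the SQL Server provider, the entries land in a table (dbo.sysdtslog90 on 2005) in whichever database the logging connection points at, so checking what a run did is just a query along these lines:

        -- Example query against the SSIS 2005 log table written by the
        -- SQL Server log provider; adjust to wherever your logging connection points.
        SELECT   starttime, source, event, message
        FROM     dbo.sysdtslog90
        WHERE    event IN ('PackageStart', 'PackageEnd', 'OnWarning', 'OnError')
        ORDER BY starttime DESC;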

    The configuration files mean you don't have to hardcode anything. If you're using T-SQL SPs, you don't have to hardcode either, although at some point the code that calls the SP must give it the file locations, so you are going to either hardcode them in the calling process, or have that process get those details from a GUI, a config file, or a DB table. It seems like it may be less work using SSIS with configuration files, although I hasten to add this is the first time I'm using them, so it may be more work than I currently expect.

    My 3 packages have individual run times of between 4 and 5 seconds, yet my orchestration package runs all 3 in about 5.5 seconds. Setting it up to run them in parallel was 1 minute's work.

    Finally - I may find that testing turns up problems with my approach, but I thought I'd share it all to help (and possibly other people could comment on/improve what I've given, and I can learn, and improve mine).

    HTH - btw, apologies for any typos

  • I'd probably have used the For Each File process to look for files in the folder and process anything that matched my criteria with a final task that moved the file from the source folder to a Processed/Completed folder. That would eliminate the need to script that piece and should be a little easier.

    I agree on the regular schedule for the job - maybe every 5-15 minutes or less often depending on how critical the file is. A lot of jobs I had like this ran every 15 minutes or every hour.

    Looks like you have a pretty good setup there. The only gotcha I ran into recently with Excel was when someone entered more than 255 characters into a field that had never had that many characters before. That changed the "type" Excel reported to nText and broke all sorts of things. 😛

    -Pete

  • Hmm - thanks for the response!

    I'd probably have used the For Each File process to look for files in the folder and process anything that matched my criteria with a final task that moved the file from the source folder to a Processed/Completed folder.

    Sure, but how do you track it when you've only found 2 of the 3 files? The ForEach would pick up both, move them, then finish? Also, that would mean I'd need to move that task back out of the individual packages and into the orchestration package, which would mean:

    (a) the file handling would happen sequentially rather than in parallel

    (b) the underlying packages would not be standalone components that could be invoked on their own, since they wouldn't do the file-exists check/move to the processing location. Granted, the standalone invocation in my scenario should be for support people, and it could be a very simple manual action for them to move the file themselves, but people are fallible (none more so than myself!), and they might forget.

    Thanks for the info re: the Excel files. I wonder if I'll be safe from that, since my Excel source is specified with a data access mode of SQL Command, and I ensure that my data is within the right bounds there. I would probably have done all my data conversion there too, except for the fact that it has a limited set of SQL functionality (like no ability to cast dates etc.).

    Actually, I was thinking more of making it part of the package - as a component for each file, with a filter set up to match whatever you need to process - e.g. FILEA*.XLS for one file, FILEB*.XLS for the next, etc. Each of those would trigger the appropriate workflow when found. That way, if one of those files did not exist, the package could pick up and run with what it had, then process the remaining file when it arrived.

    Of course, if they need to work as a set, that's a completely different issue and I think you are stuck with some form of scripting at that point. (Not completely sure on that, though.)

    As for Excel, I was forced to use the IMEX=1 parameter in the connection string because I get a lot of Zip Code data in ZIP, ZIP+4, and International formats. Unfortunately, Excel tends to pick up the first small number, translate that to a FLOAT, then NULL out anything that can't convert to a float. Ugh. The text thing just bothered me because I had to re-work almost everything to handle that one field. I think I eventually just ran it through in DTS 2000 - no problems and it went in really quickly. I'm just disappointed that it was such a hassle to do in SSIS that I had to revert after wasting close to an hour trying to get it to work.

    -Pete

    OK, so you're proposing replacing the .NET script component in each of the 'child packages' with the ForEach file check? I'm not sure that would move me forward, since now I'd have to hardcode the filter into the ForEach, whereas currently I use a variable and the configuration file to specify what the file name and location are.

    They do need to work as a set (currently).

    Hmm - I need to check up on that IMEX=1 setting and see what that does. Currently I also have data that starts with a numeric (e.g. 100-1000) coming in, but using the SQL Command seems to be handling it nicely.

    Having to go and do it through DTS - ouch. Not a great advert for the new product when you have to revert to the previous one to get something done. Still, it is true (imho) that the path from DTS to SSIS seems to have a few abyss-like crevices and Everest-like peaks along the way (in terms of learning curves, that is).

    CiaO

    I am making some progress with implementing what I described in the first post. I'm stuck because of a bug with the Bulk Insert task not picking up variables at runtime. I have used a script task to dynamically set the source text file in the connection string; however, I'm struggling to set a dynamic destination table in the Bulk Insert task.

    The table name variable is set using a substring of the text file name. The variable value is not picked up in the Bulk Insert expression for DestinationTableName. How do you set this property via a script task?

    Thanks

    Jon

  • You don't for most things in SSIS. You use an expression and set the "Delay Validation" property to "True" if the table won't exist until the package creates it. There should be something similar for the Bulk Insert task that lets you specify the table name.

    If that's not possible, you may want to define a standard table name for the package, then add a rename command (which can be built off of an expression) as an Exec SQL Task. I ended up having to do that for a bunch of tables in one of my packages because the name could change dynamically.
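
    The rename step itself is just an Execute SQL Task running something like this (table names made up), with the statement built from an expression:

        -- Hypothetical rename step: load into a fixed staging name,
        -- then rename it to whatever the package worked out at runtime.
        EXEC sp_rename 'dbo.StandardImportTable', 'TestImport1';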

    -Pete

  • Pete - you're a genius!

    All I needed to do was set the "Delay Validation" property to "True".

    I can now do the following:

    1. I have 4 files in a folder named TestImport1.txt, TestImport2.txt, etc.

    2. I have 4 tables named TestImport1, TestImport2, etc.

    3. I can now run a package which will dynamically bulk insert each file into its appropriate table.

    The package uses the filename to determine which table the data should be inserted into.

    The package is made up of:

    1. A foreach loop

    2. A script task to set the connection string for each file

    3. A script task to set the TableName variable based on part of the filename

    4. A Bulk Insert task, which takes its source connection and destination table from variables.

    The next thing to do is for me to dynamically set the format file to use.

    I'm away for 2 weeks but will let you know how I get on.

    Thanks

    Jon

  • Yeah, that "Delay Validation" setting isn't played up nearly as much as it should be from what I've seen. Most of the docs I've looked at hint that it's good for offline things, but most of my work consists of scenarios where I'm creating the tables as one of the first steps in the package or otherwise changing things based on expressions. It's extremely useful for those situations. 🙂

    Have fun getting it to work. It's definitely a learning experience.

    -Pete
