Using the Konesans File Watcher Task in SSIS to Process Data Files

  • Orlando Colamatteo

    SSC Guru

    Points: 182269

    Comments posted to this topic are about the item Using the Konesans File Watcher Task in SSIS to Process Data Files

    __________________________________________________________________________________________________
    There are no special teachers of virtue, because virtue is taught by the whole community. --Plato

  • Scott D. Jacobson

    SSCertifiable

    Points: 6039

    This looks neat but I have 1 question. Please pardon my ignorance as I've never used a 3rd party Task object in SSIS. Say I develop a package using this on my machine or in a test environment. When I go to deploy to a production server, that machine must also have this installed correct? I would think so but I could also see that once the package is built, maybe the code is already in the package and the component itself is not necessary?

    If I had to rephrase, is the component merely required to build the package or is it also required for execution?

  • dcdanoland

    SSChasing Mays

    Points: 647

    Thanks for the article. I've been intrigued by this component, but have shied away from it because I have qualms about the architecture. I worry about having an SSIS job run continuously. I just don't think SSIS was designed to run that way. I doubt that Microsoft tests this capability. What happens if an error occurs and the package ends prematurely? I think the package would have to be manually restarted. As much as I like doing things in SSIS, I think we'll end up writing a Windows service in .NET because it will give us more control in a more standardized architecture. Unfortunately, writing a service will be a lot more work. I guess I could be persuaded to use the FileWatcher task within SSIS, but I would need real-world results and a thorough discussion on the architectural pros and cons of using the task.

  • James Stephens

    SSC-Addicted

    Points: 497

    I agree the Konesans may be easier to implement, but I just finished a rebuild of a set of ETL tasks in which I gutted Konesans and went to WMI.

    Not sure if it was a version issue, but Konesans introduced file-locking issues, and a several other 'non-specific' odditites, errors, and instability that started as soon as my team had implemented the Konesans tool into the package. All issuses disappeared now that WMI is used. Coincidence? Maybe. But I've got heavy production and I'm running smooth now and not sweating it out and getting phone calls at all hours of the evening with complaints that the job didn't load data.

    This is just one anecdotal bit of evidence, but I'm interested to see of others have come across strange locking and resource issues with Konesans.

    --Jim

  • Orlando Colamatteo

    SSC Guru

    Points: 182269

    Scott D. Jacobson (8/13/2012)


    This looks neat but I have 1 question. Please pardon my ignorance as I've never used a 3rd party Task object in SSIS. Say I develop a package using this on my machine or in a test environment. When I go to deploy to a production server, that machine must also have this installed correct? I would think so but I could also see that once the package is built, maybe the code is already in the package and the component itself is not necessary?

    If I had to rephrase, is the component merely required to build the package or is it also required for execution?

    Unfortunately the code required to implement the File Watcher Task functionality is not embedded into the SSIS package itself when it is built and deployed. Only a reference and the metadata required to instruct the external component is added to the SSIS package. You must run the File Watcher Task installer on all development workstations and servers where you want to run a package that leverages the component. This is a downfall to using thrid-party SSIS components, and one of the major barriers for adoption in my opinion (there are others, closed-source being the second one that comes to mind in the case of the File Watcher Task).

    In my previous article I showed how to use the WMI Event Watcher Task included in SSIS to accomplish a similar result. The Konesans File Watcher Task offers some advantages to using the WMI Event Watcher when in comes to ease of use, but it has some disadvantages too. Thanks for having a look at the article.

    __________________________________________________________________________________________________
    There are no special teachers of virtue, because virtue is taught by the whole community. --Plato

  • Orlando Colamatteo

    SSC Guru

    Points: 182269

    dcdanoland (8/13/2012)


    Thanks for the article. I've been intrigued by this component, but have shied away from it because I have qualms about the architecture. I worry about having an SSIS job run continuously. I just don't think SSIS was designed to run that way. I doubt that Microsoft tests this capability.

    I must respectfully disagree with your assertion. I can only offer you indirect evidence since I do not work for Microsoft but the example provided in the WMI Event Watcher Task documentation implies they expect an SSIS package might be run continuously:

    From WMI Event Watcher Task (SQL Server 2012):

    The following query watches for notification that the CPU use is more than 40 percent.

    [font="Courier New"]SELECT * from __InstanceModificationEvent WITHIN 2 WHERE TargetInstance ISA 'Win32_Processor' and TargetInstance.LoadPercentage > 40[/font]

    The following query watches for notification that a file has been added to a folder.

    [font="Courier New"]SELECT * FROM __InstanceCreationEvent WITHIN 10 WHERE TargetInstance ISA "CIM_DirectoryContainsFile" and TargetInstance.GroupComponent= "Win32_Directory.Name=\"c:\\\\WMIFileWatcher\""[/font]

    dcdanoland (8/13/2012)


    What happens if an error occurs and the package ends prematurely? I think the package would have to be manually restarted.

    If the package ends prematurely then you would need to start the package again. Ideally a package will never end prematurely, however if it does it either means the exception handling needs improvement or manual intervention is intentionally provoked, or in some cases both.

    dcdanoland (8/13/2012)


    As much as I like doing things in SSIS, I think we'll end up writing a Windows service in .NET because it will give us more control in a more standardized architecture. Unfortunately, writing a service will be a lot more work. I guess I could be persuaded to use the FileWatcher task within SSIS, but I would need real-world results and a thorough discussion on the architectural pros and cons of using the task.

    I am of the opinion that there is nothing inherently wrong with running SSIS packages continuously. The example I showed above taken from the Microsoft documentation ought to be sufficient to show that it was one intended use of SSIS, however feel free to disagree.

    If I break it down SSIS is nothing more than a set of XML instructions (open a dtsx file in Notepad). The runtime (DTExec.exe) simply interprets those instructions. So, when I think about running SSIS continuously I'm really thinking about running an instance of DTExec.exe continuously and DTExec.exe is an executable just like any other executable. It is subject to memory leaks, runaway threads, unhandled exceptions, race conditions, etc. however I am not too concerned about it given the level of effort Microsoft has put into developing and testing it. Couple that with the number of users in the community helping to solidify its usage patterns and report bugs. I am aware of some memory leak issues in SSIS 2005 with the Foreach Loop Container as well as the Execute Package Task but those were quickly corrected. There are many people using those tasks (and the other built-in tasks) therefore the problem surfaced quickly and a fix was produced by Microsoft.

    In the case of the Konesans File Watcher Task, while it is not produced by Microsoft its central functionality is built around a class in the .NET Framework, FileSystemWatcher Class, which brings it into the same domain as other built-in SSIS tasks in terms of being backed by Microsoft.

    __________________________________________________________________________________________________
    There are no special teachers of virtue, because virtue is taught by the whole community. --Plato

  • Orlando Colamatteo

    SSC Guru

    Points: 182269

    James Stephens (8/13/2012)


    I agree the Konesans may be easier to implement, but I just finished a rebuild of a set of ETL tasks in which I gutted Konesans and went to WMI.

    Not sure if it was a version issue, but Konesans introduced file-locking issues, and a several other 'non-specific' odditites, errors, and instability that started as soon as my team had implemented the Konesans tool into the package. All issuses disappeared now that WMI is used. Coincidence? Maybe. But I've got heavy production and I'm running smooth now and not sweating it out and getting phone calls at all hours of the evening with complaints that the job didn't load data.

    This is just one anecdotal bit of evidence, but I'm interested to see of others have come across strange locking and resource issues with Konesans.

    --Jim

    I personally have not had file locking issues with Konesans. It may depend on which events your predecessors keyed off of when using the Konesans File Watcher Task. At any rate it sounds like you have reached a more stable solution going straight to WMI.

    My previous article Using the WMI Event Watcher Task in SSIS to Process Data Files[/url] may be of more interest to you. The Konesans File Watcher Task and the WMI Event Watcher Task implement similar functionality, but differently internally and under certain stressors unique to each production environment one may be more suitable than the other. Thanks for sharing your experience and for having a look at the article(s).

    __________________________________________________________________________________________________
    There are no special teachers of virtue, because virtue is taught by the whole community. --Plato

  • dcdanoland

    SSChasing Mays

    Points: 647

    You make very good points regarding Microsoft testing and the WMI Event Watcher Task. I guess my qualm is that it seems to be an afterthough by Microsoft. All the other tasks are designed to not run continuously, and SSIS's architecture seems designed for such tasks. It doesn't seem as if SQL's architecture is really designed for continuously running tasks. Otherwise they would provide more support for that would be somewhat like SQL Agent.

  • Orlando Colamatteo

    SSC Guru

    Points: 182269

    dcdanoland (8/13/2012)


    You make very good points regarding Microsoft testing and the WMI Event Watcher Task. I guess my qualm is that it seems to be an afterthough by Microsoft. All the other tasks are designed to not run continuously, and SSIS's architecture seems designed for such tasks. It doesn't seem as if SQL's architecture is really designed for continuously running tasks. Otherwise they would provide more support for that would be somewhat like SQL Agent.

    I agree, it can seem that way. Running packages continuously is typically not the first usage pattern that comes to mind when designing a process. In my experience the goto for batch processing and ETL design is usually some variation of:

    1. start a package that checks for the existence of a file

    2a. if the file is there load it and exit

    2b. if the file is not there exit

    Then a SQL Agent job will be setup to run this process once per minute, or once every few minutes, during a time period when the file is expected to arrive. The pattern is intuitive, is tried and true, and does work fine in most cases but adds a lot of transient overhead to the server in terms of CPU, memory allocations/deallocations as well as msdb activity as is detailed in the article. Some doubts about this same topic were (hopefully) laid to rest in the comments of my WMI Event Watcher Task article as well if you have a moment to read through those.

    The precedence for running an Agent job continuously was set long ago as well, by Microsoft. In the SQL Server Replication subsystem the Log Reader and Distribution Agents are always running per a SQL Agent job that is created when replication is configured. Again, it may not be intuitive to have a job that runs an SSIS package continuously but in my opinion there is nothing inherently wrong with the design pattern and in some cases it may be the most suitable option.

    __________________________________________________________________________________________________
    There are no special teachers of virtue, because virtue is taught by the whole community. --Plato

  • dcdanoland

    SSChasing Mays

    Points: 647

    Thanks for referencing the discussion on the WMI article. It was most helpful. Your description of the standard batch processing and ETL design matches my experience.

    What sort of load have you used on the WMI file watcher package? We typically have about 300-1,000 files in an 8 hour window.

    Multi-threading always makes me nervous, particularly in an SSIS script task. Granted, all you do is "Thread.Sleep(5000)", but does that allow other tasks and processes to take over and possibly inflict some undesirable side-effects? In short, is it thread-safe?

  • Orlando Colamatteo

    SSC Guru

    Points: 182269

    dcdanoland (8/13/2012)


    Thanks for referencing the discussion on the WMI article. It was most helpful. Your description of the standard batch processing and ETL design matches my experience.

    What sort of load have you used on the WMI file watcher package?

    I use the pattern as a replacement for the pattern where a package starts, looks for a file, and processes it if it exists or exits if it does not. If I break it down it is the exact same pattern implemented using the aforementioned components with the difference lying in the division of responsibility. WMI and the File Watcher class are just different polling mechanisms. In the WMI query the WITHIN clause dictates how often the file-creation-event is polled. WMI is lightweight in and of itself, and considered even more lightweight when compared to loading an SSIS package into memory, validating it and starting to execute it each time we want to check if a file exists. In most cases 5 seconds is far too often to load an SSIS package to check if a file exists. The overhead of executing an Agent job (think of all the mechanics behind the scenes with respect to msdb job history), loading the package into memory, validating the package, then executing the package...most of the time just those steps take longer than 5 seconds. With a package already in memory and WMI doing the checking on our behalf this is barely a blip (effectively 0) on the CPU.

    We typically have about 300-1,000 files in an 8 hour window.

    I do mention in the comments of the WMI article that on my machine the sample package occupied ~15 MB of memory to keep in memory and used effectively 0% CPU while WMI was polling for a file's arrival. If you do the math and use 1000 SSIS packages to watch for 1000 files at the same time you would need ~15000 MB (15 GB) of RAM to support the scenario where all packages needed to be in memory (required?) at one time and all were as simple as the sample package (highly unlikely).

    With that many files coming in per night this pattern may not work for you depending on how many you need to watch for concurrently. For this case I would look into further dividing responsibilities into a multi-threaded watcher that calls the appropriate SSIS package based on the file that arrives. I would look at implementing the watcher as a .NET Windows Service or Console Application that ran continuously. In whichever form, WMI or the FileWatcher class would likely be an integral part of my implementation. One of the comments on the WMI article suggested this could also be implemented using the WMI functionality built into SQL Agent which I had not considered but sounds like a good fit, and something worth exploring in environments with larger workloads such as yours (SSIS would still be on the hook for checking exclusive file access though). Whichever implementation direction you went, if you enlisted WMI (in .NET watcher, SSIS or SQL Agent) or the Filewatcher class (in .NET watcher or the Konesans File Watcher Task) you could stand to significantly cut down on your job executions and msdb activity as well as the simplify your standard SSIS package template.

    Multi-threading always makes me nervous, particularly in an SSIS script task. Granted, all you do is "Thread.Sleep(5000)", but does that allow other tasks and processes to take over and possibly inflict some undesirable side-effects? In short, is it thread-safe?

    It is thread safe. The SSIS package is running as part of an instance of DTExec.exe. Nothing can get into the memory space of that instance of DTExec.exe. While the threading library is used, the script task in the sample SSIS package is actually not multi-threaded. The Thread.Sleep used in the WMI Event Watcher example script is only a way to make the script wait until it can check for exclusive access to the file once again, and only if it happens to fail on the first attempt (i.e. the data file has arrived but is still being written to).

    The File Watcher Task handles this scenario for us internally and you should be doing this step in your own SSIS packages too, unless you can guarantee that when your package wakes up and sees a file that the data file provider is done transferring it to the drop directory. What I see implemented often are "trigger files", i.e. a small or empty file that accompanies the data file, that when present represent the idea that the data file is done being transferred and is available to be processed. The agreement is usually that the sender will write the data file first, then the trigger file. The receiver will always look for the trigger file first, then the data file only if the trigger file exists. This method obviates the need for the receiver to check for exclusive file access before beginning processing on the data file but not all senders are able to offer this nicety.

    __________________________________________________________________________________________________
    There are no special teachers of virtue, because virtue is taught by the whole community. --Plato

  • dcdanoland

    SSChasing Mays

    Points: 647

    Thank you so much for your detailed comments. They have been most helpful. I hadn't thought of dividing the task up into file watching and file processing, but that might be the thing to do. Part of me thinks that part of it is already in .NET land, then I should keep all of it there. That way I don't bother with SSIS and SQL Agent and have maximum efficiency.

    The other part remembers that I'm going to have to sift through crappy C++ legacy code to figure out how process the files. What about 5-10 years from now when the poor sap programmer who will be maintaining my code? He'll be a lot happier if part of my solution is in SSIS than if it's all in VB or C#.

  • Orlando Colamatteo

    SSC Guru

    Points: 182269

    dcdanoland (8/13/2012)


    Thank you so much for your detailed comments. They have been most helpful. I hadn't thought of dividing the task up into file watching and file processing, but that might be the thing to do. Part of me thinks that part of it is already in .NET land, then I should keep all of it there. That way I don't bother with SSIS and SQL Agent and have maximum efficiency.

    The other part remembers that I'm going to have to sift through crappy C++ legacy code to figure out how process the files. What about 5-10 years from now when the poor sap programmer who will be maintaining my code? He'll be a lot happier if part of my solution is in SSIS than if it's all in VB or C#.

    My pleasure. I have enjoyed our conversation very much. Thank you for your responses.

    It is true that we should consider those that may have to work after us. I think we have all had to work on code at one time or another where it was clear that the previous developer(s) were not as considerate.

    For any problem at hand I say use the best tool for the job. Sometimes that means using different technologies to support different areas of a workflow or system. Implementing a watcher that watches for new files and spawns a new thread to call an SSIS package to process those data files seems much simpler to implement as a .NET Windows Service or Console App than as an SSIS package. Thinking aloud, if one were to try to do this in SSIS chances are it would need to be partially done with a Script Task anyway, or if not a Script Task then done by brute force with a WMI Event Watcher for each file that might arrive which would need to be modified each time a new file needed to be watched for, or one were decommissioned.

    For actually processing a file within a thread spawned by the watcher SSIS is clearly a superior file-processing tool and is a good choice in my opinion. Both the watcher and the SSIS package would be implemented using a managed language, and speaking for myself that is a quality of any technology choice where I place a very high value.

    The good news is that if the watcher was built as a .NET Service or App, if it is built in such a way that it can be told to watch for new files and execute a specific package based on that file per a database table entry then the code should not need to change often and be simple enough for most any capable SSIS developer to sift through and eventually understand.

    __________________________________________________________________________________________________
    There are no special teachers of virtue, because virtue is taught by the whole community. --Plato

  • tonyclifton77

    SSC Veteran

    Points: 238

    Hello,

    thanks for this great and detailed article. I've just started to get my head around SSIS and since I have a quite similar task to deal with, I'll take the liberty to ask for some advice in this thread. I hope it's not too off-topic.

    In my scenario a bunch of excel files could be placed into the "ExcelDrop"-folder at the "same" time. So I figured I need a for-each loop. However, I will have to split that one package into one queue and job package(s).

    The queue will monitor the folder, if an excel file is placed into the folder I then need to compare the filename with the filename in a database table and look for the value of a "process"-column in the same table. Depending on the value of this process-column an according job-package should be triggered. So if the value of that column is "ABC", job-package "abc" should be executed. In total I will have six of these job-"trigger-dependencies". I'm thinking about switch/case statement in script-task but you'll probably have a better idea.

    For the queue I was thinking about this structure:

    1) File-Watcher

    2) For-Each Loop

    3) Check file name

    4) trigger child package based on file name

    5) depending on the return code move the file to a success or error folder and (maybe) write some logs to a file or a database.

    - do I need the FileWatcher inside the for-each loop?

    I'm not clear how to set up the job-package in terms of processing.

    - I believe I need a dataflow task (Excel source, DB destination) in here and not in the queue-package, right?

    - how would I pass the variables?

    I would really appreciate some help on how to structure the queue-/job package setup.

    General Questions:

    - Can all of this be done with the SSIS-toolbox components or do I need script tasks for this?

    - Instead of having to run the package all the time could I also get it to work with the sql server alert, when I need to monitor a unc-server path? I believe I could set the server path to a shared drive and monitor that drive with its folder. If I can't, how do I set-up the package to run 24/7 without having to use Visual Studio?

    - in this article you only process .xlsx files but I would also need it work for .xls and .xlsm files. Since this folder will only receive Excel-files could I just set it to the Filename-prefix and the extension like this * ?

    that's it for now 🙂

    I'd be happy to receive some tips/instructions.

    Thanks for reading.

  • Orlando Colamatteo

    SSC Guru

    Points: 182269

    Hello Tony, thanks for reading my article. I'll say this, in just starting out with SSIS you have a lot of ramping up to do. You'll need to cover not only a lot of the fundamental concepts but also some intermediate and advanced concepts in order to accomplish your task. I'll do my best to pick out some items to comment on that I think will move you forward. Feel free to follow-up with more questions.

    I'm thinking about switch/case statement in script-task but you'll probably have a better idea.

    I doubt you'll need a Script Task for this. Basically the Execute Package Task can be dynamically setup to execute the proper package per what is in your control table. Brush up on SSIS Expressions and Variables.

    For the queue I was thinking about this structure:

    1) File-Watcher

    2) For-Each Loop

    3) Check file name

    4) trigger child package based on file name

    5) depending on the return code move the file to a success or error folder and (maybe) write some logs to a file or a database.

    - do I need the FileWatcher inside the for-each loop?

    See the image I posted below. I think a File Watcher Task would be a good thing to have in an outer For Loop Container where the condition is always true (e.g. 1=1). In this mode the package would run continuously. Once the File Watcher Task sees a file, then you would drop into a For Each Loop Container to process the files in the directory.

    I'm not clear how to set up the job-package in terms of processing.

    - I believe I need a dataflow task (Excel source, DB destination) in here and not in the queue-package, right?

    - how would I pass the variables?

    This question is more about the child packages of which it sounds like you will need 6 different ones. The image below is for the outer package.

    There are different ways in which you can share variables in a parent package with a child package. See section Passing Values to Child Packages

    in the Execute Package Task article.

    I would really appreciate some help on how to structure the queue-/job package setup.

    General Questions:

    - Can all of this be done with the SSIS-toolbox components or do I need script tasks for this?

    The Konesans File Watcher Task is a third-party component. You could do all of this with the WMI Event Watcher Task, but the File Watcher Task makes it a bit easier. I also wrote an article on the WMI Event Watcher Task[/url] if you have a requirement to only use built-in tools.

    - in this article you only process .xlsx files but I would also need it work for .xls and .xlsm files. Since this folder will only receive Excel-files could I just set it to the Filename-prefix and the extension like this * ?

    If I am not mistaken, if you simply use a mask of *.xls it will act as *.xls*. It works that way when searching with Windows Explorer but may not with SSIS so you should test it to verify. Worst case scenario you can look for *.* and then ignore files that do not have a row in the control table using a precedence constraint on the Execute SQL Task. Since no rows will be found you will not populate any variables with the output of the query.


    I figured a picture would be worth a thousand words in this scenario. See if something like this might work for you:

    __________________________________________________________________________________________________
    There are no special teachers of virtue, because virtue is taught by the whole community. --Plato

Viewing 15 posts - 1 through 15 (of 30 total)

You must be logged in to reply to this topic. Login to reply