It is a commonly held conception that you can't learn everything from a book and this, in my experience, is particularly true in technology. You can sit down and study a new technology, but it isn't until you start using it in anger that you find out:
- What is it capable of?
- What are the best ways of going about things?
- What are the shortcuts that you can employ?
- What works and what does not?
- How does it really hang together?
I have recently completed a nine-month stint of implementing SQL Server Integration Services (SSIS) in a commercial environment and this has given me a chance to answer some of these questions. I want to relay some of those answers back to you and give you some insight into my experiences of implementing an SSIS solution. As such, this article assumes familiarity with SSIS.
That's enough of an introduction. Let's not waste any more time and instead let's deep-dive into a real world SSIS implementation. I hope you find it useful. I have broken this document into headed sections, one for each nugget of information that I want to give you.
Replicate Variables And Reduce Variable Scope
The potential for resource locking is a problem in SSIS, just as in relational DBMSs. SSIS gives us ample opportunity to run tasks in parallel with each other and if parallel operations within a package are competing for resources then in extreme cases you may experience locking problems.
Quite often those resources are within the SSIS package itself – most commonly variables. Variables are often used in different places concurrently – sometimes without you even knowing it. For example, event handlers can make heavy use of variables, and you have limited control over how and when event handlers fire.
In extreme cases you run the risk of getting errors such as:
- Reading the variable "System::PackageName" failed with error code 0xC0014052
- The expression for variable <var_name> failed evaluation. There was an error in the expression.
- A deadlock was detected while trying to lock variable "<variable_list>" for read access. A lock could not be acquired after 16 attempts and timed out
These are symptoms of being unable to put a lock on a variable.
There are various common-sense things you can do to mitigate the risk of this occurring:
- Don't give your variables global scope. Only scope them to the containers in which they are used.
- If you are using a variable in different places to do the same job, consider using multiple identically named variables instead, one for each place where it is used, and scoped appropriately. We found that doing this resulted in a significant decrease in the number of variable locks that we were getting.
- In script tasks, don't lock variables using the ReadOnlyVariables and ReadWriteVariables properties. Instead, use the Dts.VariableDispenser object to manually lock and unlock them as and when required. This causes each variable to be locked for a shorter period.
- Similarly, for script components use Me.VariableDispenser instead of ReadOnlyVariables and ReadWriteVariables.
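As a rough sketch of the manual-locking approach, a VB.NET Script Task might do something like the following. The variable name User::RowCount is purely illustrative, not taken from any real package:

```vb
' Sketch only: assumes a package variable User::RowCount exists.
' Rather than listing the variable in ReadWriteVariables (which holds
' the lock for the lifetime of the task), lock it just for the update.
Dim vars As Variables
Dts.VariableDispenser.LockOneForWrite("User::RowCount", vars)
Try
    vars("User::RowCount").Value = CInt(vars("User::RowCount").Value) + 1
Finally
    vars.Unlock()   ' release the lock as soon as the work is done
End Try
```

The key point is that the lock is acquired immediately before the variable is touched and released immediately afterwards, rather than being held for the whole task execution.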
The problem is more likely to occur when running tasks in parallel so if you are experiencing these problems you may want to consider decreasing the amount of parallel activity – as long as it does not impact your business requirement of course.
There were also fixes that went into the post-SP1 hotfix pack to mitigate the risk of variable locking so I would urge you to install that as soon as possible. The fix is that system variables (which are immutable) now get locked for read rather than read-write. Note that this does NOT mean that the problem will never occur.
Consider Custom Logging
SSIS has a first-class configurable logging infrastructure and even provides the capability to develop your own logging mechanisms through a custom log provider or by the use of event handlers.
I would urge you to consider implementing custom logging as it provides a number of opportunities that aren't possible with the out-of-the-box log providers. For example:
- Log the Major, Minor and Build version numbers of your packages. These are available as package-scoped system variables and the Build version number is incremented each time you save the package. This way you will always know which version of a package has been deployed to your various dev, system test, QA and live environments. Used in tandem with a revision control system such as Visual SourceSafe or Visual Studio Team Foundation Server, this enables you to easily roll back to previously installed components.
- Log the number of rows processed into a table all of its own. This makes it easy to see how data volumes increase or decrease over time. You should have separate fields for rows inserted, rows deleted and rows updated. At Conchango we use this technique extensively on all of our data integration projects.
- If you have a parent-child-grandchild package structure you can log what we call the package stack (i.e. a comma delimited list of the packages that have been called) so that you always know how and why a package was called.
- Log an explicitly created LoadID, which should change for every execution. The LoadID can be inserted into every table that is populated as part of your ETL process thus making it easy to tie processed rows back to the load in which they were processed. This is an invaluable step for auditing.
- Log the execution duration of each container in your package(s).
This is not a definitive list. There are a multitude of things that can be logged in addition to those listed here.
One more point about this: if you DO implement custom logging it is still a good idea to adopt what an old boss of mine used to call a "belt-and-braces" approach. In other words, consider using the out-of-the-box log providers alongside your own custom logging.
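To make the row-count idea above concrete, here is a minimal sketch of what such an audit table might look like. The table and column names are my own invention for illustration, not a prescribed design:

```sql
-- Illustrative only: one row per task execution, tied to a load via LoadID.
CREATE TABLE dbo.AuditRowCounts (
    LoadID        int           NOT NULL,  -- ties counts back to a specific load
    PackageName   nvarchar(128) NOT NULL,
    TaskName      nvarchar(128) NOT NULL,
    RowsInserted  int           NOT NULL DEFAULT 0,
    RowsUpdated   int           NOT NULL DEFAULT 0,
    RowsDeleted   int           NOT NULL DEFAULT 0,
    ExecutionTime datetime      NOT NULL DEFAULT GETDATE()
);
```

With a table like this, trending data volumes over time is a simple GROUP BY on LoadID or ExecutionTime.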
Do Not Put Small Discrete Operations Into A Separate Package
The lowest executable unit of reuse in SSIS is currently a package. It is not possible, for example, to reuse a single data-flow in different places and have a change made in one place be reflected in all the places it is used.
Reuse of packages is useful, but you should guard against modularising small, frequently executed pieces of functionality into separate packages, because there comes a point at which the overhead of spinning up a package many times outweighs the advantage of reuse.
Annotate, Annotate, Annotate
SSIS packages are, to some extent, self-documenting. It is possible to look at a data-flow or control-flow and instantly get some idea of what the package does – this isn't really possible by looking at, say, a list of stored procedures.
Having said that, there are things you can do to make it easier for another SSIS developer to inherit your package and change it:
- Always annotate your packages. Annotations are SSIS's equivalent of code comments. At a bare minimum drop an annotation onto your control-flow with a high level description of what the package does.
- Every object in an SSIS package has a property called Description. Use it. Fill it in. Descriptions are part of the metadata of the package and hence can be extracted by auto-documentation tools to provide a better illustration of what your package does.
Employ Naming Conventions
It is not possible to look at a log file created by a package and identify what each task or component actually does unless you have prior knowledge of the package, so it is useful to employ naming conventions that identify the type of each object.
For example, if a component called "Get rows" throws an error, you have to open up the package to see what that component does. If the name of that component were prefixed with a code indicating that it is an OLE DB Source adapter, then you instantly get an idea of what the component is doing and why it might have failed.
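As an illustration, a scheme along these lines makes log entries self-describing. The particular prefixes shown here are just one possible convention, not something mandated by SSIS:

```
SQL Truncate staging      <- Execute SQL Task
DFT Load customers        <- Data Flow Task
OLE_SRC Get rows          <- OLE DB Source adapter
OLE_DST Insert customers  <- OLE DB Destination adapter
SCR Calculate checksums   <- Script Component
```

Whatever prefixes you choose, the important thing is to agree them across the team and apply them consistently, so that a task name in a log file identifies the object type without opening the package.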
Data Viewers And Excel
Data viewers are a great debugging tool but they are fairly static. As a matter of course, the first thing I do whenever using a data viewer is to copy the data into Excel, where it can be manipulated much more easily. The standard grid data viewer includes a "Copy Data" button that copies the contents of the data viewer (i.e. the current pipeline buffer) to the clipboard.
Store All Expressions In Variables
Expressions are a great way of altering package behaviour at execution-time, but it is always helpful to see what an expression has been evaluated as, and that ability is not naturally inherent in SSIS.
There's a great way around this though. If you store your expression in a variable and simply reference that variable wherever the evaluated expression is required you will be able to view the evaluated expression in the Watch window of Visual Studio. There is also the added benefit that once an expression is in a variable it can be reused in multiple places.
This is a particularly useful technique if you are using a dynamic SQL statement in an Execute SQL Task or OLE DB Source adapter. Storing all your expressions in variables is a great habit to get into!
Here we show a very simple example of this. The two screenshots show the same expression being evaluated: once at design-time and once at execution-time.
In the second screenshot you can easily see what the expression evaluates to, which would not be possible without putting the expression into a variable.
This highlights the simple point that an expression can be evaluated differently at execution-time from how it is at design-time, so it pays to be able to see what the expression is being evaluated as.
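For instance, a dynamic SQL statement might live in a variable such as User::SQL_string with EvaluateAsExpression set to True and an Expression property along these lines (the table and column names are invented for illustration):

```
"SELECT * FROM dbo.Sales WHERE LoadDate <= '" + (DT_WSTR, 30) @[System::StartTime] + "'"
```

An Execute SQL Task or OLE DB Source adapter then references @[User::SQL_string] rather than holding the expression itself, and the fully evaluated statement is visible in the Watch window at execution-time.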
Use Package Templates
I can't really stress this one enough. Making use of package templates is a great way of maintaining consistency in your development practices and is something that ALL SSIS development teams should be using. One of the common uses is to ensure that you are always using the same logging and configuration set-up.
On my last project we also included an annotation in our template to ensure that the package developer was adding the high-level description of what the package was for, just as I have already suggested you should be doing.
If you want to know how to use templates, simply search for "SSIS template" on Google and you will find what you need to know.
DontSaveSensitive In Template
The default setting of the ProtectionLevel property of a new package is ProtectionLevel=EncryptSensitiveWithUserKey. If you use this setting then you will experience problems either when another developer tries to edit the package or when you attempt to execute the package via a different user (as is usually the case when executing via SQL Server Agent).
My tip here is to change the property setting to ProtectionLevel=DontSaveSensitive.
In fact, why not make it easy on yourself and put this setting into your package template.
I must caveat this by saying that if you do this then you will have to use configurations to store sensitive information. However, if you are following best practice you should be doing this anyway, so ProtectionLevel=EncryptSensitiveWithUserKey is not required.
Log OnPipelineRowsSent
OnPipelineRowsSent is a configurable option in whatever logging options you have chosen for your package. It's really useful for debugging your packages because it tells you exactly how many rows are being output from, and provided to, each component in your pipeline. This is invaluable information if you are not receiving the expected number of rows in a destination adapter.
So do yourself a favour, turn on logging of OnPipelineRowsSent.
Combining this with something I have already said – why not include OnPipelineRowsSent in the logging configuration of your template?
One Configuration Per Connection Manager
In the current version of SSIS a package will throw a warning if a configuration that it is using references an object that does not exist in the package:
- The package path referenced an object that cannot be found: "\Package.Variables[User::SQL_string].Properties[Value]". This occurs when an attempt is made to resolve a package path to an object that cannot be found.
This may change in the future but in the meantime you should mitigate the risk of this happening – you do not want warnings popping up here, there and everywhere, because it makes it harder to identify the unexpected warnings that you DO want to know about.
A good policy to employ here is to have only one configuration setting per configuration file. This means that when building a new package you can cherry-pick whichever existing configurations you want to use, and hence promote reusability of a configuration.
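For illustration, a single-setting configuration file might look roughly like this. The server, database and connection manager names here are invented:

```xml
<?xml version="1.0"?>
<DTSConfiguration>
  <!-- One configuration file, one setting: the connection string
       of a connection manager called MyDatabase. -->
  <Configuration ConfiguredType="Property" ValueType="String"
                 Path="\Package.Connections[MyDatabase].Properties[ConnectionString]">
    <ConfiguredValue>Data Source=MyServer;Initial Catalog=MyDatabase;Integrated Security=SSPI;</ConfiguredValue>
  </Configuration>
</DTSConfiguration>
```

Because the file configures exactly one property, any package that uses the MyDatabase connection manager can pick it up without dragging in unrelated settings.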
Organise Your Configuration Files
The previous tip explains a policy that I follow religiously, but it does mean that you can end up with a plethora of configurations that you are required to manage. For this reason you should ensure that all your configuration files exist in a dedicated folder.
Taking the idea further, why not employ a common folder structure on all of your projects? Having separate folders, all at the same level, for raw files, log files, configuration files and checkpoint files simplifies management of your SSIS execution environments.
New Log File Per Execution
If you are using the SSIS Log Provider for Text Files then by default SSIS will log all events from multiple executions to the same log file. This makes it extremely difficult to separate one execution's log entries from the next, so it is a good idea to have a separate log file for each package execution.
There is a very easy way of achieving this. Use an expression on the connection string of your log file connection manager to specify a different log file name for each execution. The name of the file should be relevant to the execution, so it is a good idea to concatenate the value of the @[System::StartTime] variable into the log file name, thereby ensuring a distinct name for each execution.
The screenshot here shows how easily this is achieved.
The expression shown concatenates the path and filename with the date in the format YYYYMMDD thus ensuring that log files appear in Windows Explorer in execution order. The expression can easily be extended to include hour, minute and second precision if desired.
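A sketch of such a ConnectionString expression is shown below. The User::LogFolder variable and the file name pattern are assumptions for illustration; your own paths and naming will differ:

```
@[User::LogFolder] + "\\Log_" + @[System::PackageName] + "_"
 + (DT_WSTR, 4) YEAR( @[System::StartTime] )
 + RIGHT("0" + (DT_WSTR, 2) MONTH( @[System::StartTime] ), 2)
 + RIGHT("0" + (DT_WSTR, 2) DAY( @[System::StartTime] ), 2)
 + ".log"
```

The RIGHT("0" + ..., 2) calls zero-pad the month and day so that, for example, an execution on 9 March 2007 of a package called LoadSales produces Log_LoadSales_20070309.log, keeping the files in execution order in Windows Explorer.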
That concludes my write-up on some of my experiences and tips gained from implementing SSIS in a real production environment. I hope that some of the points raised here prove useful to you in your SSIS implementations and even if you don't agree with what I have said then it will hopefully have given you some food for thought. What I really want to achieve with this is to promote a culture of sharing experiences so as to derive some best practices that we all can learn from.
If you have any queries or comments on anything that I have said here then please drop me a mail at Jamie@Jamie-Thomson.net.