• First off, I want to say that I’m quite impressed by the power and flexibility of SSIS; I just wish there were a better way to track data errors and to know which row and column caused the error.

    To perform this type of error logging, I was using the Derived Column task to add additional metadata (e.g., Package Name, Task Name, Date, and a custom error message to help identify which row caused the error), the Script Component task to get the error message, and then a Destination task to store the data about the error. I added these three tasks to every major step in a Data Flow, which usually meant about 30 additional tasks per Data Flow (3 error-logging tasks for each of roughly 10 main steps).
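
    Here is a minimal sketch of the Script Component piece. C# is shown for readability (the SSIS 2005 Script Component itself is VB.NET, though the calls are the same), and the ErrorDescription and PackageName output columns are assumptions, added to the component purely for logging:

        // Hypothetical Script Component callback. Input0Buffer carries the
        // error path's columns, including the standard ErrorCode column;
        // ErrorDescription and PackageName are assumed output columns.
        public override void Input0_ProcessInputRow(Input0Buffer Row)
        {
            // Translate the numeric error code into a readable message.
            Row.ErrorDescription = ComponentMetaData.GetErrorDescription(Row.ErrorCode);

            // Metadata otherwise supplied by the Derived Column task
            // (System::PackageName exposed as a read-only variable).
            Row.PackageName = Variables.PackageName;
        }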

    This method of reporting data errors works fine, but it is very tedious to develop and maintain. The three additional tasks are identical for every main task that needs error logging, and every time the metadata changes, the ETL developer has to refresh the metadata of all of them. Another shortcoming of this approach is that the task’s internal error handling only reports the column’s lineage ID (ErrorColumn), not the name of the column causing the error.

    The main reason the column’s lineage ID cannot be resolved is that it is not in the error pipeline. The error pipeline consists of the previous task’s input columns plus two additional columns (ErrorCode and ErrorColumn). The column causing the error is typically created (or derived) in the previous task and then added to its output collection, so its lineage ID never appears in the error path’s input.
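
    To make that concrete, here is a hypothetical lookup against the error path’s own input (the "90" interfaces belong to SSIS 2005; later versions use "100"). GetInputColumnByLineageID throws when the ID is not present in that input, which is exactly what happens for a column derived upstream:

        // Assumes: using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
        // Hypothetical sketch inside a component attached to the error path.
        string columnName = null;
        try
        {
            IDTSInputColumn90 col = ComponentMetaData.InputCollection[0]
                .InputColumnCollection
                .GetInputColumnByLineageID(errorColumnLineageId);
            columnName = col.Name; // only succeeds if the column survived into this input
        }
        catch (Exception)
        {
            // Lineage ID not found: the column lives in the upstream task's
            // output collection, not in this error pipeline's input.
        }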

    In my search for a solution, I noticed that this was a common request among many ETL developers. I ended up creating a custom task component to log data issues. The main concept is to have only one custom component and to connect every task that has an error output path to it. When an error occurs, the component first looks in the error’s input path in the hope of also capturing the value causing the error (I have yet to get any data from this), then traverses all the other input paths to resolve the column name from the ErrorColumn lineage ID; the name is resolved from a task that is downstream in the main pipeline. Another advantage of this approach is that I only have one task to configure per Data Flow instead of 3 per main task, which decreased our ETL development time by 80%.
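
    A sketch of that resolution logic inside the custom component (the helper name is mine, and it again assumes the 2005 "90" interfaces). Because the downstream tasks of the main pipeline are also wired into the component, the derived column’s lineage ID is usually found on one of those inputs:

        // Assumes: using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
        // Hypothetical helper: walk every input attached to the logging
        // component and look for the lineage ID reported in ErrorColumn.
        private string ResolveColumnName(int errorColumnLineageId)
        {
            foreach (IDTSInput90 input in ComponentMetaData.InputCollection)
            {
                foreach (IDTSInputColumn90 column in input.InputColumnCollection)
                {
                    if (column.LineageID == errorColumnLineageId)
                        return column.Name;
                }
            }
            return null; // ID not visible on any attached path
        }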

    I hope this explains why it’s hard (or impossible) to get the column name with a script task, and suggests a direction to look in order to get it.

    I referenced Chapter 14 of Professional SQL Server 2005 Integration Services to build the custom component.