Tim Mitchell presented at SQLSaturday #17 on Scripting in SSIS. I thought it was a really interesting presentation, and I’m glad I went. I haven’t done much SSIS work, and very little with scripting, so I learned a few things right off.
Tim is a great contributor here at SQLServerCentral, and I was happy to support him. He made a nice intermediate presentation on scripting in SSIS and I hope he writes a few articles.
The presentation started with a little background on how the scripting works in SSIS. Tim notes that VB.NET only could be used in 2005, but C# support was in SQL Server 2008. He also said that there were two scripting items in SSIS. The Sscript task in the data flow is where his presentation focused.
If the only tool you have is a hammer, everything looks like a nail. He used that analogy and said if you can do something with another task, use that. Don’t need to use scripting for everything,. Scripting works well when you can’t do something with native tools, or when it is very cumbersome.
This is lightweight programming, but it is programming. There can be performance issues as well since you’re essentially writing software. This also can be good job security if you learn to do it well. He mentioned a demand for SSIS jobs, which I’ve seen as well.
As far as debugging goes, there are no good end user feedback from the tools. This can be challenging if you are not a developer.
The Script task in the control flow typically is used for anything except moving data. It can, but usually used to work with OS, filesystem, external programs, variables, etc..
Script component is for data flow, used for data manipulation. Can be useful for branching, thought the presentation didn’t focus on that.
Tim works a lot with hospitals,and patient data. The script component works well dealing with the various unconventional inputs that he sees like:
I haven’t done a lot of SSIS work, so it’s good to see the different real world types of issues people deal with. It might come in handy in the future if I run into one of these situations.
Tim had a nice record import example with characters from “The Office”, using characters and situations. Jim with a pencil stuck in his hand, Meredith’s car accident. and Dwight’s overdose of beets. That made me smile seeing it.
Then he used a script to output to three tables, 3 outputs, from one input. A script component allows n outputs. The outputs are set to the name and datatype for each output table or destination.
The edit script can be intimidating, but it shows real programming code. The CreateNewOutputRows() method that reads the file and loops through the stream. Each row is read in the loop and if it matches a particular value, it is added to the buffer for a particular output. This needs to be done before the values are added to the various elements of the output.
There was no error handling in the script, which I hate to see, but I understand trying to keep things simple. I would prefer to see this included and then skip over/mention the fact that it is in there. It’s easy to comment on TRY..CATCH blocks, but show a good use of the error handling structure.
The second example was the reverse of the first. Take data from multiple sources and combine it into one record. The interesting thing here is that the input needs to be un-synchronized from the input. Not all rows are going to be merged. The third example is to get multiple lines per input. For example, a report type item that describes a row in the db across multiple rows in the report. This was an interesting use of the Script component and basic programming to handle this structure. The fourth example reversed the third.
I thought this was a slightly dry session, and I have sent some specific notes to Tim for future work. Not that I’m a great presenter, but I noticed a few things that could be better. However as a basic introduction to the script component, I thought this was fantastic.
Disclaimer: I read and speak one language, having failed pretty well at learning Latin, Spanish, and then Japanese in my schooling. I'm sure there are more than a few people that would actually say I've not doing too well with English, either!
I've got a few examples here of "data quality" issues that I've seen in emails and posts lately. I don't intent to make fun of anyone, and I'm sure I would make much worse mistakes if I were to attempt to post on a non-English site. Instead I thought these highlighted some great challenges in the data world. First my examples:
"Greet" in response to fixing something.
"I'm thinning about the best way to ..." - A post wondering about a T-SQL query.
"sintax error"
That last one might be easily corrected, and I've seen other errors that are worse (and I can't find right now). But how smart does a routine need to be to decode these types of grammatical issues?
You might think a grammar checked can handle things, but I've written a lot of sentences that Word flags as having an issue, but isn't sure what to do with them. And Word is a free-form application. Imagine if you are trying to do some type of parsing or clean-up of data that isn't constrained with look-up tables?
Data quality is becoming a bigger and bigger issue in our world, and I'm not even sure that we realize it. More and more systems exchange data, and greater amounts of it. As companies seek to work together, and partner to develop new applications, they are merging data between them, depending on employees that aren't always DBAs to somehow match up data. Or they depend on automated systems to "guess" what should go where? I'm not always sure they do a good job matching up data.
And then information is lost.
Not that DBAs do a better job, but I think a human has a better chance of learning from past mistakes and correcting them in the future.
I'm not an ETL expert, but I think there is a tremendous amount of flexibility and power in the SSIS programming model to help you figure out how to best match up data from disparate sources and clean if before it infects your system.