Exotic Joins in T-SQL and How They Translate in SSIS

  • Comments posted to this topic are about the item Exotic Joins in T-SQL and How They Translate in SSIS

  • This is not a real join since only the first row from the lookup table is returned:-(

    I have had the same problem and my experience is that there is not a general way to do joins in SSIS but your solution is very dependant on the problem at hand. In worst case you have to make a script task.

  • It's only one step to realize that better solution will be move all joins to database with staging area 🙂

  • Interesting. You say that this improved 10 minutes from using a Merge Join - that doesn't surprise me. I'd be more interested in knowing if there is a performance improvement/degredation compared to simply running it as a SQL statement.

    Can I also ask why you chose to implement as a data flow given that you already had it working in T-SQL?

    -Jamie

  • Hi Jamie, thanks for your interest on this topic.

    To answer the second part of your query first, the decision to implement as a dataflow rather than use the T-SQL in a stored procedure or in a SQL task within SSIS is mainly legacy based. Although the ability to change the way the package works is far simpler in SSIS than having to trudge through the SQL. This will also make it easier for future developers to follow on and improve on existing flows within the package.

    On improvement/degradation,

    In my experience and that is not the same as everyone else, the improvement on the SQL was markedly greater, although when developing our initial solution it was done mainly on a workstation. I will look into getting the server stats for you with regard to the differences.

    Nick.

  • Interesting article. I ran into something similar recently, but in our environment, the overhead of the individual select statements was too much for our production environment. I ended up implementing a script task (and started thinking about breaking it out into a new data flow component).

  • Nick Maroudas (2/4/2008)


    Hi Jamie, thanks for your interest on this topic.

    To answer the second part of your query first, the decision to implement as a dataflow rather than use the T-SQL in a stored procedure or in a SQL task within SSIS is mainly legacy based. Although the ability to change the way the package works is far simpler in SSIS than having to trudge through the SQL. This will also make it easier for future developers to follow on and improve on existing flows within the package.

    Very interesting justifications. I wrote an article once comparing these 2 options which you can see here: http://blogs.conchango.com/jamiethomson/archive/2006/03/14/SSIS_3A00_-Data-flows-or-T_2D00_SQL.aspx

    I'd be really grateful if you could add your comments into the comments section there.

    On improvement/degradation,

    In my experience and that is not the same as everyone else, the improvement on the SQL was markedly greater, although when developing our initial solution it was done mainly on a workstation. I will look into getting the server stats for you with regard to the differences.

    Nick.

    I would expect T-SQL to be significantly faster. You're using a row-by-row lookup which will be very slow in comparison.

    -Jamie

  • To answer the second part of your query first, the decision to implement as a dataflow rather than use the T-SQL in a stored procedure or in a SQL task within SSIS is mainly legacy based. Although the ability to change the way the package works is far simpler in SSIS than having to trudge through the SQL. This will also make it easier for future developers to follow on and improve on existing flows within the package.

    Maybe change SSIS task is simple and fast but you don’t forget about software development and ETL other tracks.

    -Find impact changes

    -Forward changes to test environment

    -Multi user development

    -Test changes (fe. blackbox test input – output)

    -Debug and error handling

    -Versioning

    -Deployments task

    -Logic for concurrent loading/ incidental loading…

    … etc.

    Maybe it’s only my opinion but TSQL rules !

  • djedgar (2/4/2008)


    To answer the second part of your query first, the decision to implement as a dataflow rather than use the T-SQL in a stored procedure or in a SQL task within SSIS is mainly legacy based. Although the ability to change the way the package works is far simpler in SSIS than having to trudge through the SQL. This will also make it easier for future developers to follow on and improve on existing flows within the package.

    Maybe change SSIS task is simple and fast but you don’t forget about software development and ETL other tracks.

    -Find impact changes

    -Forward changes to test environment

    -Multi user development

    -Test changes (fe. blackbox test input – output)

    -Debug and error handling

    -Versioning

    -Deployments task

    -Logic for concurrent loading/ incidental loading…

    … etc.

    Maybe it’s only my opinion but TSQL rules !

    I don't think its only your opinion 🙂

    But I can't help but take issue with some of the things that you say there:

    -Find impact changes SSIS has this. Limited, but it has it.

    -Forward changes to test environment If this means its possible to promote code to different environments then SSIS has this too.

    -Multi user development SSIS has this

    -Debug and error handling SSIS has this

    -Versioning SSIS has this (versioning of packages). And unless there's some sort of new feature that I don't know about there is no versioning of stored procs within the database engine

    -Logic for concurrent loading/ incidental loading Not quite sure what this means but with SSIS its possible to load multiple tables concurrently.

    And that's just my opinion 🙂

    -Jamie

  • I didn’t write that it’s missing and I don’t see fault in SSIS only

    Let’s evaluate

    -Forward changes to test environment If this means its possible to promote code to different

    environments then SSIS has this too.

    Promote code to different environments should freeze projects but when you want

    test package you must write\reconfigure connection – potential errors !

    -Multi user development SSIS has this

    It’s for me freezing code for some task . We need to implement a little complex data flow task

    we have two programmers how we could divide jobs for them ?

    -Debug and error handling SSIS has this

    Yes but I prefer try catch statements with nested functionality

    -Versioning SSIS has this (versioning of packages). And unless there's some sort of new feature

    that I don't know about there is no versioning of stored procs within the database engine

    But if you can use not commercial concurrent versioning system it’s maybe problem delegate

    code because xml include some binary data.

    -Logic for concurrent loading/ incidental loading Not quite sure what this means but with SSIS

    its possible to load multiple tables concurrently.

    In real life ETL is very complex and has very dependent task. When something

    is going wrong it’s needed quick find the errors and reprocess flow.

    In some situations the best choice will be ignore error data, how can admin do it without modification

    packages ? Why to the @#!#!! when you run package in agent job you can only see it as one atomic job. I want to say that SSIS package without more flexibly control loading tool (based on metadata) it’s condemned only on run in dubug mode 😉

  • djedgar (2/4/2008)


    I didn’t write that it’s missing and I don’t see fault in SSIS only

    Let’s evaluate

    -Forward changes to test environment If this means its possible to promote code to different

    environments then SSIS has this too.

    Promote code to different environments should freeze projects but when you want

    test package you must write\reconfigure connection – potential errors !

    Not if you're following best practices (i.e. use configurations or apply environmental-specific settings from the command-line). For anyone reading this, be aware that you should NEVER EVER be required to edit packages when moving them through environments.

    -Multi user development SSIS has this

    It’s for me freezing code for some task . We need to implement a little complex data flow task

    we have two programmers how we could divide jobs for them ?

    Fair point. Although I can't envisage any situation where 2 developers have to work on a single data-flow. That's akin to 2 developers working on a stored procedure concurrently and I wouldn't generally allow that. I'm not saying that's correct though, if you have a different approach then I have no issue with that.

    But I strongly believe that not allowing 2 people to edit a data-flow simultaneously is a GOOD thing. Having said that, not allowing 2 people to work on a package simultaneously is a BAD thing.

    -Debug and error handling SSIS has this

    Yes but I prefer try catch statements with nested functionality

    Fair enough. Each to their own.

    -Versioning SSIS has this (versioning of packages). And unless there's some sort of new feature

    that I don't know about there is no versioning of stored procs within the database engine

    But if you can use not commercial concurrent versioning system it’s maybe problem delegate

    code because xml include some binary data.

    Errr OK. Not really sure I understand this one. Are you saying you want to version each task? If so then yes, I think that's a very fair criticism. SSIS doesn't handle that very well at all.

    -Logic for concurrent loading/ incidental loading Not quite sure what this means but with SSIS

    its possible to load multiple tables concurrently.

    In real life ETL is very complex and has very dependent task. When something

    is going wrong it’s needed quick find the errors and reprocess flow.

    In some situations the best choice will be ignore error data, how can admin do it without modification

    packages ? Why to the @#!#!! when you run package in agent job you can only see it as one atomic job. I want to say that SSIS package without more flexibly control loading tool (based on metadata) it’s condemned only on run in dubug mode 😉

    [/quote]

    You say "how can admin do it without modification packages " That's completely true but exactly the same applies to using stored procedures so I don't really understand this one. Sorry.

  • Hi Nick,

    I had some doubts on this range lookup. I had once a reqt for range look up.

    Initially i had implemented the similar logic of joining 1 col from source to one of the cols in the lookup table. it had given me wrong values.

    Later i had to have 2 columns in the source linked to the lowerlimit col and upper limit col. The 2nd col in the source is a copy of the first. And in the advanced tab, i had written the query similar to yours. it had worked for me.

    Can u confirm mapping 1 col from source to lookup table will definitely work.

    Vvek

  • Hi Vvek,

    You do have to have the upper and lower limits within the lookup table for the query in the advanced tab to work. I have now used this range lookup within a few of my packages and the values returned are correct.

    As you are closer to your own data it is hard for me to say what is being returned is correct. I found that by using the lookup against the lower limit, the data returned for me was correct.

    I hope it gives you confidence with your project.

    Nick.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply