How to do in-memory SQL transformations within a data flow task

  • Hi all,

    first of all, I am new to Integration Services. I have googled everything so far, but I have not found an answer to my question, so I hope somebody here can help me out :-).

    Currently, I am migrating a T-SQL ETL process to SSIS to see the benefits of parallel execution and bulk inserts. The T-SQL code runs sequentially and persists some intermediate tables.

    A simplified example would look like:

    Source tables -> SQL transform -> persisted intermediate table 1 -> SQL transform -> persisted intermediate table 2 -> SQL transform -> target table.
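
    In T-SQL, the pattern looks roughly like this (all table and column names below are invented just for illustration):

        -- step 1: first transformation, persisted to an intermediate table
        SELECT s.CustomerId, s.OrderDate, s.Amount
        INTO   dbo.Intermediate1
        FROM   dbo.SourceOrders AS s
        WHERE  s.OrderDate >= '20180101';

        -- step 2: second transformation, persisted again
        SELECT i.CustomerId, SUM(i.Amount) AS TotalAmount
        INTO   dbo.Intermediate2
        FROM   dbo.Intermediate1 AS i
        GROUP BY i.CustomerId;

        -- step 3: final transformation into the target table
        INSERT INTO dbo.TargetCustomerTotals (CustomerId, TotalAmount)
        SELECT CustomerId, TotalAmount
        FROM   dbo.Intermediate2;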

    For the proof of concept I do not have the time to rewrite the SQL queries and find SSIS equivalents. I also do not like the idea of persisting the intermediate tables in SSIS; instead, I want to stream the data to the next transformation. The picture in the following link summarizes exactly what I want:

    https://docs.microsoft.com/de-de/sql/integration-services/data-flow/data-flow?view=sql-server-2017

    The problem: I don't know how to do it in SSIS. I haven't found any transformation step which accepts SQL input and passes the result to the next transformation step (the OLE DB Command is not the solution, as it operates on a per-row basis, from what I have read so far).

    A very dirty solution would be to put everything into CTEs, but then I would end up with a monster SQL query, which I don't like.
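
    Roughly, that would mean collapsing the steps above into chained CTEs, something like this (again with invented names):

        -- the same steps collapsed into a single statement via chained CTEs
        WITH Step1 AS (
            SELECT s.CustomerId, s.OrderDate, s.Amount
            FROM   dbo.SourceOrders AS s
            WHERE  s.OrderDate >= '20180101'
        ),
        Step2 AS (
            SELECT CustomerId, SUM(Amount) AS TotalAmount
            FROM   Step1
            GROUP BY CustomerId
        )
        INSERT INTO dbo.TargetCustomerTotals (CustomerId, TotalAmount)
        SELECT CustomerId, TotalAmount
        FROM   Step2;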

    How do I solve my problem? I need an operator to read the target columns, take them as input, apply some SQL and send the result columns to the next operator.

     

    Thanks for your help!

     

  • You can't. If you want to use T-SQL, you need to use the DB engine (and not SSIS). You'll likely have to rewrite your existing code to make best use of SSIS.


  • Hi,

    thanks for your fast reply. I suspected as much, since I couldn't find any literature on it. Still, it would be nice to have something like that: there is a Script transform, so why not a SQL transform? The user could then decide which tool to use to transform the data.

    I guess I need more time to leverage the benefits of the SSIS tools. For the sake of time, I will use an OLE DB source, paste the query into it, and write the result directly to an OLE DB destination, and so on. It's not nice, but it is acceptable for the time frame.

     

    Bye

     

  • A SQL transform would require access to a DB engine. To make use of that engine would require the passing of data to the engine and the subsequent retrieval of processed data from the engine. And this is effectively what you are planning to implement 🙂

    As you learn more about SSIS, you will no doubt change the way you think about how to architect your ETL, to take advantage of its features.


    OK, understood. I just thought that I could somehow skip manually persisting the intermediate tables.

    Can you recommend some links which help with translating T-SQL thinking into SSIS thinking?

    For me, it is currently quite difficult to find the right balance between the SSIS tools and doing as much of the transformation as possible in the OLE DB source object. I found a site with performance tips which sorts the SSIS tools into different performance buckets. It says, in effect: don't use the Sort transform, don't use the Merge transform; do your sorts and joins directly in the OLE DB source query. My sources are all databases, so I can leverage that. I understand that these components exist to allow such transformations on non-database sources (XML, CSV, Excel, etc.).
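
    For example, instead of doing the join and sort with SSIS transforms, something like the following (made-up tables and columns) would go straight into the OLE DB source query, so the DB engine does the work:

        -- join and sort done by the DB engine instead of SSIS Sort / Merge Join transforms
        SELECT o.OrderId, o.OrderDate, c.CustomerName, o.Amount
        FROM   dbo.Orders    AS o
        JOIN   dbo.Customers AS c
               ON c.CustomerId = o.CustomerId
        ORDER BY c.CustomerName, o.OrderDate;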

    So, any links about the philosophy and mindset of SSIS are more than welcome 🙂

  • I don't have any links to hand.

    But what I will say is that you should think of SSIS as a streaming tool.

    Transformations are applied to data as it streams along a processing pipeline.

    Anything which slows down, or temporarily blocks, the pipeline should be avoided if possible. See here, for example, for a list & further details.

    From experience, I can say that the DB engine is almost always faster, and significantly so, when it comes to set-based operations (joins, sorts, grouping, updates). If you need to do those types of operations, and you have many rows of data, consider delegating the work to the DB engine.


    You could make a start by putting your current T-SQL code into Execute SQL Tasks in SSIS; then it would work in exactly the same way it does now, but would be executed by SSIS. Later you could change them to use SSIS controls.

    Intermediate tables can be very useful for diagnosing any problems with the ETL, so it's not necessarily a great advantage to get rid of them. You can also index the intermediate tables to improve performance.
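
    For example, if a later step joins on or filters by a key column of an intermediate table, an index along these lines (hypothetical names) can help:

        -- index the intermediate table on the column the next step joins or filters on
        CREATE CLUSTERED INDEX IX_Intermediate1_CustomerId
            ON dbo.Intermediate1 (CustomerId);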

     

    So when dealing with data transformations on tables 40-50 GB in size: is it recommended to use the SSIS internal operators at all, or should I always go for DB engine calls from SSIS? I am only dealing with SQL Server tables as input and output. Sorry if this question goes beyond the original question of the thread.

  • svenflock wrote:

    So when dealing with data transformations on tables 40-50 GB in size: is it recommended to use the SSIS internal operators at all, or should I always go for DB engine calls from SSIS? I am only dealing with SQL Server tables as input and output. Sorry if this question goes beyond the original question of the thread.

    I'm not really expert enough to say which one you should be using. But I've helped develop an ETL system in SSIS, and apart from the population of the staging tables, which uses Data Flow Tasks (as it's getting data from a different database), nearly all the rest of the transforms are Execute SQL Tasks which contain large SQL scripts.

    As you already have the SQL scripts, it would be a relatively easy step for you to put these in Execute SQL Tasks within SSIS packages. From there you could convert them one at a time to use the SSIS transform controls and see if it makes an improvement.
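
    For example, each existing script could be wrapped in a stored procedure, so that each Execute SQL Task only contains a call like this (the procedure name and parameter are just examples):

        -- one Execute SQL Task per step, each just calling an existing script
        -- that has been wrapped in a stored procedure
        EXEC dbo.usp_LoadIntermediate1 @LoadDate = '20180101';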

    LoL, at least I benefit from the workflow engine and from parallel execution, logging, and error handling. That is something that would be very difficult to implement myself, especially having two SQL tasks running in parallel and firing a third one when both are finished.

    I will go with that approach and slowly adjust to the SSIS tools.

  • I'd be interested to know if the SSIS transforms perform any better or worse than well-written SQL scripts.

  • Jonathan AC Roberts wrote:

    I'd be interested to know if the SSIS transforms perform any better or worse than well-written SQL scripts.

    It's tough to beat well-written SQL scripts, but it's also quite rare that they can be found!

    Achieving a controlled degree of parallelism is easy in SSIS, so if, for example, there are multiple tables which can be populated concurrently, there's a possible, and even likely, gain there. Doing the same thing in T-SQL is not so easy. In most other areas (perhaps not complex string manipulation), I'd expect optimised T-SQL to triumph.

