SSIS advised over stored procedure for this scenario?

Question

Post reply

SSIS advised over stored procedure for this scenario?

matt.newman

Hall of Fame

Points: 3910
More actions
December 8, 2011 at 2:13 pm

#261365

Current setup is as such:
Database A,B, and C on single Raid 10 volume.
DataWareHouse database on separate Raid 1 volume.
SSAS, SSIS, and SQL Engine services all reside on single server.
36 GB RAM
MSSQL 2005 SP4
Data is migrated from the database set A,B,C to the DW database using a stored procedure on a nightly scheduled job. It works. It has decent speed.
I intend to re-architect the data warehouse that will use pulling of data, that will most definitely take longer to complete using the stored procedure.
I have been looking at and learning SSIS but I see the SSIS setup doing the same as a stored procedure would.
What is(are) the benefit(s) of using SSIS over a stored procedure for this example above?
Keep more things out of the SQL Buffer Pool that relate to the data pulled? - Seems like one way or another a query must be done.
Without having setup the entire system to see which is faster or how the server is affected, I keep looking at SSIS as a way to do the same thing, but with more complexity and advanced learning for a developer looking at making a change to the data warehouse load process.

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 1

matt.newman (12/8/2011)
DataWareHouse database on separate Raid 1 volume.

Like living dangerously, I see? Might I recommend at least RAID5 so you can survive a single drive outage?

I intend to re-architect the data warehouse that will use pulling of data, that will most definitely take longer to complete using the stored procedure.

Why? Without information it's hard to make a rational suggestion with very generic descriptions.

I have been looking at and learning SSIS but I see the SSIS setup doing the same as a stored procedure would.

It depends on how you use it and manipulate the stream. I've found that a lot of translation work can be sped up using SSIS lookup components, especially when you have dependent translations which can be more quickly handled in a datastream then via multiple update passes.

What is(are) the benefit(s) of using SSIS over a stored procedure for this example above?

There's not enough information on the example other than 'I'm moving data around'. In that case unless the volume is tremendous and you'd like some automated buffer controls, they're pretty much the same.

Keep more things out of the SQL Buffer Pool that relate to the data pulled? - Seems like one way or another a query must be done.

It's either in the SQL Buffer or it's in the SSIS Buffer (external to the SQL Memory allocation). the SQL Buffer wouldn't need to hold onto it as long once it's passed to the data stream, but it would need to be re-injected for the delivery location.

Without having setup the entire system to see which is faster or how the server is affected, I keep looking at SSIS as a way to do the same thing, but with more complexity and advanced learning for a developer looking at making a change to the data warehouse load process.

SSIS is becoming a tool in its own right, similar to SQL Server, and optimization patterns can be a bit different. In the end, for total optimization, it depends on *exactly* what are you doing and what you're expecting to get out of it. If it's on the same server, and is doing simplistic work, cross-database procs are your most likely best bet.

If you want to discuss your process in detail end to end, there's probably places where SSIS will perform better and T-SQL will perform better, and you'll want to perform a dual-tool process.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

matt.newman Hall of Fame Points: 3910 More actions · Answer 2

I do live dangerously but not carelessly, and the reason for re-architecture is as such:

The current data warehouse is made up of two tables. DimDate, and a flattened out "Item" level table. There is no hierarchy in one to many relationships, there is no security for certain data to certain areas in the departments, there is really nothing other than one big table dumping where everything got merged out into a flat system. The ability to drill down into multiple chains is limited into whatever you see in the table. It's possibly seen as a WYSIWYG DW.

This all links to one and only one report through a cube that processed the DimDate and other table together. The Report using the Cube, although used regularly, has a counterpart outside of the DW that could get nearly the same data anyway through a embeded query inside of a MS Office application. It's on a Raid 1 because it was developed recently (before I got here) and pushed onto a fast and separate array than the other data is, but all run from the same host. If the drive went down and the report was not able to be reached, it would not have a huge loss recovery need in the immediate. Overall, it's useful, but incomplete from what it could be. I would be looking to up the drive structure for the cube and DW DB once the project is complete. Until then, I have my testing all done on a separate box anyway and am not throwing money at it on production. Hope that all makes sense.

There is quite a bit of aggregation, but it is primarily controlled by functions that hold the rules and are consumed by other reports that are running against the OLTP system for current data. The DW has an age of 2 am last insert and has some inconsistency from the OLTP system data from some other reasons I am looking to overcome in the restructuring.

The database will continue to reside on the same node. I had though SSIS was integrated in the server engine enough where it took the same buffer resources. I knew SSAS was independent, but did not know what to allocate for memory to SSIS processing. I would like to avoid SQL engine buffer pool dumps and any IO I can.

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 3

matt.newman (12/8/2011)
It's on a Raid 1 because it was developed recently (before I got here) and pushed onto a fast and separate array than the other data is, but all run from the same host. If the drive went down and the report was not able to be reached, it would not have a huge loss recovery need in the immediate. Overall, it's useful, but incomplete from what it could be. I would be looking to up the drive structure for the cube and DW DB once the project is complete. Until then, I have my testing all done on a separate box anyway and am not throwing money at it on production. Hope that all makes sense.

Yup. As I'm sure you're aware but just for anyone else playing along, make sure your backups are on a different array, or you're hosed even after a drive rebuild. 🙂

There is quite a bit of aggregation, but it is primarily controlled by functions

I assume you mean iTVFs here?

that hold the rules and are consumed by other reports that are running against the OLTP system for current data. The DW has an age of 2 am last insert and has some inconsistency from the OLTP system data from some other reasons I am looking to overcome in the restructuring.

That sounds like quite a task, due to the underlying foundations being disparate like that.

The database will continue to reside on the same node. I had though SSIS was integrated in the server engine enough where it took the same buffer resources. I knew SSAS was independent, but did not know what to allocate for memory to SSIS processing. I would like to avoid SQL engine buffer pool dumps and any IO I can.

No, SSIS definately uses its own Memory resources and buffers, similar to SSAS and SSRS. Different Services, different memory pools (SQL Agent too, if I remember correctly, it just uses so little it's a non-issue). I would keep this in T-SQL personally, from the general description you have. The devil's in the details and without going over all the sub-functionality piece by piece I'd have almost no chance to make solid recommendations of what to put where. Since your functions are already T-SQL and you probably won't gain a tremendous amount calling them directly to SSIS but having to rebuild the components that make sense there, I'd leverage what you already have.

It all depends on just how performant you need this to be and if the T-SQL alone will be 'good enough'. The more I learn about the data stream and the deductions you can make about optimizing what where the better I get at it, but it's very 'The Devil's in the Details' kind of optimization. If you're not comfortable with it and don't have the time to seriously invest in it, I'd pursue the T-SQL route until you reach a point where you just can't achieve the performance you need, then look into other avenues, such as SSIS or significant re-architecture.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

matt.newman Hall of Fame Points: 3910 More actions · Answer 4

Table valued functions, yes, not scalar/inline.

Thanks for the thoughts on it. The devil is in the details, and unfortunately this is more strategics for what could potentially come. The project is hanging at the moment. I am trying to get some ideas on how to architect quickly and easily before the word is they need this entire system inside of two weeks and they need it yesterday.

Thanks again!

GSquared SSC Guru Points: 260824 More actions · Answer 5

Kraig, RAID-1 uses mirroring. It can survive a single drive failure. Can't survive both drives failing, but neither can RAID-5. Multi-drive failure survival requires RAID-6, or RAID-51 (mirrored RAID-5), or one of the proprietary RAID arrangements that certain SAN/NAS solutions use. But RAID-1 isn't any more "living dangerously" than RAID-5 is. It just gives less room (half instead of N - 1), but can be done on 2 drives instead of minimum 3 (like RAID-5).

On the original point, I don't think you'll end up with a significant advantage using SSIS over a stored procedure unless you're setting up incremental loading. It can avoid DTC, even without incremental loads, which can give a significant performance improvement in some cases, so it's worth testing for that.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 6

GSquared (12/13/2011)
Kraig, RAID-1 uses mirroring. It can survive a single drive failure. Can't survive both drives failing, but neither can RAID-5.

I'm going to go grab a glass pipe and go back to smoking it. My brain locked onto RAID 0 and ran to the fences with the ball. Thank you for not letting me continue to look like a damned fool on that point.

On the original point, I don't think you'll end up with a significant advantage using SSIS over a stored procedure unless you're setting up incremental loading. It can avoid DTC, even without incremental loads, which can give a significant performance improvement in some cases, so it's worth testing for that.

It depends on if you have dependent transformations (IE: Transform 1 must complete before Transform 2 can start) . If you have non-dependent transformations you're right, there probably won't be much difference as long as you can do it in a single pass. The big deal comes with how often you have to touch the data. In a datastream you can do it all before the first write as the data goes by, once it's laid down you're looking at one (or more) updates that have to re-touch the rows.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

GSquared SSC Guru Points: 260824 More actions · Answer 7

Evil Kraig F (12/13/2011)
GSquared (12/13/2011)
On the original point, I don't think you'll end up with a significant advantage using SSIS over a stored procedure unless you're setting up incremental loading. It can avoid DTC, even without incremental loads, which can give a significant performance improvement in some cases, so it's worth testing for that.
It depends on if you have dependent transformations (IE: Transform 1 must complete before Transform 2 can start) . If you have non-dependent transformations you're right, there probably won't be much difference as long as you can do it in a single pass. The big deal comes with how often you have to touch the data. In a datastream you can do it all before the first write as the data goes by, once it's laid down you're looking at one (or more) updates that have to re-touch the rows.

Very true.

But the main reason I find for using SSIS for this kind of thing is that it has much more robust error handling, and better flow control, and for its ability to operate outside the SQL Server scope (e.g.: file system operations, or consuming web services, or FTP) without creating security holes in the database. In terms of a simple dataflow between SQL Server instances, SSIS is, in my experience, rarely more performant than T-SQL. (N.B.: "rarely", not "never")

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon