• knausk (2/27/2012)


    Is it still the case that any join (heterogeneous to another SQL Server or to some other DB, say Oracle) between a linked server table and a local table will basically ignore any relevant WHERE clause constraints on the remote table, and bring the whole remote table across locally before applying the constraints?

    This is not obvious to track or assess, but temp space starts to fill up, and it hasn't always been released readily once the query finishes.

    It depends on how your query is structured. Yes, it *can* bring back more data than you might think it needs, but not always. You're right, it's not easy to assess, except that if you are working with really large tables and the perf is still very good, you know you've set things up the right way to avoid this problem.

    What's the right way? Unfortunately it does seem to differ between providers, so basically I just spend a lot of time trying things out until I nail something that works. That usually means pre-filtering the data that comes across by dynamically building the SQL statement that goes into the OPENQUERY, if I can, instead of applying all the conditions on "this side".

    That means you end up with: exec sp_executesql @mysql ... and it can be a pain to construct (although, before anybody has a knee-jerk reaction to the term "dynamic", it does not add any security risk that wasn't there for some other reason already). But worth it.
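    Here's a minimal sketch of what I mean, assuming an Oracle linked server named ORA_LINK and a remote table called REMOTE_ORDERS (both names hypothetical). The point is that the filter is baked into the string OPENQUERY sends, so only the matching rows come across:

        -- Build the remote-side statement first, with the filter baked in,
        -- so the remote server only sends back the rows we actually want.
        -- ORA_LINK and REMOTE_ORDERS are hypothetical names.
        DECLARE @cutoff varchar(10) = '2012-01-01';  -- validate before use
        DECLARE @remote nvarchar(max), @mysql nvarchar(max);

        SET @remote = N'SELECT ORDER_ID, CUSTOMER_ID, AMOUNT FROM REMOTE_ORDERS '
                    + N'WHERE ORDER_DATE >= TO_DATE(''' + @cutoff + N''', ''YYYY-MM-DD'')';

        -- OPENQUERY only accepts a string literal, so the whole statement
        -- has to be assembled and run dynamically, doubling the embedded quotes.
        SET @mysql = N'SELECT * FROM OPENQUERY(ORA_LINK, '''
                   + REPLACE(@remote, '''', '''''') + N''')';

        EXEC sp_executesql @mysql;

    The quote-doubling is the painful part I mentioned, but it's mechanical once you get it right, and the remote side now does the filtering.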

    The author's advice to create a local #temp table with the remote data essentially does this same remote copy explicitly, and then does the join locally.
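    Something like this, to make it concrete (same hypothetical ORA_LINK and REMOTE_ORDERS as above, plus a made-up local dbo.Customers table):

        -- Pull only the filtered remote rows into a local #temp table...
        SELECT ORDER_ID, CUSTOMER_ID, AMOUNT
        INTO   #remote_orders
        FROM   OPENQUERY(ORA_LINK,
               'SELECT ORDER_ID, CUSTOMER_ID, AMOUNT FROM REMOTE_ORDERS
                WHERE ORDER_DATE >= TO_DATE(''2012-01-01'', ''YYYY-MM-DD'')');

        -- ...so the join itself runs entirely locally, with local statistics,
        -- instead of as a heterogeneous join across the linked server.
        SELECT c.CustomerName, r.ORDER_ID, r.AMOUNT
        FROM   dbo.Customers AS c
        JOIN   #remote_orders AS r ON r.CUSTOMER_ID = c.CustomerID;

        DROP TABLE #remote_orders;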

    I have avoided heterogeneous joins between linked servers and find SSIS vastly faster for such tasks.

    If you can limit the data that's brought across, this isn't going to be much faster, and it might be slower if you're dragging a lot of data. Which is not to say I don't do it <s>. But often, if the data doesn't need to be "live", I'll instead bring it into a *non* temp table, IOW keep it on "this side". And that goes for some heterogeneous systems but also for some MS SQL-to-MS SQL scenarios.
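    The mechanics are the same as the #temp version, just targeting a permanent staging table you can refresh on whatever schedule fits (dbo.Staging_RemoteOrders is a made-up name):

        -- Refresh a permanent local cache of the remote data. Unlike #temp,
        -- it survives the session, so every consumer can reuse it.
        TRUNCATE TABLE dbo.Staging_RemoteOrders;

        INSERT INTO dbo.Staging_RemoteOrders (ORDER_ID, CUSTOMER_ID, AMOUNT)
        SELECT ORDER_ID, CUSTOMER_ID, AMOUNT
        FROM   OPENQUERY(ORA_LINK,
               'SELECT ORDER_ID, CUSTOMER_ID, AMOUNT FROM REMOTE_ORDERS');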

    This brings up one interesting small bonus of having used the linked server in the first place: suppose you start with the linked server, and then you realize you want to cache the data to avoid the perf issues. Suppose there are developers going against the linked server, and you now want them to go against the cache. If your linked server uses the OLE DB Provider for SQL Server, as I indicated earlier, you can actually set up a linked server to the local instance itself, pointing at your new cache of the data. This is one way to transition developers who already have a lot of integration code in play.
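    A sketch of that loopback setup (server and database names are made up; the provider is the SQL Server Native Client of that era):

        -- Redefine the linked server name the developers already use, but
        -- point it back at the local instance that holds the cache.
        -- (Drop the old definition first: EXEC sp_dropserver 'REMOTE_SRC', 'droplogins')
        EXEC sp_addlinkedserver
             @server     = N'REMOTE_SRC',    -- the name existing code references
             @srvproduct = N'',
             @provider   = N'SQLNCLI',       -- OLE DB provider for SQL Server
             @datasrc    = N'MYLOCALSERVER'; -- this instance, not the remote one

        -- Existing four-part-name queries such as
        --   SELECT ... FROM REMOTE_SRC.StagingDb.dbo.Staging_RemoteOrders
        -- now resolve to the local cache without the callers changing a line.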

    Later, of course, you can write more elegant interfaces by massaging the cached data according to each integration's needs, and you can move the developers to these even-more-efficient schemas at their own pace. But the "raw" cached data -- matching the original source -- is still there in your staging tables, and they can access it directly for as long as they need to.
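    Those "more elegant interfaces" can start as nothing fancier than views over the staging tables (again, hypothetical names):

        -- A consumer-friendly shape over the raw staging data; callers who
        -- move to this view stop caring how the cache gets loaded.
        CREATE VIEW dbo.RecentOrders
        AS
        SELECT r.ORDER_ID, r.CUSTOMER_ID, r.AMOUNT
        FROM   dbo.Staging_RemoteOrders AS r
        WHERE  r.AMOUNT > 0;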

    In my mind, if I am caching or replicating data, whether I use SSIS to do it is not material to the real perf gain I'm going to get; sometimes I use SSIS, sometimes I don't. But in cases where I *must* have the live data, usually SSIS processing is just going to get in the way.

    >L<