The situation described is our production architecture. We deal with this issue on an ongoing basis and have had various successes and failures.
The issue, as we have defined and observed it, is that when you join in a table on a remote server the remote server returns the entire dataset to the requesting server and then that dataset is optimized into the local query. So if I join in a table with 50,000 rows, even though in the ON clause I have filtered to only needing ONE row, the remote server will return a dataset with all 50,000 rows and then the local server will use the dataset in an optimized local query.
So the solutions offered do address the underlying question: How do I get the remote server to only return the rows that I actually want, instead of a complete dataset of the entire table?
Temp tables, OPENQUERY, etc are fine IF you have the filter criteria when you create the temp table or OPENQUERY. But if the filter criteria for the remote is derived from the values of several other columns from several other tables in the query itself, then temp and openquery crumble away as my filter criteria from columns in other tables increases. My understanding is that the remote server is returning a temp dataset any way, so this can only be more effective if you have and can use filter criteria.
So what's a developer to do!??!!??
Here is what we do that has proven results. We use a join clause with a sub-select statement. In the example below TBLA is local and TBLB is remote.
Select TBLA.cola, TBLA.colb, TBLA.colc, TBLB.cola, TBLB,colb From TBLA
inner join (Select cola, colb, colx, coly, colz From Remote.TBLB_Remote) TBLB
On TBLB.colx = TBLA.cola and TBLB.coly = TBLA.colb and TBLB.colz = TBLA.colc
When you use a join with a sub-select statement the remote query is filtered by the criteria in the ON clause of your join and instead of the remote sending back the entire dataset of the table, it only sends back the rows you actually need, but of course it makes a trip to the remote server for every row in TBLA.
If you can put a where clause on the sub-select to further limit the dataset, even better!
If your remote server is colocated in the data center with the local server (as in our case) and/or you have sufficient WAN bandwidth, this method is very efficient.
If the local server is located in Atlanta and the remote is located in Chicago and you have a DSL between locations, this would not be good. But nothing would be good in that case! 😀
In essence you must choose between the load of a large amount of data being transmitted across the WAN against multiple trips with a small amount of data. Usually the latter wins.
Also if you NEED the majority of the rows in TBLB, go ahead and do a regular join and get them and let the local server optimize the remote dataset into the local query. In other words, if I am selecting 50,000 rows from TBLA and I have one or more matching rows in the remote TBLB for every row in TBLA, go ahead and get all the rows from TBLB at one time because your going to need them. In our experience, this is not usually the case.
Just one last point. Our development environment is not distributed. We take all the databases from all of our SQL servers and put them on a single development SQL server. Using the method I described above, when you are not dealing with remote, actually makes the query run slower when you are not in a remote environment. This makes the efficiency of new development queries that are intended to work cross-server difficult to measure.
Just another option to put in the toolbelt!
Well, nuff said.