How to join a local table to a remote table by way of a linked server

  • knausk (2/27/2012)


    Is it still the case that any join (heterogeneous, to another SQL Server or to some other DB, say Oracle) between a linked server table and a local table will basically ignore any relevant WHERE clause constraints on the remote table and bring the whole remote table across locally before applying the constraints?

    This is not easy to track or assess, but temp space starts to fill up and hasn't always been released readily once the query finishes.

    It depends on how your query is structured. Yes, it *can* bring more data back than you might think it needs to, but not always. You're right, it's not easy to assess, except that if you are working with really large tables and the performance is still very good, you know you've set things up the right way and avoided this problem.

    What's the right way? Unfortunately it does seem to be different for different providers, so basically I just spend a lot of time trying things out until I nail something that works. This usually involves pre-filtering the data that comes across by dynamically building the SQL statement that goes into the OPENQUERY, if I can, instead of applying all the conditions on "this side".

    That means you end up with: exec sp_executesql @mysql ... and it can be a pain to construct (although, before anybody has a knee-jerk reaction to the term "dynamic", it does not add any security risk that wasn't there for some other reason already). But it's worth it.
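
    As a rough illustration of that pattern (the linked server, table, and column names below are made up), the whole statement is built as a string because OPENQUERY won't accept variables, and the filter is embedded in the pass-through query so it runs on the remote side before anything comes across:

    --Rough sketch only: [REMOTESRV], dbo.Orders, and dbo.Customers are made-up names.
    DECLARE @countryid int = 7;
    DECLARE @mysql nvarchar(max);

    SET @mysql = N'
    SELECT l.OrderID, r.Region
    FROM dbo.Orders AS l
    INNER JOIN OPENQUERY([REMOTESRV],
        ''SELECT CustomerID, Region FROM dbo.Customers WHERE CountryID = '
        + CAST(@countryid AS nvarchar(10)) + N''') AS r
        ON r.CustomerID = l.CustomerID;';

    EXEC sp_executesql @mysql;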

    The author's advice to create a local #temp table with the remote data essentially makes that remote copy explicit, and then the join is done locally.

    I have avoided heterogeneous joins between linked servers and find SSIS vastly faster for accomplishing such tasks.

    If you can limit the way the data is brought across, this isn't going to be much faster, and it might be slower if you're dragging a lot of data. Which is not to say I don't do it <s>. But often, if the data doesn't need to be "live", I'm instead going to bring it into a *non*-temp table, IOW keep it on "this side". And that goes for some heterogeneous systems, but also for some MS SQL-to-MS SQL scenarios.

    This brings up one interesting small bonus of having used the linked server in the first place: suppose you start with the linked server, and then you realize you want to cache the data to avoid the perf issues. Suppose there are developers going against the linked server, and you now want them to go against the cache. If you use a linked server with the OLE DB Provider for MS SQL, as I indicated earlier, you can actually set up a linked server that points back to itself, at your new cache of the data. This is one way to transition developers with a lot of integration code already in play.
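
    A minimal sketch of that loopback setup, with made-up names (LEGACY_SOURCE for the name the existing integration code already references, MYSERVER for the local instance hosting the cache):

    --Minimal sketch; LEGACY_SOURCE and MYSERVER are made-up names.
    --The linked server name matches what the integration code already uses,
    --but it now points back at the local instance that holds the cached data.
    EXEC sp_addlinkedserver
        @server     = N'LEGACY_SOURCE',
        @srvproduct = N'',
        @provider   = N'SQLNCLI',      -- SQL Server Native Client OLE DB provider
        @datasrc    = N'MYSERVER';

    EXEC sp_addlinkedsrvlogin
        @rmtsrvname = N'LEGACY_SOURCE',
        @useself    = N'TRUE';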

    Later, of course, you can write more elegant interfaces by massaging the cached data according to the integration(s)'s needs and you can move the developers to use these even-more-efficient schemas at their own pace. But the "raw" cached data -- matching the original source -- is still there in your staging tables, and they can access it directly for as long as they need to.

    In my mind, if I am caching or replicating data, whether I use SSIS to do it is not material to the real perf gain I'm going to get; sometimes I use SSIS, sometimes I don't. But in cases where I *must* have the live data, usually SSIS processing is just going to get in the way.

    >L<

  • Regarding using SSIS to acquire data, the ease is only proportional to the cooperation of the upstream provider, and the immediacy of the data is only as valid as the process that provides it. It has taken 3+ years to create a working relationship with our upstream provider, and the politics were much more important (and Machiavellian). Also, theirs is a 'data warehouse', created each night, such that the data is actually 24+ hours old when we get it. It is good to remember that politics sometimes trumps technical priorities.

  • Excellent article on setting up a linked SQL server. It could be interesting to cover linking to some other popular DB sources as well, just to round out the topic.

    Only one item stuck out at me as very misleading:

    "The result of executing this query is the same as if all the tables were on the same server, except that the query consumes more server and network resources."

    Yes, more resources are in fact used, but the "result of the executing query" is far from the same. The local and foreign servers do not share the same memory space, nor tempdb space. The local platform has little to no knowledge of keys and indexes that could be applied on the foreign source, and has to leave that to the foreign server to sort out (unless the dev intentionally builds the query to use them). This usually results in all rows of the foreign table (at least for the columns requested) being ported over into local memory/tempdb space before joins and criteria are applied, usually in an unindexed fashion (read: Full Table Scan).

    When the tables are on the same server, indexes may be utilized without even directly reading the table in question.

    When it comes to directly queried foreign data, the bulk of latency is nearly always "over the wire", involving multiple processors and I/O paths (minimum two sets), packing/unpacking, and then data to memory/TEMPDB.

    Use of *pre-filtered* TEMP tables is highly recommended. Filter what you want as much as possible and port that into a temp table to be joined upon.
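
    A short sketch of that pattern, with made-up linked server and table names: filter on the remote side first, land the reduced set in a temp table, index it, and then join locally.

    --Sketch only; [REMOTESRV] and the table names are illustrative.
    SELECT CustomerID, Region
    INTO #RemoteCustomers
    FROM OPENQUERY([REMOTESRV],
        'SELECT CustomerID, Region FROM dbo.Customers WHERE IsActive = 1');

    CREATE CLUSTERED INDEX IX_RemoteCustomers ON #RemoteCustomers (CustomerID);

    --The join now runs entirely against local, indexed data.
    SELECT o.OrderID, o.OrderDate, c.Region
    FROM dbo.Orders AS o
    INNER JOIN #RemoteCustomers AS c ON c.CustomerID = o.CustomerID;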

    I've made a living converting old multi-foreign-server join PROCs into a fully temp-table-driven paradigm, usually dropping execution time on very large data sets substantially, in some cases from hours to a couple of minutes, and in others from minutes to a few seconds. The front of the PROC gets set up with all of the necessary filtered temp tables, and then the joins are made. Rarely is it ever advisable to join local data to any type of foreign server directly in the FROM clause. A few minutes' saved expediency (per join) is rarely an excuse for leaving a known time-bomb such as direct foreign-server joins of any type, including linked servers.

    The same goes for creating a linked server to a DB that is actually local. I don't believe the parser in most if not all of the current major DB systems will circumvent a link to query the indexes directly before deciding what to request, and the only time saved in this limited case may be that the data doesn't have to packetize over the network... although I'm certain that can vary depending on the type of foreign link / linked server used at that point.

    The way I usually describe Linked Servers to end developers is by comparing it to MS Access and linking to ODBC connections and then querying by joining across those connections. SQL Server may have larger memory and file size allocated to it, but in principle the overall process is basically the same.

    The use of SSIS has some caveats and some significant overhead, both in development and in maintenance/changes. A steady staging environment is definitely a step up in the architectural design of a data system, and greatly desirable (I recommend one for all first-base reporting solutions) to keep large reporting queries off of the live OLTP server(s). Being in the ETL space, SSIS (and other staging technologies) also does not get the proverbial "world" joined directly to it, and the user-security risks are far fewer than with linked servers. But these do not come close to outperforming pre-filtered temp tables, especially when immediate, live data is required.

  • The key point being to build the temp table, i.e., the reduced-size result set on the remote box, and then reference that table. It is still problematic if the results are driven by local elements, meaning you don't really know what to pull remotely until you join to the local table. You are right about reduced column sets. But then you are limiting things too, and at that point you really are talking about the actual functionality required.

    I think what it really means is that there is not a really good heterogeneous join system out there, or at least not in SQL Server, though it is probably good enough for most things. You could design a minimal set of data transfer that is pretty close to the least amount of time, space, and data transfer possible. But it gets hard to balance lots of divergent requests. It is as if a human has to be the cost optimizer, because the built-in optimizer thinks fairly simply. Yet the fix could be rule-based, and so dynamic SQL could build itself.

    Is there a contest out there to build the fastest, least packet-hungry heterogeneous (throw in most secure) join across some well-defined set of hardware/software? And if so, what is the prize? I propose that the solution expands bandwidth on the internet as well as moves data across it. Which, after all, is sort of the same thing...

  • I don't have any experience with linked servers, and I wonder if anyone could clarify a question about the article's script. The article states:

    --POPULATE A TEMPORARY TABLE WITH DATA FROM THE LINKED SERVER
    SELECT StateProvinceID, StateProvinceCode
    INTO #temp
    FROM [SERVER02].[AdventureWorks].[Person].[StateProvince]
    GO

    --JOIN HOST SERVER TABLES WITH THE TEMPORARY TABLE
    SELECT a.FirstName, a.LastName, b.AddressLine1, b.City, c.StateProvinceCode
    FROM [SERVER01].[AdventureWorks].[Person].[Contact] a
    INNER JOIN [SERVER01].[AdventureWorks].[Person].[Address] b ON a.ContactID = b.AddressID
    INNER JOIN #temp c ON b.StateProvinceID = c.StateProvinceID
    GO

    In the above code the temp table #temp is being created on the local server, SERVER01, right? Not on SERVER02, so the data is being moved to the local server first?

    Also, in the second piece of code, where the local/host server tables are joined to the temp table, presuming the temp table is on the local host SERVER01, would the query really have to use the fully qualified four-part names (where SERVER01 is given)? Couldn't the syntax just start with [AdventureWorks].Person.Contact, or even Person.Contact?

    Thanks for your help!

  • Working in an environment with multiple security bands, multiple networks, and load-balanced servers, there's no way linked servers are possible, practical, or desirable.

    SSIS or some other tool that allows heterogeneous data connections is a good solution for dealing with these situations.

    In a smaller shop linked servers might be the preferred solution, but I really haven't seen it done in any of my work environments since we brought in SQL Server 2005.

  • My company has a central database server, and all other database servers subscribe to that central server to keep a local copy (nightly replicated). That way I don't need to use linked server queries to read variables from the central DB server. We are using three-part queries (db.schema.tablename) on the same server. Now managers want the central server to keep the server details, which means making all the three-part queries become linked server queries.

    I am opposing that but don't have alternatives. Pulling to a local table does not work, as there are too many tables. In short, managers want to execute everything from the central DB server by using server IDs for the other servers. Any suggestions to address this?

    Thanks,
    Naveen.
    Every thought is a cause and every condition an effect

  • Naveen PK (7/23/2015)


    My company has a central database server, and all other database servers subscribe to that central server to keep a local copy (nightly replicated). That way I don't need to use linked server queries to read variables from the central DB server. We are using three-part queries (db.schema.tablename) on the same server. Now managers want the central server to keep the server details, which means making all the three-part queries become linked server queries.

    I am opposing that but don't have alternatives. Pulling to a local table does not work, as there are too many tables. In short, managers want to execute everything from the central DB server by using server IDs for the other servers. Any suggestions to address this?

    Any query that involves context switching between multiple database instances and pulling data across the network wire will perform significantly slower than a query that joins only to tables within its own instance. Even with optimal indexing and expert usage of query hints, you will still be looking at a 25% runtime increase at a minimum. The network wire is a bottleneck, and there is no way to get around that unless you co-locate all the instances on the same physical machine.

    The best way to approach this would be to take one of your more complex stored procedures, retrofit it to use distributed joins, deploy it to production under a different name, and then compare the performance of the distributed-join version versus the locally joined version using the same parameters. If you can prove that distributed queries will significantly impact your queries in a bad way, then you will create F.U.D. in the mind of whoever is advocating this idea, and hopefully they'll forget the whole thing.

    https://en.wikipedia.org/wiki/Fear,_uncertainty_and_doubt

    Or... maybe after experimenting with it you'll discover it's not a significant impact after all.
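
    One simple way to run that side-by-side comparison (the procedure names and parameter here are hypothetical; the "_Distributed" version is the retrofitted copy) is to capture elapsed-time statistics for both versions in the same session:

    --Hypothetical procedure names; compare the CPU and elapsed times reported in the Messages tab.
    SET STATISTICS TIME ON;

    EXEC dbo.usp_ComplexReport             @StartDate = '20150101';  -- local joins
    EXEC dbo.usp_ComplexReport_Distributed @StartDate = '20150101';  -- linked server joins

    SET STATISTICS TIME OFF;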

    However, if management is hell-bent on going down this path of distributed joins regardless of the impact, then there are ways to mitigate the issue. For example, you can use OPENQUERY to pull the remote recordset you need into a local temp table, and then join your local tables to the temp table.

    http://blogs.msdn.com/b/sqlsakthi/archive/2011/05/09/best-performer-distributed-query-four-part-or-openquery-when-executing-linked-server-queries-in-sql-server.aspx

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Yes Eric, it is worth experimenting to find the performance issues before saying yes or no. I would triple the data in some large tables (or scale it based on growth rate) to ensure that this still works when the data grows in the future.

    Thanks,
    Naveen.
    Every thought is a cause and every condition an effect

  • Stan Kulp-439977 - Sunday, February 26, 2012 9:47 PM

    Comments posted to this topic are about the item How to join a local table to a remote table by way of a linked server

    Hi,
      This is a nice article, thanks for it.
      But I would like to know: how do you join a local table to a remote table by way of a linked server using an EXEC ... AT UPDATE query?
    Thanks in advance
