LinkedServer - still mystified

  • I have a scenario where there is a linked server (A) that has two tables that I need to join, Accounts_A and Customers_A by Accounts_A.CustomerNumber=Customer_A.CustomerNumber (a customer can have multiple accounts). Each table has 10M or more records. Also, I have a local database (B) that has a subset of Accounts that I'm interested in (about 100k).

    So what's the right way to pull just the accounts I'm inerested in from server B into server A? My goal is to get the latest data from the remote server (our transaction server) for only the accounts I'm tracking locally...along with the most recent customer info for each account.

    Option 1: I've already tried four-part-names and joins and the performance SUCKS...though I'm not sure I fully understand why...other than I'm dealing with over 10M records coming from the remote server and I'm guessing the full set of remote data is cached locally and un-optimized and then joined to the local data. Again, this is a guess on my part as I've only "heard" that this is how joins involving linked servers work.

    Option 2: I've tried staging the data locally using sub-selects for each table I'm interested in. Obviously having the data locally when I'm ready to run my analysis/joins performs better...but it seems like an I/O pain to stage data like this.

    SELECT

    ...

    FROM [Server_A].[DB_A].dbo.Accounts_A AS aa

    INNER JOIN [Server_A].[DB_A].dbo.Customers_A AS ac

    ON aa.CustomerNumber=ac.CustomerNumber

    WHERE aa.AccountNumber IN(SELECT AccountNumber FROM dbo.Accounts_B)

    or maybe with an openquery like...

    SELECT

    ...

    FROM OPENQUERY([LinkedServerA],

    'SELECT .... FROM [Server_A].[DB_A].dbo.Accounts_A AS aa

    INNER JOIN [Server_A].[DB_A].dbo.Customers_A AS ac

    ON aa.CustomerNumber=ac.CustomerNumber') la

    WHERE la.AccountNumber IN(SELECT AccountNumber FROM dbo.Accounts_B)

    Option 3: I was thinking of adding my local server (B) as a linked server on the remote server (A). Then, from the local server (B), use four-part-names in an OPENQUERY that is sent to the remote server (A), essentially limiting the results BEFORE they get brought locally. But this seems like a kludge and might still perform poorly if my local data gets large...or might not even work (haven't tried it yet)....and the owners of the remote server might not want to add a link to my server.

    SELECT

    ...

    FROM OPENQUERY([ServerA],'SELECT...FROM [ServerB].[DB_B].dbo.Customer_B AS b INNER JOIN dbo.[Customer_A] AS a ON b.Customer_B=a.Customer_A')

    Option 4: Eventually, this will all be in SSIS and maybe it will perform better?

    Any advice?

  • I believe you have answered yourself, i assure you that SSIS will be the fastest way to get your data.

    The problem with linkedservers in SQl 2005 is that it first copies the data into a temp table on local server and then to your base table which makes it inefficient.

    If you dont want to use SSIS then try this :

    use bcp out to get data into physical files and then bcp in to get it on your local server, this will be faster than using linkedservers.

  • Thanks for the reply.

    So large amounts of data over linked servers are simply unusable, largely because the data needs to be copied to the local server in the process? Instead, I need to look into staging the data locally before doing any queries? I guess I was hoping there was more elegant solution.

    Maybe someone can help me understand something. In general, how would the following two queries compare, performance-wise:

    Query 1:SELECT ColA, ColB FROM [MyLinkedServer].[MyDB].dbo.[Table1] WHERE ColA=1

    Query 2:SELECT * FROM OPENQUERY([MyLinkedServer],'SELECT ColA, ColB FROM [MyDB].dbo.[Table1] WHERE ColA=1')

    I get that the OPENQUERY in Query 2 is explicitly causing the filter to happen on the remote server BEFORE sending the data over...but so isn't Query 1? What I'm thinking is that maybe I need to understand how/when remote data is pulled locally into temp....assuming that the "pulling of remote data to the local server" is one of the major performance problems with linked servers.

  • Another potential solution is to have a stored procedure on the linked server which accepts a number of accounts and returns the results just for those accounts. The stored procedure will probably amount to less work on the both servers, because joins are done locally and smaller of amounts of data are moved between systems. We've implemented this is a number of cases with very satisfactory results.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • That sounds like it'd perform ok. So how would you get the list of account numbers to the remote sproc?

  • In our setup, both systems are linked servers with respect to each other. I created a view (could just as easily be a table) on the calling system. The called procedure just queries those account numbers (which represents a very small amount of data to be tranferred). There are no joins across systems.

    Briefly:

    System A has the list of accounts for which detail is needed. System B is a linked server to A.

    System B has the detail information. System A is a linked server to B.

    1. A procedure on System A execs the stored procedure on System B.

    create table #catchtable (...

    insert into #catchTable

    exec SystemB.someDatabase.dbo.StoredProc

    2. The stored procedure on System B does the following:

    a. reads the list of accounts from the view (or table) on System A into a temp table (#accts)

    b. builds a clustered index over the #accts table

    c. runs the query, joining the #accts table to the primary data tables

    (so that all joins are done locally on System B)

    d. result set from query is returned to SystemA and stored in #catchTable for processing

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • An alternative to the view/table scheme above would be to pass the list as a parameter to the proc on SystemB. This might be necessary if you couldn't have System A be a linked server to System B.

    The parameter could either be XML or (in 2008) possibly a table variable. I haven't tried table variables as parameters across linked servers. There may be some gotchas there, I need to try that.

    Some of our pre-2008 code uses pipe-delimited varchar(max) parameters to hold lists instead of XML, if all elements are going into a single column. In these cases the called procedure uses a tally table to quickly parse the single parameter into individual values which are stored in a temp table or table variable.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Ah, so the key was to get the local server linked on the remote one. That opens up more options...it was a route I was considering...glad to hear I'm not crazy. Now to hope the owner of the remote server (the one with details) doesn't mind linking back to mine.

    Thanks!

  • If they won't link back, then go with the parameter passing option in my last post. We can discuss that more if you need to. Just let me know.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • @SQL (7/21/2009)

    The problem with linkedservers in SQl 2005 is that it first copies the data into a temp table on local server and then to your base table which makes it inefficient.

    Is this still true with 2008?

    I've been trying to find more documentation on this because I saw some colleges establishing a pattern using the 2008 MERGE statements to do delta processing in a warehouse over linked servers. I told them that I thought it was going copy the entire table over, hence the poor performance. But, I cannot seem find much in the way of proof to show them.

    At any rate, if you have some info or links, that'd be appreciated..

  • Couldn't find anything in BOL, but here's an article discussing the behavior.

    http://www.sqlservercentral.com/articles/Linked+Servers/62246/[/url]

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply