Best way to narrow results

  • I am trying to figure out the best way to do this...

    I have a table with millions of records for orders. I only need a small subset of that. I only need order records for certain employees. Normally, this would be a simple subquery or even a join like this:

    select * from Orders where employeeID = (select employeeID from Employees where region in (1,2))

    Only the two tables are on different servers, so I should be like this:

    select * from Server1.Database1.dbo.Orders where employeeID = (select employeeID from Server2.Database2.dbo.Employees where region in (1,2))

    But the servers are not linked servers, so I can't do that. I can do a Data Flow Task where I have a source for Server1.Database1.Orders, and a source for Server2.Database2.Employees, and then do a Merge Join, but then I have to pull everything from the Orders database first, which will take forever.

    I thought of pulling the list of employeeIDs, putting them into a variable, and then doing a ForEach loop to pull the required records from Orders, but that does not seem efficient. The reason for that is that I have simplified my example by using employees and orders here, but in reality the number of records returned by my 'employee' table is around 25,000. So the ForEach loop would have to go through 25,000 loops, and that will take forever.

    What I would really like to be able to do is take that variable and add it to the query on the Orders table. So it would be like:

    select * from Server1.Database1.dbo.Orders where employeeED in (?) and map that variable to my list of employeeIDs. But that doesn't seem to be possible.

    Any thoughts on the best way to handle this situation?

  • You could use the lookup transformation. Set the larger table as source and then the filter table in the lookup transformation. Be aware that it will have to load the filter table into cache to be able to compare. You would still read the whole orders table but won't have to sort it to do the merge join.

    Another possibility is to send the Employees table to the server that has the order table. It could be loaded into a different database so you don't have to make changes to the database that you're querying.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Thanks Luis, I thought of that too, but didn't really want to do that because it still has to load the whole large table. Unfortunately, I don't have access to add the table to that server/database. I only have read access. And asking for help from the DBA of that server is probably going to not go anywhere... at least it would take weeks or months for even a response.

  • robert.wiglesworth (11/29/2016)


    Thanks Luis, I thought of that too, but didn't really want to do that because it still has to load the whole large table. Unfortunately, I don't have access to add the table to that server/database. I only have read access. And asking for help from the DBA of that server is probably going to not go anywhere... at least it would take weeks or months for even a response.

    Unfortunately, if you can't get a linked server, the only way you're going to achieve this is by loading at least one of the tables in SSIS.

    I just did a little bit of testing (using the StackOverflow Database, as it's pretty big), and doing it this way, the best way seem to be to use a Source which grabs all the orders (just get the columns you need), and the do a Lookup to the Employee table.

    The test I did for Stackoverflow was initially to get all users that had logged this year, and then get their Posts. This caused by PC to try and cache EVERY POST, as that was my lookup (a laughable endeavour, which basically ground my PC to a halt).

    Doing a SELECT ALL from the Post table, with a lookup of SELECT * FROM StackOverflow.dbo.Users U WHERE DATEPART(YEAR, U.LastActivityDate) = 2016 gave pretty quick results. I'll admit it's still running as I type this, but my PC is still running (It has 1.3 Million rows cached right now and isn't struggling with it's tiny 8GB of RAM), and the Database is hpsted on a test Desktop for us to play around on.

    Tl;DR Lookup doesn't really cache everything, just what you need. Create a Dataflow that simply gets everythingfrom your orders table, then do a Lookup to your Employees table where your region = 1 or 2. SSIS will cache only the applicable Employees, and then complete your lookup as it writes out the data to your destination.

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply