RE: ETL Through Linked Server

SSC Guru

Points: 1003863

December 28, 2014 at 5:34 pm

bantrim (12/28/2014)
Hey everyone,
I am currently trying to optimize an ETL process that pulls around 150 tables from a OLAP-esque system through a linked server. It's the AllScripts Touchworks EMR database, if that makes a difference. I am having a difficult time fitting an incremental load into a reasonable window of time. For instance, one table, which has around 40 columns, and a total of about 50 million rows, takes around an hour just to pull one day's worth of records when filtering on a single datetime field. The query is:
SELECT *
FROM OPENQUERY([linkedserver],
'
SELECT *
FROM Works.dbo.Visit
WHERE StartDttm BETWEEN ''2014-12-23'' AND ''2014-12-24''
'
)
Which, in the end, returns only about 30k records. However, in trying different approaches, I decided to test out bcp using the following code:
bcp Works.dbo.Visit out f:\etl\visit.dat -n -S linkedserver
this extracted the entire table to flat file, but it only took about 15 minutes....add on another 3-4 to load that into the database on my end. Now, my ability to see what's happening on the remote server's end is very limited, but I know that the server has much higher specs than mine. They have 96GB memory vs. my 32GB, twice the cores I've got, etc. So how is this running faster by pulling the entire table over?
I figured that even in the worst case scenario of a full table scan, picking 30k rows out of the table that match a single criteria would certainly run faster than bringing the whole thing over. This isn't isolated to a single table, either. I finished running a test of the incremental extract/load vs. full extract/load via bcp (200GB of data). The incremental takes about 7 hours, bcp takes about 2 hours.
Any ideas on what I'm experiencing or what a better approach would be?

Yes. The BCP thing is an excellent idea but you're using the wrong method. Instead of BCPing out the whole table, use the QUERYOUT option with a query that's nearly identical to the one you used for OpenQuery.

The OpenQuery thing might be able to be execute more quickly. Does the StartDttm column on the remote table have an index on it?

You could also make the BCP go faster if it logged into the remote server directly using a "Trusted" connection so that you don't need to use the login info in clear text.

Last but not least, you've reinvented the poor-mans version of replication. It might be easier if you setup replication on the remote server.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)