Incremental load from a linked server.

  • Hi, I need to do an incremental load from a source (a linked server in my database) to a destination, which is a local database.

    I am planning to build an SSIS package that uses joins (a restricted left outer join) to pick up newly inserted rows, but how do I perform a join with a linked server? When I write a join query in an OLE DB source, only one connection can be chosen: either the linked server or the local one (they are also on different servers). For example, I have two servers, server 1 and server 2. The linked server (source) is on server 1 and the local server (destination) is on server 2. Also, my fact tables on the linked server (source) do not contain any unique key, so I may have to join on multiple columns to identify a unique record. Is it okay not to have a unique key? What is an efficient way of performing an incremental load from a linked server (on server 1) to a local database (on server 2)? Let me know if you need further details and I'll explain.

  • qifai3zoi3ziq (3/31/2014)


    Hi, I need to do an incremental load from a source (a linked server in my database) to a destination, which is a local database.

    I am planning to build an SSIS package that uses joins (a restricted left outer join) to pick up newly inserted rows, but how do I perform a join with a linked server? When I write a join query in an OLE DB source, only one connection can be chosen: either the linked server or the local one (they are also on different servers). For example, I have two servers, server 1 and server 2. The linked server (source) is on server 1 and the local server (destination) is on server 2. Also, my fact tables on the linked server (source) do not contain any unique key, so I may have to join on multiple columns to identify a unique record. Is it okay not to have a unique key? What is an efficient way of performing an incremental load from a linked server (on server 1) to a local database (on server 2)? Let me know if you need further details and I'll explain.

    If I am getting this right: a unique key should be a natural key, not a surrogate key, which means comparing all the columns that make up the key. A substitute for a natural key can be a hash of all the key columns, often more efficient than joining on tens or hundreds of columns. It is fine and normal not to have a surrogate key, but without a natural/combination key this approach does not work.

    A simple way could be to use the MERGE statement: grab everything from the source with a "timestamp" greater than the last "timestamp" at the destination into a staging table. Or use a Lookup in the data flow, where only the unmatched rows are inserted.

    Hope this helps 😉
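
    As a sketch of the staging + MERGE approach (not from the original thread; the linked-server, database, table, and column names are all hypothetical, and the staging table is assumed to already exist):

    ```sql
    -- Hypothetical names: LinkedSrv = linked server defined on server 2 pointing
    -- at server 1, SourceDb.dbo.FactSales = remote source fact table,
    -- dbo.FactSales = local destination, dbo.Staging_FactSales = local staging.

    -- 1) Pull only new rows (by a "timestamp"/identity high-water mark) into staging.
    DECLARE @LastLoaded datetime = (SELECT MAX(LoadDate) FROM dbo.FactSales);

    INSERT INTO dbo.Staging_FactSales (OrderID, ProductID, Quantity, Amount, LoadDate)
    SELECT OrderID, ProductID, Quantity, Amount, LoadDate
    FROM   LinkedSrv.SourceDb.dbo.FactSales
    WHERE  LoadDate > @LastLoaded;

    -- 2) MERGE staging into the destination; insert only rows that do not match.
    MERGE dbo.FactSales AS tgt
    USING dbo.Staging_FactSales AS src
          ON  tgt.OrderID   = src.OrderID
          AND tgt.ProductID = src.ProductID   -- composite "natural" key
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (OrderID, ProductID, Quantity, Amount, LoadDate)
        VALUES (src.OrderID, src.ProductID, src.Quantity, src.Amount, src.LoadDate);
    ```

    The four-part name (LinkedSrv.SourceDb.dbo.FactSales) is what lets one query running on server 2 reach the table on server 1, which also answers the join question.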

  • Eirikur Eiriksson (3/31/2014)


    If I am getting this right: a unique key should be a natural key, not a surrogate key, which means comparing all the columns that make up the key. A substitute for a natural key can be a hash of all the key columns, often more efficient than joining on tens or hundreds of columns. It is fine and normal not to have a surrogate key, but without a natural/combination key this approach does not work.

    A simple way could be to use the MERGE statement: grab everything from the source with a "timestamp" greater than the last "timestamp" at the destination into a staging table. Or use a Lookup in the data flow, where only the unmatched rows are inserted.

    Hope this helps 😉

    The source table does not contain any natural/unique/surrogate key, so I have no choice but to join on multiple columns to find a unique record.

    I'm not familiar with the MERGE statement; can you explain it, or point me to an article that does?

    When you say timestamp, does that mean the source table needs to have a timestamp column? I'm not allowed to make any modifications to the source table, and as of right now it doesn't contain any metadata column that can help me identify a new record. Please also explain how to join with the linked server, which is on a separate server. Thanks!

  • qifai3zoi3ziq (3/31/2014)


    The source table does not contain any natural/unique/surrogate key, so I have no choice but to join on multiple columns to find a unique record.

    I'm not familiar with the MERGE statement; can you explain it, or point me to an article that does?

    When you say timestamp, does that mean the source table needs to have a timestamp column? I'm not allowed to make any modifications to the source table, and as of right now it doesn't contain any metadata column that can help me identify a new record. Please also explain how to join with the linked server, which is on a separate server. Thanks!

    A "timestamp" here is in fact anything that guarantees the order of arrival: it can be an identity column or a date/time of the insert. It has nothing to do with the timestamp/rowversion type in SQL Server; sorry for the confusion. Now a few questions: how many rows, how many columns, how frequent is the load? The more information you provide, the better.

    😎
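
    On the question of joining with the linked server: a query running on server 2 can reference the remote table directly, in either of two ways (names below are hypothetical, not from the original thread):

    ```sql
    -- a) Four-part naming: server 2 fetches remote rows and joins them locally.
    SELECT s.*
    FROM   LinkedSrv.SourceDb.dbo.FactSales AS s
    LEFT JOIN dbo.FactSales AS d
           ON  d.OrderID   = s.OrderID
           AND d.ProductID = s.ProductID
    WHERE  d.OrderID IS NULL;   -- source rows not yet present at the destination

    -- b) OPENQUERY: the filter is evaluated on the remote server, which can
    --    mean far less data sent over the wire.
    SELECT *
    FROM OPENQUERY(LinkedSrv,
        'SELECT OrderID, ProductID, Quantity, Amount
         FROM   SourceDb.dbo.FactSales
         WHERE  LoadDate >= ''20140301''');
    ```

    Either form works inside a single OLE DB connection (the local one), which sidesteps the "only one connection per source" limitation mentioned earlier.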

  • Eirikur Eiriksson (3/31/2014)


    A "timestamp" here is in fact anything that guarantees the order of arrival: it can be an identity column or a date/time of the insert. It has nothing to do with the timestamp/rowversion type in SQL Server; sorry for the confusion. Now a few questions: how many rows, how many columns, how frequent is the load? The more information you provide, the better.

    😎

    About 10 to 20 columns, and the biggest table has 1 million+ rows.

  • If you have no sequence or date/time to go by, then unfortunately you have to grab all the source records for comparison. With these sizes, which sound manageable, my suggestion is to download the whole shebang into a staging environment and process it there. Look into HASHBYTES and MERGE as a starting point, and let me know if I can be of any further assistance. 🙂
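
    A minimal sketch of that full-compare approach, with HASHBYTES standing in for the missing key (all object names are hypothetical; in real code the concatenation would need explicit handling of NULLs and data-type conversions):

    ```sql
    -- 1) Stage the full remote table locally, computing a row hash as we go.
    SELECT *,
           HASHBYTES('SHA1',
               CONCAT(OrderID, '|', ProductID, '|', Quantity, '|', Amount)) AS RowHash
    INTO   dbo.Staging_FactSales
    FROM   LinkedSrv.SourceDb.dbo.FactSales;

    -- 2) Insert only staged rows whose hash has no match at the destination.
    INSERT INTO dbo.FactSales (OrderID, ProductID, Quantity, Amount)
    SELECT s.OrderID, s.ProductID, s.Quantity, s.Amount
    FROM   dbo.Staging_FactSales AS s
    WHERE  NOT EXISTS
           (SELECT 1
            FROM   dbo.FactSales AS d
            WHERE  HASHBYTES('SHA1',
                       CONCAT(d.OrderID, '|', d.ProductID, '|',
                              d.Quantity, '|', d.Amount)) = s.RowHash);
    ```

    Persisting the hash on the destination table (e.g. as a computed column with an index) would avoid recomputing it on every load.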
