Data Transfer queries

  • Hi there,

    I'd like some advice on my situation at present. I have been asked to create a data warehouse solution and it is my first ever attempt at this.

    I am using SQL 2012.

    We have software that is currently hosted for us, which means that until recently we did not have access to any of the data. Going forward, the supplier of this software is to send us a nightly backup of the database. This will be uploaded to an SFTP site as a .BAK file every night.

    At the moment the only process I have in place is taking this .BAK file, moving it from the SFTP site, and then restoring it every night to a database called "test_stage".

    What I then want to do is transfer the tables from "test_stage" to another database, "test_cdc". At the moment, though, every time I transfer the data it duplicates rows.

    For example, in a "client" table in "test_stage" I have two rows. If I run my SSIS package twice I get 4 rows as it duplicates the data. How do I get it to only import rows that don't already exist in my destination, "test_cdc"?

    The plan going forward (and again, I have no idea if this is the best plan): because my "source database" is getting restored every night, Change Data Capture won't work on it, as I guess the CDC tables will be lost if I were to enable change data capture on those tables.

    Ideally I want to take data from "test_stage" and ETL it to test_cdc without duplicating data. Then I would enable CDC on that database and incrementally move just the data modifications to another db called "test_dw", which will be the actual database that users report on.
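    (For what it's worth, I assume enabling CDC on test_cdc would look something like the sketch below; the dbo.CLIENT table is only an example:)

    USE test_cdc;
    GO
    -- Enable CDC at the database level first, then per table to be tracked.
    EXEC sys.sp_cdc_enable_db;
    GO
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'CLIENT',  -- example table only
        @role_name     = NULL;       -- no gating role
    GO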

    Is the above even possible - am I over complicating things?

    Little bit lost what to do for the best.

    Thanks

    when you want to transfer data from test_stage to test_cdc, do you want to do any transformations, or just move data from one environment to the other?

    If you are just moving data, you can always use a primary key / last updated date column in each table in test_cdc to know how much data you already have, and configure your SSIS package to always pull data after the last updated date / max(primary key) value. Pull the last updated date / max key into a variable and pass it to a query that does the incremental load.

    e.g. you can put a query like the one below in the source of your data flow task:

    select * from tableA where primaryKeyValue > @[User::LastMaxPrimaryKey]

    Note: This will always pull only new rows but not the updated rows..
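    As a rough sketch (table and column names are only placeholders), the query that populates that variable could be run in an Execute SQL Task against test_cdc before the data flow:

    -- High-water mark: the largest key already loaded into the destination.
    select max(primaryKeyValue) as LastMaxPrimaryKey
    from tableA;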


  • You can opt to delete all the existing data in your final table(s) and then import everything fresh. This would eliminate the problem of duplicate records. That or you could only import the records that have changed or don't exist but that would be a lot more work.
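    For example, a truncate-and-reload could be as simple as the sketch below (assuming, purely for illustration, a CLIENT table with client_id and client_name):

    -- Wipe the destination table, then reload everything from the staging copy.
    TRUNCATE TABLE test_cdc.dbo.CLIENT;

    INSERT INTO test_cdc.dbo.CLIENT (client_id, client_name)
    SELECT client_id, client_name
    FROM   test_stage.dbo.CLIENT;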

    Mark

  • Forgive me if I have oversimplified or misunderstood the issue, but this is how I understand it:

    You are getting a full backup of your DB nightly (test_stage) and you wish to capture all data modifications (new and updated records). Only these data modifications should be applied to your reporting database (test_dw). If this is what you are after then great - you don't really need to implement Change Data Capture to accomplish this.

    But if you need to capture the actual changes (Bill Smith was inserted, updated to Bill Smyth, updated to B Smyth, updated to Bill Smith, Bill Smith was deleted) then it will be a bit more complicated than I have the time/expertise/confidence to explain in a forum post.

    So I will hope the first scenario is correct and attempt to offer one possible solution to the issue (at a high level). If this is indeed what you need to accomplish and you need more detail please let me know.

    From 10,000 feet as I understand it, your existing package(s) restores the backup (test_stage) and "copies" (inserts?) the data from test_stage into test_cdc. I would suggest that you add Lookups to check if the data you are processing from test_stage exists in test_cdc before inserting. Then you can direct the matching rows to be processed one way (i.e. update/ignore - whatever your requirements dictate) and direct the non-matching rows (i.e. new records) to be inserted into test_cdc. Again - this is a 10,000 foot solution and if you believe it might satisfy your issue I'll be happy to get into the details with you.
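    In set-based T-SQL terms, the "no match" path boils down to something like the sketch below (an illustration only, assuming an example CLIENT table keyed on client_id; the SSIS Lookup achieves the same effect row-by-row against its cache):

    -- Insert only the staging rows whose key is not already in the destination.
    INSERT INTO test_cdc.dbo.CLIENT (client_id, client_name)
    SELECT s.client_id, s.client_name
    FROM   test_stage.dbo.CLIENT AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   test_cdc.dbo.CLIENT AS d
                       WHERE  d.client_id = s.client_id);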

    Hopefully this was helpful - if not, sorry! Best of luck!

  • sdevanny (7/10/2013)


    Forgive me if I have oversimplified or misunderstood the issue, but this is how I understand it:

    From 10,000 feet as I understand it, your existing package(s) restores the backup (test_stage) and "copies" (inserts?) the data from test_stage into test_cdc. I would suggest that you add Lookups to check if the data you are processing from test_stage exists in test_cdc before inserting. Then you can direct the matching rows to be processed one way (i.e. update/ignore - whatever your requirements dictate) and direct the non-matching rows (i.e. new records) to be inserted into test_cdc. Again - this is a 10,000 foot solution and if you believe it might satisfy your issue I'll be happy to get into the details with you.

    Thanks for everyone that has got back to me.

    The above scenario is basically what I want. It would appear that the business does not need the level of reporting that CDC gives you so all I want to do is transfer the "new" records from test_stage to test_cdc.

    I have now created a simple package.

    It has 1 Data Flow Task.

    Within that Dataflow task I have -

    1 Source Assistant

    1 Lookup

    1 Destination

    The Source Assistant is connected to test_stage and looks at the CLIENT table.

    Two columns in this table - client_id and client_name

    This connects to the Lookup.

    Full cache mode, OLE DB connection manager, and Redirect rows to no match output.

    On the Connection manager page this looks at test_stage and the CLIENT table.

    On the Columns page, client_id from Available Input Columns is connected to client_id in Available Lookup Columns.

    This is then connected to the Destination, which uses a connection manager to test_cdc and a table called CLIENT.

    On the Mappings page, client_id and client_name are mapped to the corresponding columns.

    I run the package and all 12 entries from client_ref in test_stage go to test_cdc. Brilliant.

    I run it again and 12 entries copy again. This is what I am trying to avoid. How, with the lookup, do I just copy what doesn't exist in the destination table?

    Thanks

    I'm not sure why your lookup is directing Matching Output to your destination... everything you describe seems as it should be.

    For your particular scenario, this is how I would set up my Lookup:

    General:

    Cache mode - Full cache

    Connection type - OLE DB connection manager

    Matching entries - Redirect rows to no match output

    Connection:

    OLE DB connection manager - [the OLE DB connection manager of your destination]

    Use results of an SQL query - write a query that selects only the field(s) you need to get the granularity (uniqueness) you need from your destination table. For your case it sounds like you need something akin to select id from [destination table] (see the sketch after this list).

    Columns:

    drag a connection from the id in Available Input Columns to the id in Available Lookup Columns

    Then make sure that when you drag the data path to your destination, the Input Output Selection dialog box is set like this:

    Output - Lookup No Match Output

    Input - [the default should be your destination that you dragged the path to]
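    As a sketch, for the CLIENT example above the lookup query might be nothing more than (illustrative only):

    -- Only the key column is needed in the lookup cache to spot existing rows.
    select client_id
    from dbo.CLIENT;  -- run against the test_cdc (destination) connection manager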

    The only other thing I can think of is maybe to throw a Data Viewer onto the Lookup No Match Output path and see what values are going through, then manually compare the values by querying the destination table. Perhaps the id (or another field, if you were using the table instead of a select in your lookup) is different?

    You could always add another destination for Matching Output just for debug purposes to see the behavior.

    Hope that helps - good luck!

  • Thank you so so much.

    I really appreciate the time you have taken to reply to me.

    This is EXACTLY what I'm after.

    But now that I'm thinking about it, this won't be useful if the client_name changes in the test_stage database. This would then mean that it won't import because the client_id already exists in the destination test_cdc.

    May have to rethink this after all your time 🙁

  • just a thought here...

    Direct your Lookup Match Output to another Lookup - use the same setup as you did before, but this time look up and select both the id and the name. Direct that second Lookup's No Match Output to an update... or to another staging table and do a set-based update in an Execute SQL Task. I'll explain the differences below.

    If you don't mind row-by-row processing, map the non-matching output of the second lookup to an OLE DB Command and write an update to update the name based on the id. This will do in a pinch but will take a really long time if you have a lot of records.
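    (If you go the OLE DB Command route, its SqlCommand property would hold a parameterised statement along these lines, with the two ? markers mapped to the incoming name and id columns - illustrative only:)

    UPDATE dbo.CLIENT
    SET    client_name = ?   -- Param_0: mapped to the incoming client_name
    WHERE  client_id   = ?;  -- Param_1: mapped to the incoming client_id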

    I suggest creating a staging table and directing the non-matching output of the second lookup to be inserted into it.

    Then back in the control flow, after the data flow task, use an Execute SQL Task to perform a set-based update on your destination table from your staging table, joining on id.

    If you use the staging table method, be sure to put another Execute SQL task before your data flow task to truncate the staging table 🙂
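    A rough sketch of that pattern, assuming a staging table called stg_CLIENT (the name and columns are illustrative only):

    -- Execute SQL Task before the data flow: empty the staging table.
    TRUNCATE TABLE test_cdc.dbo.stg_CLIENT;

    -- Execute SQL Task after the data flow: apply the changed names in one set-based update.
    UPDATE d
    SET    d.client_name = s.client_name
    FROM   test_cdc.dbo.CLIENT AS d
    JOIN   test_cdc.dbo.stg_CLIENT AS s
           ON s.client_id = d.client_id;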

    Good luck!

  • Look at Slowly Changing Dimensions.

    I scanned the thread but didn't see it mentioned. The SCD wizard will walk you through the steps required.

    What is the source?

    What is the primary key to identify an existing record?

    What changes do you want to capture?

    Do you want to keep historical versions of the records?

    What fields should cause an error if they change (e.g. supplier ID)?

    SCD will not, however, pick up physically deleted records, as it only compares records from the source with records in the target; if a record does not exist in the source it will not get compared. If the record is updated with a flag to identify it as deleted, rather than being actually deleted, then it will be OK. SCD can be slow because it will compare all columns in all records. You can speed it up if there is a field you can use to define a delta (e.g. LastModifiedDateTime > today - 5 days).
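    For instance, the data flow's source query could be limited to a recent window along these lines (a sketch only; the column name is just an example):

    -- Only compare rows modified in the last 5 days, so SCD has far fewer records to check.
    SELECT *
    FROM   dbo.CLIENT
    WHERE  LastModifiedDateTime > DATEADD(DAY, -5, GETDATE());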

