
Loading change data from AS400 (or other source systems)
Posted Saturday, December 08, 2012 2:44 PM
shannonjk (12/7/2012)
I am aware of how to query data based on the entry dates and change dates. If it were a simple matter of throwing together some SQL then I wouldn't have needed to post this question. Perhaps I should have stated my original question better, though.

I am concerned with the HOW of this: what data flow items do I use to monitor the absent keys, newer entry dates, and altered change dates? Or is there a good link that explains this methodology?

Since you already know how to query your source data, I don't think it's an overly complex task to update your warehouse -- perhaps a little more than throwing together some T-SQL. You mention that you have 75+ source tables, so it's not a little thing either. But once you get the idea for deletes, updates and additions, you can apply it to all of your other tables.

Break it down into parts:

For your deletes from the warehouse:
You don't have to monitor your deleted business keys, just determine which ones that are present in your DW each night are no longer in your source data. These are the records to remove from your Fact tables in the DW. You can do this using an SSIS Lookup transformation. If you don't find the business key, route that row down the path to a SQL Command to log it and delete it from your fact table.

For deletes, remove records from your fact tables before your dimension tables because of the FK referential integrity on your tables.
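
For illustration, here is a rough set-based sketch of that same delete detection in T-SQL rather than a Lookup transformation. All of the names (StgCustomer as a staged copy of tonight's source business keys, DimCustomer, FactSales, DeleteLog) are hypothetical, so adapt them to your own schema.

-- Hypothetical names throughout; StgCustomer holds the business keys pulled from the source tonight.
IF OBJECT_ID('tempdb..#KeysToDelete') IS NOT NULL DROP TABLE #KeysToDelete;

-- Find the keys that are in the DW but no longer in the source...
SELECT d.BusinessKey
INTO #KeysToDelete
FROM dbo.DimCustomer AS d
LEFT JOIN dbo.StgCustomer AS s
  ON s.BusinessKey = d.BusinessKey
WHERE s.BusinessKey IS NULL;

-- ...log them, then delete from the fact table first (FK integrity), then the dimension.
INSERT INTO dbo.DeleteLog (BusinessKey, DeletedDate)
SELECT BusinessKey, GETDATE() FROM #KeysToDelete;

DELETE f
FROM dbo.FactSales AS f
JOIN #KeysToDelete AS k ON k.BusinessKey = f.CustomerBusinessKey;

DELETE d
FROM dbo.DimCustomer AS d
JOIN #KeysToDelete AS k ON k.BusinessKey = d.BusinessKey;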

For additions:
For the same FK reasons, do your additions to your dimension tables first, then your fact tables. You know what records need to be inserted into the DW because you have the source system entry dates since last night (or the last time the ETL finished). You can basically use your current ETL packages; just change the source transformations to add WHERE EntryDate > @LastETLRunDate.
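
As a concrete sketch of that filter, assuming a hypothetical control table dbo.ETLControl that records when each run finished, the source query might look something like this:

DECLARE @LastETLRunDate DATETIME;

-- Hypothetical control table holding the completion time of each ETL run.
SELECT @LastETLRunDate = MAX(RunCompletedDate)
FROM dbo.ETLControl;

SELECT CustomerKey, CustomerName, EntryDate, ChangeDate
FROM dbo.SourceCustomer            -- hypothetical source table
WHERE EntryDate > @LastETLRunDate;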

For updates:
Do your changes after your additions so that any new records in the dimension tables will be present if you have to update your fact table to point to those new dimension records.

You know what records potentially need to be updated in the DW by the change date in your source system. You need to SELECT these records and compare them against the fields in the DW. You can do this simply in a Conditional Split; any records that have changed get routed down the update path to a SQL Command to update those records. Take a look at: http://bennyaustin.wordpress.com/2010/05/29/alternatives-to-ssis-scd-wizard-component/. This shows you a couple of options other than the dismal SCD transformation in SSIS.
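
If you'd rather do that comparison set-based, a rough T-SQL equivalent (again with hypothetical table and column names) is an UPDATE that only touches rows whose compared columns actually differ:

DECLARE @LastETLRunDate DATETIME = '20121207';  -- would come from your control table

UPDATE d
SET    d.CustomerName = s.CustomerName,
       d.City         = s.City
FROM   dbo.DimCustomer AS d
JOIN   dbo.StgCustomer AS s
  ON   s.BusinessKey = d.BusinessKey
WHERE  s.ChangeDate > @LastETLRunDate
  AND (   ISNULL(d.CustomerName, '') <> ISNULL(s.CustomerName, '')
       OR ISNULL(d.City, '')         <> ISNULL(s.City, '') );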



Regarding Jeff's comments that this can be done in T-SQL; he's 100% spot on. SSIS is just another tool in your belt. Use whatever tool you (and your team) are most comfortable with. I do all of these types of jobs with SSIS. A T-SQL wizard like Jeff can do the same tasks using T-SQL.



If I've been talking at too high a level of generalities/abstractions, please ask further questions.

Take a look at Andy Leonard's Stairway series here on SSC where he talks about adding, deleting and updating records in an incremental process. http://www.sqlservercentral.com/stairway/72494/

HTH,
Rob
Post #1394359
Posted Saturday, December 08, 2012 2:53 PM
shannonjk (12/8/2012)
Also one further quick question; since I do have 75+ tables I am loading on a nightly basis...would I just create multiple jobs to handle this load instead of multiple SSIS packages?

I'm assuming you have an overnight window where your data is not being updated.

I would imagine that you'd have an SSIS package for each DW dimension and fact table; similar tables may be combined. Those 75 tables probably break down into a smaller set of logically independent groupings -- say customers, sales, inventory, and so on. You could create separate SQL Agent jobs that run those SSIS packages in parallel. Of course, there would have to be no dependencies between them; if sales needs the customers to be updated first, then they would go in one package.
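
As a minimal sketch (the job name and package path are hypothetical), each such job could just be a CmdExec step that calls dtexec for its package; create one per independent grouping so they can run side by side:

EXEC msdb.dbo.sp_add_job       @job_name = N'DW Load - Customers';
EXEC msdb.dbo.sp_add_jobserver @job_name = N'DW Load - Customers';
EXEC msdb.dbo.sp_add_jobstep
     @job_name  = N'DW Load - Customers',
     @step_name = N'Run LoadCustomers package',
     @subsystem = N'CmdExec',
     @command   = N'dtexec /FILE "C:\ETL\LoadCustomers.dtsx"';  -- hypothetical package path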

HTH,
Rob
Post #1394360
Posted Saturday, December 08, 2012 3:37 PM


Hi Shannon,

I didn't have the time to build you a complete system, but this is an example of what I'm talking about. Details, some programmer's notes, and a runnable example are in the code that follows. Each job that you start will run asynchronously. You can start as many jobs as you'd like. Just remember that SQL Server does have some limits; if you start 75 import jobs all at once, you might just peg I/O to the ceiling and CPU to the wall.

The "command" you pass to this proc could, in fact, be a stored procedure name and parameters that would do your imports for you using just a single proc and a cofiguration table (or direct commands... which ever you prefer).

IF OBJECT_ID('dbo.CreateExecSelfDeletingTsqlJob','P') IS NOT NULL
    DROP PROCEDURE dbo.CreateExecSelfDeletingTsqlJob;
GO
CREATE PROCEDURE dbo.CreateExecSelfDeletingTsqlJob
/**********************************************************************************************************************
Purpose:
Creates a new job based on the input parameters and executes it.

This example stored procedure has no error checking but will identify if the job started. You can add in the error
checking that you think is necessary.

Programmer's Notes:
1. The job that is created (along with its history) is dropped upon completion whether successful or not.
This proc could be modified to drop the job and retain the history using sp_delete_job instead of having
it automatically drop. That way, you could test to see if the job completed successfully or not.

Usage Example:

DECLARE @IsJobStarted INT,
@JobID UNIQUEIDENTIFIER,
@Command NVARCHAR(MAX)
;
SELECT @Command = N'
IF OBJECT_ID(''tempdb.dbo.MyObjects'',''U'') IS NOT NULL
DROP TABLE tempdb.dbo.MyObjects;
SELECT * --for example only
INTO TempDB.dbo.MyObjects
FROM sys.Objects
;'
EXEC @IsJobStarted = dbo.CreateExecSelfDeletingTsqlJob
@pJobName = N'TestJob',
@pDatabase = N'Master',
@pCommand = @Command,
@pJobID = @JobID OUTPUT
;
--===== Show that the job exists
SELECT * FROM msdb.dbo.sysjobactivity WHERE job_id = @JobID;
PRINT @IsJobStarted;
PRINT @JobID;

--===== Show that the job completed and deleted itself
WAITFOR DELAY '00:00:01'
SELECT * FROM msdb.dbo.sysjobactivity WHERE job_id = @JobID

Revision History:
Rev 00 - 08 Dec 2012 - Jeff Moden
**********************************************************************************************************************/
--===== Define the IO for this procedure
@pJobName SYSNAME,
@pDatabase SYSNAME,
@pCommand NVARCHAR(MAX),
@pJobID UNIQUEIDENTIFIER = NULL OUTPUT,
@pServerInstance SYSNAME = @@SERVERNAME
AS

--===== This creates a self deleting job (see comment in code)
EXEC msdb.dbo.sp_add_job
@job_name = @pJobName,
@enabled = 1,
@notify_level_eventlog = 0,
@notify_level_email = 2,
@notify_level_netsend = 2,
@notify_level_page = 2,
@delete_level = 3, --Makes a self deleting job at job completion
@category_name = N'[Uncategorized (Local)]',
@owner_login_name = N'sa',
@job_id = @pJobID OUTPUT --You'll need this as part of the return
;
--===== This identifies which server instance to run the job on
EXEC msdb.dbo.sp_add_jobserver
@job_name = @pJobName,
@server_name = @pServerInstance
;
--===== This adds the job step. Note that the "command" could be retrieved
-- from a configuration table here and could include parameters. So could
-- the database name and the name of the job step (although 'Step 1' works).
EXEC msdb.dbo.sp_add_jobstep
@job_name = @pJobName,
@step_name = N'Step 1',
@step_id = 1,
@cmdexec_success_code = 0,
@on_success_action = 1,
@on_fail_action = 2,
@retry_attempts = 0,
@retry_interval = 0,
@os_run_priority = 0,
@subsystem = N'TSQL',
@command = @pCommand,
@database_name = @pDatabase,
@database_user_name = N'dbo',
@flags = 0
;
--===== This just adds some stuff to the job like which step to start on.
-- It duplicates a lot of the stuff from above (like the auto-delete)
-- so you need to be sure to use the same settings here.
EXEC msdb.dbo.sp_update_job
@job_name = @pJobName,
@enabled = 1,
@start_step_id = 1,
@notify_level_eventlog = 0,
@notify_level_email = 2,
@notify_level_netsend = 2,
@notify_level_page = 2,
@delete_level = 3, --Makes a self deleting job at job completion
@description = N'',
@category_name = N'[Uncategorized (Local)]',
@owner_login_name = N'sa',
@notify_email_operator_name = N'',
@notify_netsend_operator_name = N'',
@notify_page_operator_name = N''
;
--===== Start the job
DECLARE @pIsJobStarted INT
;
EXEC @pIsJobStarted = msdb.dbo.sp_start_job
@job_name = @pJobName,
@server_name = @pServerInstance,
@step_name = 'Step 1'
;
--===== Return 1 if the job was started, 0 if it was not
RETURN ABS(@pIsJobStarted-1)
;
GO
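
For illustration, a sketch of driving that proc from a configuration table; dbo.ETLTableConfig and its columns are hypothetical, and each enabled row fires one self-deleting job:

DECLARE @TableName SYSNAME,
        @Command   NVARCHAR(MAX),
        @JobID     UNIQUEIDENTIFIER;

DECLARE cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT TableName, LoadCommand        -- hypothetical configuration table
    FROM dbo.ETLTableConfig
    WHERE Enabled = 1;

OPEN cur;
FETCH NEXT FROM cur INTO @TableName, @Command;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC dbo.CreateExecSelfDeletingTsqlJob
         @pJobName  = @TableName,        -- one job per source table
         @pDatabase = N'YourDW',         -- hypothetical target database
         @pCommand  = @Command,
         @pJobID    = @JobID OUTPUT;

    FETCH NEXT FROM cur INTO @TableName, @Command;
END;
CLOSE cur;
DEALLOCATE cur;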




--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

"Change is inevitable. Change for the better is not." -- 04 August 2013
(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1394362
Posted Sunday, December 09, 2012 12:35 AM
Jeff, thank you, that makes sense. I will test this here shortly (or possibly tomorrow as I am getting sleepy!) and run some tests on the OPENQUERY. If I run into any hiccups with that I will let you know.

And yes, in regards to the jobs running concurrently, I have done quite a lot of extensive testing. It is pretty much 9 tables that make up 95% of the data, and the rest are much smaller tables, so I pretty much run the first 9 in serial, then the rest in a parallel chain.

I will need to figure out how to execute the next batch of jobs once the first batch completes, but I guess one step at a time :)

Robert, I will give this a whirl as well and weigh the value of both methods to see which will be best for my company. Since we have all these SSIS packages in place I am leaning toward that, but if Jeff's solution turns out to be world changing, I may still be swayed that way ;)


Link to my blog http://notyelf.com/
Post #1394384
Posted Wednesday, December 12, 2012 12:39 PM
OK, so I am finally getting the testing done on this.

So far I am finding OPENQUERY to be a slower method. Are there particular things I should be utilizing in order to accomplish this? I am simply starting off with the basic insert script of

INSERT INTO destinationtable ([key], column1, column2)
SELECT t1.[key], t1.column1, t1.column2
FROM OPENQUERY(sourceserv, 'select key, column1, column2 from table') AS t1
LEFT JOIN destinationtable AS t2
  ON t1.[key] = t2.[key]
WHERE t2.[key] IS NULL;

It is taking quite a long time to accomplish but perhaps I am not using the best method here?


Link to my blog http://notyelf.com/
Post #1395849
Posted Friday, December 14, 2012 1:02 PM
Well, it is looking like the SSIS method is not going to work either... sigh.

Apparently there is a bug when using a Lookup transform on DB2 data when it is a Decimal data type. Unfortunately, several columns in our source system (DB2) are Decimal, and many of them would need to be compared, making this bug rather prohibitive! However, I would still like to thank you, Robert, as I did read the Stairway to SSIS series, learned quite a bit, and was successful at incremental loads on several tables; there are just some major ones that won't work because of the data conversion issue mentioned above!

So it's back to the OPENQUERY method. Hopefully Jeff can chime back in here regarding some more effective methods :)

I have also revised my insertion and update methods based on the entry and change dates (I had forgotten to do so before) and that has sped things up considerably. However, the deletion process still takes forever, as it has to compare the whole source table against the destination table to find entries that are no longer there, which takes just as long as loading the data.
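
One possible way to speed that up (purely a sketch, reusing the placeholder names from the earlier snippet) is to pull only the source keys across the linked server into a local temp table and do the anti-join locally:

-- Stage just the source keys locally so the delete comparison runs on this side of the link.
IF OBJECT_ID('tempdb..#SourceKeys') IS NOT NULL DROP TABLE #SourceKeys;

SELECT [key]
INTO #SourceKeys
FROM OPENQUERY(sourceserv, 'select key from table');

CREATE CLUSTERED INDEX IX_SourceKeys ON #SourceKeys ([key]);

-- Anything in the destination that is no longer among the staged source keys gets removed.
DELETE t2
FROM destinationtable AS t2
LEFT JOIN #SourceKeys AS s
  ON s.[key] = t2.[key]
WHERE s.[key] IS NULL;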


Link to my blog http://notyelf.com/
Post #1396796
Posted Saturday, December 15, 2012 4:13 PM
shannonjk (12/14/2012)
Well, it is looking like the SSIS method is not going to work either... sigh.

Apparently there is a bug when using a Lookup transform on DB2 data when it is a Decimal data type. Unfortunately, several columns in our source system (DB2) are Decimal, and many of them would need to be compared, making this bug rather prohibitive! However, I would still like to thank you, Robert, as I did read the Stairway to SSIS series, learned quite a bit, and was successful at incremental loads on several tables; there are just some major ones that won't work because of the data conversion issue mentioned above!

So it's back to the OPENQUERY method. Hopefully Jeff can chime back in here regarding some more effective methods :)

I have also revised my insertion and update methods based on the entry and change dates (I had forgotten to do so before) and that has sped things up considerably. However, the deletion process still takes forever, as it has to compare the whole source table against the destination table to find entries that are no longer there, which takes just as long as loading the data.

I'm not familiar with the SSIS - DB2 bug, but why couldn't you cast the decimal to something else, say an integer, and then undo the math on the SSIS side? So if the decimal is 123.45, multiply by 100 and bring it into SSIS as 12345. Divide by 100 in SSIS and then write the 123.45 into your DW. I'm sure it's not as simple as that, but there must be a way to work around this -- especially since you're already bringing that data into your DW.

Just a thought,
Rob
Post #1396916
Posted Monday, December 17, 2012 6:32 AM


What is the actual error that you are getting?

The reason that I ask is because I get a bogus truncation error on about a dozen tables from DB2 to staging.

If you use OPENQUERY, you can use a MERGE statement.
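
For instance -- a rough sketch only, reusing the placeholder names from the earlier snippet; note that it still pulls the whole remote rowset across the linked server:

MERGE destinationtable AS t2
USING ( SELECT [key], column1, column2
        FROM OPENQUERY(sourceserv, 'select key, column1, column2 from table') ) AS t1
   ON t1.[key] = t2.[key]
WHEN MATCHED AND (t1.column1 <> t2.column1 OR t1.column2 <> t2.column2) THEN
    -- add ISNULL handling here if these columns are nullable
    UPDATE SET column1 = t1.column1,
               column2 = t1.column2
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([key], column1, column2)
    VALUES (t1.[key], t1.column1, t1.column2)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;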


For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Post #1397227
Posted Monday, December 17, 2012 3:39 PM
Robert, the source table is in the Lookup transform itself, so I do not know of a way to feed that and change the data types. I have tried using 'Use the results of a SQL query' to cast the data type, but that does not work either. Additionally, doing that over 12 million rows could be a bit slow and defeat the purpose of the original point anyway ;)

Welsh, the error is that the data types do not match, yet when digging into the advanced editors and looking at the input/output columns, all the data types are exactly the same. Using the MERGE query yields the same extraordinarily slow results as the LEFT JOIN (tested and verified).


Link to my blog http://notyelf.com/
Post #1397449
Posted Monday, December 17, 2012 4:06 PM


shannonjk (12/17/2012)
Robert, the source table is in the Lookup transform itself, so I do not know of a way to feed that and change the data types. I have tried using 'Use the results of a SQL query' to cast the data type, but that does not work either. Additionally, doing that over 12 million rows could be a bit slow and defeat the purpose of the original point anyway ;)

Welsh, the error is that the data types do not match, yet when digging into the advanced editors and looking at the input/output columns, all the data types are exactly the same. Using the MERGE query yields the same extraordinarily slow results as the LEFT JOIN (tested and verified).


I never stated that the MERGE would be fast, and it is not, because you are using an OPENQUERY. I avoid it because it is terribly slow and I could not load the data within the window that I have.

Have you considered using the Data Conversion Task?

I have tables that do not have insert date/update date columns. I can't add columns to the tables because the AS400 is hosted by a vendor and it is proprietary.

I'm using a combination of a Lookup Task for inserts and a hash for updates.
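
One common way to do the hash piece (a sketch with hypothetical column names; the hash could equally be computed in an SSIS Script Component) is to store a HASHBYTES value on the warehouse row and compare only that:

-- Hypothetical columns; the stored RowHash avoids comparing every field on every load.
SELECT BusinessKey,
       HASHBYTES('SHA1',
                 ISNULL(CustomerName, '') + '|' +
                 ISNULL(City, '') + '|' +
                 ISNULL(CONVERT(VARCHAR(32), CreditLimit), '')) AS RowHash
FROM dbo.StgCustomer;

-- Rows whose computed hash differs from the RowHash stored in the DW are the ones to update.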

I'm loading hundreds of tables from the DB2 data source alone, and several are greater than 12 million rows.

I'm not using OPENQUERY. I'm redirecting the records that fail to a text file.

You have gotten a lot of good advice from the forum members; hopefully things work out for you.


For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

Post #1397463