How to create a staging table to handle incremental load

  • Hi Team,

    We are designing a Staging layer to handle incremental load. I want to start with a simple scenario to design the staging.

    In the source database there are two tables, e.g. tbl_Department and tbl_Employee. Both tables load into a single table in the destination database, e.g. tbl_EmployeRecord.

    The query which loads tbl_EmployeRecord is: SELECT EMPID, EMPNAME, DEPTNAME FROM tbl_Department D INNER JOIN tbl_Employee E ON D.DEPARTMENTID = E.DEPARTMENTID.

    Now we need to identify the incremental changes in tbl_Department and tbl_Employee, store them in staging, and load only those changes to the destination.

    The columns of the tables are:

    tbl_Department : DEPARTMENTID,DEPTNAME

    tbl_Employee : EMPID,EMPNAME,DEPARTMENTID

    tbl_EmployeRecord : EMPID,EMPNAME,DEPTNAME

    Kindly suggest how to design the staging for this to handle Insert, Update and Delete.

    Regards

    Jim

  • use "Merge" operation for incremental update

  • Hi,

    In my scenario, I'm joining two tables and loading them into a single table in the target. Will MERGE help? Please suggest. If not, any other workaround is appreciated.

    Regards

    Jim

  • Jim1234 (1/2/2014)



    tbl_Department - I would assume this should be processed first, as you are probably enforcing referential integrity (no employee can be assigned to a department that doesn't exist).

    tbl_Employee - then process this one.

    tbl_EmployeeRecord - a table? I think this should just be a view over the first 2 tables (a sketch of the view is at the end of this post).

    Load the first two using MERGE, as suggested.

    But with something like this, I would think tracking changes would also be important.

    So more thought about the overall design might be needed, or maybe I'm just reading too much into this.

    Headcount at any point in time, or knowing when and who (or what) made a change, are questions I could see coming up.
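    As a rough sketch of the view idea (assuming tbl_Department and tbl_Employee are also maintained in the destination database; the view name is illustrative):

    CREATE VIEW vw_EmployeeRecord
    AS
    -- The same join the current load query uses, exposed as a view instead of a loaded table
    SELECT E.EMPID, E.EMPNAME, D.DEPTNAME
    FROM tbl_Department D
    INNER JOIN tbl_Employee E ON D.DEPARTMENTID = E.DEPARTMENTID;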

  • Hi Jim,

    We are using something similar, and MERGE is just perfect for this, as mentioned. In our solution we create a work table and load the data into it. This table can have referential constraints to keep wrong data from coming in (since you already have that, you can skip this). In the next phase you can use MERGE to load the data into your final table in one shot. Lastly, you may want to clear the work table with a truncate or similar.

    Regarding the MERGE, you can use it like this:

    MERGE tbl_EmployeRecord AS destination
    USING (SELECT E.EMPID, E.EMPNAME, D.DEPTNAME
           FROM tbl_Department D
           INNER JOIN tbl_Employee E ON D.DEPARTMENTID = E.DEPARTMENTID) AS source
    ON destination.EMPID = source.EMPID
    WHEN NOT MATCHED THEN
        INSERT (EMPID, EMPNAME, DEPTNAME)
        VALUES (source.EMPID, source.EMPNAME, source.DEPTNAME)
    WHEN MATCHED THEN
        UPDATE SET EMPNAME = source.EMPNAME,
                   DEPTNAME = source.DEPTNAME;

    I have tried it on my sample DB and it worked perfectly; hope this helps :-). The UPDATE clause here simply overwrites both columns, but you can extend it based on your requirements.

    Regards,

    Tibor

  • By joining the 2 tables into 1, you lose the data normalization that was done.

    Do as you wish; it's your choice.

    At some point, depending on what is done, you may lose the keys to the castle.

    A key is a KEY, whereas names (even a department name) can change and overlap.

    Keep keys in the fact tables, and then pick up the descriptions as needed.

    It may not happen with this example, but if you start down this path, it will eventually happen.

  • Thanks Tibor. A few questions come to mind. The volume of data in the source (the actual tables) is very high, so we are planning a staging layer in which only those records that got inserted/updated/deleted will be present. From staging we need to load the target using the join query.

    The challenge is how to design the staging table. Should it be exactly like the source tables, or should staging have only one table like the target? And how do we handle insert/update/delete for the incremental records?

    Feel free to let me know if you need more details. Thanks.

    Jim

  • Well, you need to do some data transformation somewhere, let's say between the source and staging, or between staging and destination. To choose between these two you need to look at your process; in my view the question is how you fill the staging table.

    If you fill the staging table with an SSIS job, you probably want to do the ETL operations right in the job, so your staging table would be similar to the destination table. But if you are using a trigger, or an action from the OUTPUT clause of a query, you will want to execute it as fast as you can; in that case I would load a staging table that is similar to the source (see the sketch below).

    Anyway, if you are using the MERGE statement with a proper index on the merged tables, it can be blazing fast. And you don't even need to know whether the records were deleted, inserted or updated, because they will all be merged.
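    A minimal sketch of the OUTPUT-clause approach (the staging table stg_Employee, its column types, and the sample UPDATE are assumptions for illustration):

    -- Staging table mirroring the source table's shape (illustrative types)
    CREATE TABLE stg_Employee (EMPID INT, EMPNAME VARCHAR(100), DEPARTMENTID INT);

    -- An index on the merge key helps the subsequent MERGE run fast
    CREATE INDEX IX_stg_Employee_EMPID ON stg_Employee (EMPID);

    -- Any DML against the source can copy its changed rows into staging in the same statement
    UPDATE tbl_Employee
    SET EMPNAME = 'New Name'
    OUTPUT inserted.EMPID, inserted.EMPNAME, inserted.DEPARTMENTID
    INTO stg_Employee (EMPID, EMPNAME, DEPARTMENTID)
    WHERE EMPID = 1;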

  • If the new and changed records can be easily identified, I would side with the staging tables having the same structure as the source data. If not, and you effectively have to compare today's snapshot with the existing data, I would side with a denormalised table, and then use MERGE/EXCEPT/INTERSECT to identify the new and changed records, along the lines of the sketch below.
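    For the snapshot-comparison case, a sketch along these lines (stg_EmployeRecord is an assumed denormalised staging table with the same columns as the target) picks out the new and changed rows:

    -- Rows in today's snapshot that are new or differ from what the target already holds
    SELECT EMPID, EMPNAME, DEPTNAME FROM stg_EmployeRecord
    EXCEPT
    SELECT EMPID, EMPNAME, DEPTNAME FROM tbl_EmployeRecord;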

    This is known as Slowly Changing Dimension (SCD) data. Are you keeping history (type 2 SCD) or overwriting changed records (type 1 SCD)?

    If you have a lot of this to do, you might want to look at the Data Vault model. Briefly, DV breaks data into three components: Hubs (the entity identifiers), Links (the relationships between Hubs), and Satellites (the interesting stuff about the Hubs). In your example, EmployeeID and DepartmentID would be HUBS, EmployeeID-DepartmentID would be a LINK, and EmpName, dob, paygrade etc. would go into a SATELLITE, as would DeptName and Dept Location.
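    A very rough sketch of that split for this example (table names and types are illustrative, not a full Data Vault implementation):

    -- Hubs: one row per business key
    CREATE TABLE hub_Employee (EMPID INT PRIMARY KEY, LoadDate DATETIME);
    CREATE TABLE hub_Department (DEPARTMENTID INT PRIMARY KEY, LoadDate DATETIME);

    -- Link: the relationship between the hubs
    CREATE TABLE lnk_EmployeeDepartment (
        EMPID INT REFERENCES hub_Employee (EMPID),
        DEPARTMENTID INT REFERENCES hub_Department (DEPARTMENTID),
        LoadDate DATETIME
    );

    -- Satellites: the descriptive attributes, stamped with a load date for history
    CREATE TABLE sat_Employee (EMPID INT REFERENCES hub_Employee (EMPID), EMPNAME VARCHAR(100), LoadDate DATETIME);
    CREATE TABLE sat_Department (DEPARTMENTID INT REFERENCES hub_Department (DEPARTMENTID), DEPTNAME VARCHAR(100), LoadDate DATETIME);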

    Aaron

  • Hi Aaron,

    We are getting the incremental load from the source tables, so we are maintaining the same structure as the source tables.

    In this scenario, inserts into tbl_Department or tbl_Employee (or both) should not be a problem to load into the single target table tbl_EmployeRecord using the join query.

    Update and Delete will be the challenge. Kindly suggest whether staging tables that mirror the source tables will suffice, or whether other tables would help, for example a table to identify the dependent columns for the driving table.

    Please suggest. Any sample links, documents or table structures would help a lot!

  • Jim1234 (1/5/2014)



    If you can capture the keys for the adds/changes/deletes you are interested in with triggers on the source system, and then use those to drive the staging, you should be able to handle just about any volume. A sketch of such a trigger follows below.
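    A minimal sketch on one of the source tables (the key-capture table stg_EmployeeKeys and the 'I'/'U'/'D' markers are assumptions for illustration):

    CREATE TABLE stg_EmployeeKeys (EMPID INT, ChangeType CHAR(1), ChangeDate DATETIME DEFAULT GETDATE());
    GO

    CREATE TRIGGER trg_Employee_Capture ON tbl_Employee
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        -- New rows and the new side of updates
        INSERT INTO stg_EmployeeKeys (EMPID, ChangeType)
        SELECT i.EMPID,
               CASE WHEN EXISTS (SELECT 1 FROM deleted d WHERE d.EMPID = i.EMPID)
                    THEN 'U' ELSE 'I' END
        FROM inserted i;

        -- Deleted rows: present in deleted with no matching row in inserted
        INSERT INTO stg_EmployeeKeys (EMPID, ChangeType)
        SELECT d.EMPID, 'D'
        FROM deleted d
        WHERE NOT EXISTS (SELECT 1 FROM inserted i WHERE i.EMPID = d.EMPID);
    END;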

    Mirroring the source for the most part will make it easier to understand and maintain.

    I have a hard time understanding why you would keep saying the target has to be a single table, not a view.

    Your description is just a logical representation of the 2 tables, not any real transformation logic.

    Give more thought to the SCD suggestions.

    Architecture can have huge impacts on performance, scalability, and ease of maintenance and change.

    It will pay off much more when modeling more complex situations.
