March 30, 2012 at 3:08 pm
I just reviewed Staging Tables and the Tables in the Data Warehouse.
The Data Warehouse is not populated yet, but I thought it was odd that the staging tables do not have columns to capture when a record was inserted or which source system it came from, while the DW tables do.
The response that I got via email was that they do contain datetime columns, but I checked and they contain values such as '1100901', the date as it was entered into an AS400 database, and the date & time columns are defined as int.
We will have multiple source systems, SQL Server for example, and as far as I know the data should be consistent and use generic column names, not the names & data types of the first system that gets loaded.
I find this odd; any comments or feedback would be appreciated.
For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
March 30, 2012 at 3:16 pm
From The Data Warehouse Toolkit, Second Edition, by Ralph Kimball and Margy Ross, page 98:
It certainly would feel natural to include an effective date stamp on a dimension row with type 2 changes. The date stamp would refer to the moment when the attribute values in the row become valid or invalid in the case of expiration dates. Effective and expiration date attributes are necessary in the staging area because we'd need to know which surrogate key is valid when we're loading historical fact records. In the dimension table, these date stamps are helpful extras that are not required for the basic partitioning of history. If you use these extra date stamps, just remember that there is no need to constrain on the effective date in the dimension table in order to get the right answer. This is often a point of confusion in the design and use of type 2 slowly changing dimensions.
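To make that concrete, here is a minimal sketch of a Type 2 dimension carrying those date stamps; the table and column names are made up for illustration:

-- Hypothetical Type 2 customer dimension. Fact rows join on the surrogate key;
-- the date stamps just document when each version of the row was in effect.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey       INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    CustomerID        VARCHAR(20)  NOT NULL,                   -- natural key from the source
    CustomerName      VARCHAR(100) NOT NULL,
    RowEffectiveDate  DATETIME     NOT NULL,                   -- when this version became valid
    RowExpirationDate DATETIME     NULL                        -- NULL for the current version
);

As the quote points out, queries do not need to constrain on RowEffectiveDate to partition history correctly; the surrogate key on each fact row already points at the right version of the dimension row.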
For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
March 30, 2012 at 5:59 pm
Any ideas would be greatly appreciated.
Thank you.
For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
March 31, 2012 at 11:40 am
IIRC, the '1100901' is actually a full date with century included but things in the AS400 world are a wee bit different. I believe that the first 3 characters represent the number of years since the first of January, 1900. With that thought in mind, the following should be able to make a fairly fast conversion for you so you can update the DW from the Staging table(s) using the DATETIME datatype.
DECLARE @SomeAS400Date INT;
SELECT @SomeAS400Date = 1100901;
SELECT DATEADD(yy,@SomeAS400Date/10000,'1900'+RIGHT(@SomeAS400Date,4))
--Jeff Moden
Change is inevitable... Change for the better is not.
March 31, 2012 at 1:47 pm
Welsh Corgi (3/30/2012)
Any ideas would be greatly appreciated.
It all depends on the design and how the architect decided to track changes - I have seen cases where the DWH does not track changes, which in my opinion is not the most brilliant approach you can think of.
Assuming we are talking about a dimensional Data Warehouse, my personal favorite techniques for tracking changes are Type 2 SCD for the DIMensional tables and Fact Snapshots for the FACTual tables.
It is my opinion that the timestamps needed to accomplish the above-mentioned change-tracking techniques should come from the source systems, therefore they should be present on the Staging tables. If for any reason such timestamps do not exist on the source system, I would create them when loading the staging tables.
It is also my opinion that it is a good idea to track the source system - this is a function of the ETL process which should add the source system signature to every single staging row.
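A rough sketch of what I have in mind - the staging table, schema, and column names below are just placeholders:

-- Hypothetical staging table: the ETL stamps every row with when it landed
-- in staging and which source system it came from.
CREATE TABLE dbo.Staging_Customer
(
    SourceCustomerID VARCHAR(20)  NOT NULL,
    CustomerName     VARCHAR(100) NULL,
    SourceSystem     VARCHAR(10)  NOT NULL,                     -- e.g. 'AS400' or 'SQLSRV'
    LoadDateTime     DATETIME     NOT NULL DEFAULT (GETDATE())  -- when the row hit staging
);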
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
March 31, 2012 at 1:56 pm
Jeff Moden (3/31/2012)
IIRC, the '1100901' is actually a full date with century included but things in the AS400 world are a wee bit different. I believe that the first 3 characters represent the number of years since the first of January, 1900. With that thought in mind, the following should be able to make a fairly fast conversion for you so you can update the DW from the Staging table(s) using the DATETIME datatype.
DECLARE @SomeAS400Date INT;
SELECT @SomeAS400Date = 1100901;
SELECT DATEADD(yy,@SomeAS400Date/10000,'1900'+RIGHT(@SomeAS400Date,4))
Jeff, you are absolutely correct. I should have included that in my post.
I use the following function on AS400 Data that has been loaded to SQL Server Databases:
CREATE FUNCTION [dbo].[udf_ConvertDB2toSQL](@CSC_DATE varchar(7))
RETURNS DateTime
AS
BEGIN
    -- Converts a CYYMMDD AS400/DB2 date (C = 0 for 19xx, 1 for 20xx) to a
    -- SQL Server DATETIME, e.g. '1100901' becomes 2010-09-01.
    DECLARE @SQLDate Datetime
    SET @SQLDate = Stuff(
                       Stuff(
                           Stuff(@CSC_DATE, 6, 0, '-')      -- dash before the day (DD)
                           , 4, 0, '-')                     -- dash before the month (MM)
                       , 1, 1, CASE LEFT(@CSC_DATE, 1)      -- replace the leading century flag...
                                   WHEN '0' THEN '19'
                                   WHEN '1' THEN '20'
                               END                          -- ...with the century itself
                   )
    RETURN @SQLDate
END
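As a quick sanity check against the sample value from the earlier post:

SELECT dbo.udf_ConvertDB2toSQL('1100901');  -- returns 2010-09-01 00:00:00.000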
But since it is based on an OPENQUERY I would have to use your code or a variation of it.
I guess my question is: should this conversion be done on the load into staging, or from staging to the dimension tables?
Thank you very much.
For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
March 31, 2012 at 2:09 pm
PaulB-TheOneAndOnly (3/31/2012)
Welsh Corgi (3/30/2012)
Any ideas would be greatly appreciated.
It all depends on the design and how the architect decided to track changes - I have seen cases where the DWH does not track changes, which in my opinion is not the most brilliant approach you can think of.
Assuming we are talking about a dimensional Data Warehouse, my personal favorite techniques for tracking changes are Type 2 SCD for the DIMensional tables and Fact Snapshots for the FACTual tables.
It is my opinion that the timestamps needed to accomplish the above-mentioned change-tracking techniques should come from the source systems, therefore they should be present on the Staging tables. If for any reason such timestamps do not exist on the source system, I would create them when loading the staging tables.
Since I posted this, I got a response that the dimension tables do have datetime stamps and a source system column.
The so-called Data Warehouse does not use any prefix such as dim_, fact_, ods_, etc.
Most of the tables are relational and reflect the structure of the first source system, a legacy AS400 system. The same type of data is stored in a SQL Server database, which is structured quite differently.
I was told that the dimension tables contained a "Source_Code" column. I searched the system tables and was able to identify which tables were dimension tables; otherwise it was not that obvious.
The fact tables stuck out because they contain a number of FK's.
What I found weird is that there were so many tables that were some variation of a time dimension.
The design is not anything close to what I have seen anywhere: no indexes on any of the tables except the primary keys, and staging, data warehouse, and relational tables all mixed in the same database.
To recap, there are no datetime stamp columns recording when the data was loaded into the staging tables. Perhaps that is OK, but personally I would have added them.
The source system and effective start & end date columns are in the dimension tables only, nowhere else.
Perhaps this is OK, but I would have taken a different approach.
It is also my opinion that it is a good idea to track the source system - this is a function of the ETL process which should add the source system signature to every single staging row.
For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
March 31, 2012 at 2:22 pm
I packaged the incremental loads into SSIS packages (20 tables per package, 8 packages). When I asked about not having a datetime stamp to identify when the data was loaded, so that if one or more steps failed I could easily distinguish between the initial load and the incremental loads, I got the following response:
The incremental loads to staging can simply be restarted. Since it compares what is currently in staging to the source, it would basically start back up where it died and would not need to be restored.
Load times are slow because we have to deal with linked servers (which only have PK's).
IMHO it would be better to purge the records loaded via the incremental load as opposed to having a script doing a compare.
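To illustrate what I mean - assuming the staging rows carried a load stamp and source signature like the ones discussed above (all names here are hypothetical):

DECLARE @FailedBatchStart DATETIME = '20120331 02:00';  -- start time of the failed incremental run (made up)
DECLARE @SourceSystem     VARCHAR(10) = 'AS400';

-- With a load stamp and source signature on every staging row, backing out a
-- failed incremental run is just a purge of the rows that run loaded...
DELETE FROM dbo.Staging_Customer
WHERE LoadDateTime >= @FailedBatchStart
  AND SourceSystem  = @SourceSystem;
-- ...after which that incremental extract can simply be re-run.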
For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
March 31, 2012 at 3:03 pm
Welsh Corgi (3/31/2012)
IMHO it would be better to purge the records loaded via the incremental load as opposed to having a script doing a compare.
Agreed.
It is usual for jobs loading Staging tables to be re-runnable in case of a failure; the standard procedure is to truncate the offending Staging table then load it with whatever data gets extracted from the source system.
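In its simplest form the re-runnable step looks something like this - the staging table, linked server, and remote query are all made up for illustration:

-- Re-runnable staging load: clear the offending staging table, then pull the
-- extract from the source system again.
TRUNCATE TABLE dbo.Staging_Customer;

INSERT INTO dbo.Staging_Customer (SourceCustomerID, CustomerName, SourceSystem, LoadDateTime)
SELECT CUSTID, CUSTNAME, 'AS400', GETDATE()
FROM OPENQUERY(AS400_LINKED, 'SELECT CUSTID, CUSTNAME FROM MYLIB.CUSTOMER');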
_____________________________________
Pablo (Paul) Berzukov
Author of Understanding Database Administration available at Amazon and other bookstores.
Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
March 31, 2012 at 3:17 pm
PaulB-TheOneAndOnly (3/31/2012)
Welsh Corgi (3/31/2012)
IMHO it would be better to purge the records loaded via the incremental load as opposed to having a script doing a compare.
Agreed.
It is usual for jobs loading Staging tables to be re-runnable in case of a failure; the standard procedure is to truncate the offending Staging table then load it with whatever data gets extracted from the source system.
Paul,
In this case truncating is not an option. I have 127,084,590 records being loaded in the initial load from the first source system, and many tables are in the millions. I tried doing fast loads on all tables, but some failed, so I had to revise and go with a regular transformation. I got truncation errors on several tables, which turned out to be spurious, so I had to use OPENQUERYs encapsulated within stored procedures. The initial load from the 1st source system takes 12 hours.
Before it was handed off to me, it was taking 16.5 hours attempting to load a table with 13.5 million records, and it failed.
I have not been handed off the incrementals; that was supposed to happen COB Friday.
Is it not common practice to use generic staging tables, as opposed to duplicating them, when you have more than one source system with the same data in a different structure?
For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
April 1, 2012 at 7:49 am
The most common practice that I've personally seen is to load Staging tables in whatever the "format" of the original file is and then do what's necessary to get them into the final tables. Of course, most people aren't trying to load 127 million rows at a time, either. :-)
That, notwithstanding, if the generic Staging table is in the same "format" as the final table, you might be able to take advantage of SWITCH with table partitioning to instantly "load" the data from the Staging table to the final table. It would mean that you have to rebuild an empty staging table (the staging table would be "absorbed" by the final table) for each new import but that's virtually instantaneous, as well. To be clear, I've not tried this method but I thought I'd mention it as an option.
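Very roughly, and with made-up object names since I've not actually built this, the idea would be something like:

-- Requirements (roughly): identical column structure, same filegroup, and a
-- CHECK constraint on dbo.StagingTable that pins its rows inside the target
-- partition's boundary. The switch itself is metadata-only, so it is nearly
-- instantaneous, and it leaves the staging table empty for the next import.
ALTER TABLE dbo.StagingTable
    SWITCH TO dbo.FinalTable PARTITION 5;   -- partition number is made up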
--Jeff Moden
Change is inevitable... Change for the better is not.
April 1, 2012 at 7:51 am
Welsh Corgi (3/31/2012)
Load times are slow because we have to deal with linked servers (which only have PK's).
If that's the case, wouldn't it be better if they exported the data into a file and then you use BULK INSERT to load it?
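Something along these lines - the file path, terminators, and staging table are made up, and the export file's columns are assumed to match the table - is usually far faster than pulling rows across a linked server:

-- Load a flat file exported from the source system straight into staging.
BULK INSERT dbo.Staging_Customer
FROM '\\FileServer\Extracts\Customer_20120331.txt'   -- hypothetical export file
WITH (
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    TABLOCK          -- allows minimal logging when the usual conditions are met
);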
--Jeff Moden
Change is inevitable... Change for the better is not.
April 1, 2012 at 8:02 am
Jeff Moden (4/1/2012)
Welsh Corgi (3/31/2012)
Load times are slow because we have to deal with linked servers (which only have PK's).
If that's the case, wouldn't it be better if they exported the data into a file and then you use BULK INSERT to load it?
I thought that it would be faster as well.
In fact, there is a nightly process that updates SQL Server databases from the same source, which is owned by a very large company.
I suggested that they look at that and try to piggyback on it. The load to the DW is competing with the other vendor's load.
I pointed them to the database that is used and the table that tracks what has changed at the contract level. The vendor does not track what has changed in all of the other tables; they simply reload them.
I suggested that they not use linked servers, replaced most of the OPENQUERYs, and reduced the first initial load from over a week to just over 12 hours.
All I can do is make suggestions.
Thanks.:-)
For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
April 1, 2012 at 8:07 am
Jeff Moden (4/1/2012)
That, notwithstanding, if the generic Staging table is in the same "format" as the final table, you might be able to take advantage of SWITCH with table partitioning to instantly "load" the data from the Staging table to the final table. It would mean that you have to rebuild an empty staging table (the staging table would be "absorbed" by the final table) for each new import but that's virtually instantaneous, as well. To be clear, I've not tried this method but I thought I'd mention it as an option.
I tried to convince them to buy the Enterprise Edition of SQL Server 2008 R2 x64 but they did not want to spend the money. :-(
For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
April 1, 2012 at 8:56 am
Not having the Enterprise Edition might not be a showstopper. Partitioned views can be almost as effective if you can create the right constraint that each table would require for such a thing.
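A bare-bones illustration of the idea - the tables, the partitioning column, and the yearly split are all invented:

-- Each member table carries a CHECK constraint on the partitioning column, and
-- that column is part of the primary key, so the view is updatable and the
-- optimizer can target just the relevant table.
CREATE TABLE dbo.Sales_2011
(
    SaleID   INT      NOT NULL,
    SaleYear SMALLINT NOT NULL CHECK (SaleYear = 2011),
    Amount   MONEY    NOT NULL,
    CONSTRAINT PK_Sales_2011 PRIMARY KEY (SaleID, SaleYear)
);

CREATE TABLE dbo.Sales_2012
(
    SaleID   INT      NOT NULL,
    SaleYear SMALLINT NOT NULL CHECK (SaleYear = 2012),
    Amount   MONEY    NOT NULL,
    CONSTRAINT PK_Sales_2012 PRIMARY KEY (SaleID, SaleYear)
);
GO

CREATE VIEW dbo.Sales
AS
SELECT SaleID, SaleYear, Amount FROM dbo.Sales_2011
UNION ALL
SELECT SaleID, SaleYear, Amount FROM dbo.Sales_2012;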
--Jeff Moden
Change is inevitable... Change for the better is not.