data flow question

  • Hi all,

    I have a quick question about data flow process best practice as follows:

    I am pulling data from various data source using SQL code (SP & SSIS PKG) to populate a DW

    (Dimension and Fact tables).

    The way it is done is as follows:

    I created a schema called "source" to hold incoming data from various data sources as is, another schema called "Stage" to hold the massaged data (after applying business rules) then a third schema called "DW" to hold dimension and fact tables in a final format for reporting.

    So having 3 schemas in one database? is this the right thing to do or should I have a 3 separate database completely?

    Thanks for your advise.

  • This is a pretty large and broad topics, but here is just my high-level take.

    There is nothing wrong with having a staging area that allows your transform and validate your data before it lands in your data warehouse within the model and schema you choose. Keeping copies of each data source in another schema or database is a waste when you can simply keep information about each source in a metadata repository.

    Focus on the ETL that unifies all data sources together that makes up your data warehouse. Then focus on what you need to store and how you will store the metadata around each source, the business rules for those sources and what each source ultimately transforms into when it lands in your data warehouse.

    Some helpful tips when it comes to staging is to always keep it as a separate database either within the same server or on an entirely different server to help augment the pressure ETL has on the data warehouse environment. You can either stick with one database for all data sources or create a source bus, which is a database per source that all feed into the ETL that feeds into your data warehouse.

    Some helpful tips when it comes to metadata is to understand there are hundreds of data points to track. Try to keep your metadata tied to these three topics per Ralph Kimball: Business, Technical and Process Execution metadata. Business being the description of data in a business sense, Technical being the technical aspects of the data (attributes) such as data types, lengths, lineage and results of data profiling, and Process Execution being the statistics on the results of running the ETL process itself such as rows loaded successfully, rows rejected, rows loaded per batch, etc.

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply