Good to have Multiple Data Warehouses?

  • Interesting debate came up on data warehousing. Is it good to have multiple data warehouses or just one?

    To add some context to avoid the ever, "it depends" answer. Let's say you have a data warehouse already built that supports a number of products that are both for reporting purposes and non-reporting purposes. Then another product comes up that is for more analytical purposes that requires data in a more granular fashion and slightly different set of data requirements than what the current data warehouse can support (i.e.: one is very normalized, the other wants to be very denormalized). Does it make sense to split and have two data warehouses supporting different products and use cases between multiple teams?

    The arguments here are pretty valid. One data warehouse, one source of truth. They are both using the same data, just stored and used differently. But, there are a number of organizations that have released papers on multiple data warehouse scenarios that basically state, it's hard to satisfy multiple products and use cases with one single solution. So why restrict your data potential and or constantly have to rewrite your existing models when new things come up?

    Personally, I see no problem with multiple data warehouses if there is justification and they are both not fighting for resources and causing contention on the target data streams. If they are supporting totally different data requirements, it seems logical to have the split and allow them to grow separately while still working together. Ideally, it would be great to stick to one data warehouse where if you needed separate data specific needs, it would result in a new data mart to be developed for those needs. Yet, that still requires the data warehouse to support those needs at the source.

    Is there ever a point where you stop trying to meet all data requirements with your data warehouse and just split?

  • Whenever sets of requirements (Subject/Business) are incompatible with other sets of requirements (Subject/Business), it is appropriate to subset the data into Subject/Business specific Data Marts. My preference is a single Enterprise Data Warehouse as the single source of the truth for the individual Data Marts, that is all Data Marts are fed from the same source.

    😎

  • To wit, I frequently go with the "no data warehouse" option in favor of just writing some streamlined code to solve actual problems instead of a system to handle all problems that have yet to be realized. Sometimes it really is better to go with a dedicated solution rather than trying to catch all possibilities in a generic fashion.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (11/12/2016)


    To wit, I frequently go with the "no data warehouse" option in favor of just writing some streamlined code to solve actual problems instead of a system to handle all problems that have yet to be realized. Sometimes it really is better to go with a dedicated solution rather than trying to catch all possibilities in a generic fashion.

    Yeah, that's what I was thinking too. It would still lead to multiple data management systems though, whether you called them a warehouse or not. But it's hard to create specialized data marts if the house does not support the requirements. You constantly have to update and conform in order to have one solution be the solution to all.

  • xsevensinzx (11/13/2016)


    Jeff Moden (11/12/2016)


    To wit, I frequently go with the "no data warehouse" option in favor of just writing some streamlined code to solve actual problems instead of a system to handle all problems that have yet to be realized. Sometimes it really is better to go with a dedicated solution rather than trying to catch all possibilities in a generic fashion.

    Yeah, that's what I was thinking too. It would still lead to multiple data management systems though, whether you called them a warehouse or not. But it's hard to create specialized data marts if the house does not support the requirements. You constantly have to update and conform in order to have one solution be the solution to all.

    I agree with the "no data warehouse" if and when it fits the bill, obviously sometimes it does and sometimes it doesn't, correlation of different reporting from those systems is much more error prone and much harder than maintain a single data warehouse. A good example of the latter are situations when working with multiple diverse systems which all hold a slice of the cake but none has the full picture. On the flip side, regulatory and legislative requirements must also be met which often can complicate the picture, certain data cannot be stored, processed or analysed with other data. That situation can force the data the business to use multiple warehouses, i.e. telecom company cannot use Mobile or POTS data for Internet marketing etc.

    😎

  • I believe in the one source of truth, as the ideal. From many sources one place to store. It also keeps enforcement on data standards easy since you are working in only one location. I think you are describing reporting across different grains. Here I take the lowest grain possible since it is safe to assume that future requirements are not fully known. If need be you can have snapshots of the data at particular points in time (usually month ends for a business) that aggregate on the grain that you want. You can segment different snapshots to different business areas through using schemas.

    ----------------------------------------------------

  • In my environment I have many different departments that need to be kept separate from each other, yet share some data elements such as calendars and employee profiles. I work for a tribal government, which means I have to deal with various regulations re: data confidentiality and access. I implement just one data warehouse (SQL Server database) to store all the data, and then I use other databases that just hold views to expose the data to various end user applications. For example, let's say the accounting department and the health clinic both need access to employee data and to the calendar, but each have their own fact tables.

    In the data warehouse, you would see:

    Core_Dim_Employee

    Core_Dim_Date

    Accounting_Fact_GrantHours

    Health_Fact_PatientEncounters

    There would also be two separate databases: HealthDM and AccountingDM (DM standing for data mart)

    HealthDM would have these views:

    Core_Dim_Employee (WHERE Division='Health' to only include Health division employees)

    Core_Dim_Date

    Health_Fact_PatientEncounters

    AccountingDM would have these views:

    Core_Dim_Employee

    Core_Dim_Date

    Accounting_Fact_GrantHours

    For me, this allows me that single source of truth and place to enforce data standards, and for the end user, they can only see the data that they need to see and that interests them. They can point Excel or whatever other tool they want to use at their DM and get the data they need without seeing the other departments' data. If a end user needs a cube built, then I use the appropriate DM for the data source that feeds the cube. Because the DMs only hold views, the data is always matching the data in teh data warehouse.

  • xsevensinzx (11/8/2016)


    Interesting debate came up on data warehousing. Is it good to have multiple data warehouses or just one?...

    Personally, I see no problem with multiple data warehouses if there is justification and they are both not fighting for resources and causing contention on the target data streams. If they are supporting totally different data requirements, it seems logical to have the split and allow them to grow separately while still working together. Ideally, it would be great to stick to one data warehouse where if you needed separate data specific needs, it would result in a new data mart to be developed for those needs. Yet, that still requires the data warehouse to support those needs at the source.

    Maybe we need some clarification on what you mean by multiple data warehouses? In general, a data warehouse can consist of multiple fact tables, each of which relates to multiple dimensions. A fact table is usually modeled based on some business process.

    Offhand it sounds like you just need a different fact table within the same data warehouse. You probably would not want to change the grain of an existing fact table as that would affect all users who currently do reporting and analysis on it. The new fact table will likely be able to reuse some of your existing dimensions though.

    There is lots of good information about designing data warehouses available from the Kimball group:

    http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/

  • JesseBizInt (12/12/2016)


    In my environment I have many different departments that need to be kept separate from each other, yet share some data elements such as calendars and employee profiles. I work for a tribal government, which means I have to deal with various regulations re: data confidentiality and access. I implement just one data warehouse (SQL Server database) to store all the data, and then I use other databases that just hold views to expose the data to various end user applications. For example, let's say the accounting department and the health clinic both need access to employee data and to the calendar, but each have their own fact tables.

    In the data warehouse, you would see:

    Core_Dim_Employee

    Core_Dim_Date

    Accounting_Fact_GrantHours

    Health_Fact_PatientEncounters

    There would also be two separate databases: HealthDM and AccountingDM (DM standing for data mart)

    HealthDM would have these views:

    Core_Dim_Employee (WHERE Division='Health' to only include Health division employees)

    Core_Dim_Date

    Health_Fact_PatientEncounters

    AccountingDM would have these views:

    Core_Dim_Employee

    Core_Dim_Date

    Accounting_Fact_GrantHours

    For me, this allows me that single source of truth and place to enforce data standards, and for the end user, they can only see the data that they need to see and that interests them. They can point Excel or whatever other tool they want to use at their DM and get the data they need without seeing the other departments' data. If a end user needs a cube built, then I use the appropriate DM for the data source that feeds the cube. Because the DMs only hold views, the data is always matching the data in teh data warehouse.

    You mention a separate database to hold your views yet you are referencing the accountingDM database as a location for views as well. To me when you need to create a join across databases you have to decide where you need to store that view if they are not all on a dedicated database. Here you could run into the situation where you would need to store two versions of the same view across two databases due to security restrictions if you wanted more than one user to access that view (if they each only had access to one of the respective databases in the view but both needed access to the information in the view).

    I would take a look at schemas within a database, just my preference.

    ----------------------------------------------------

  • Schemas would work just as well in this example. I opted for the separate databases because it was easier to explain to management in my organization rather than getting into how schemas keep things separate, and the permissions are a bit easier (not much, just a bit) to manage. I also forgot to mention that the enterprise data warehouse is named EnterpriseDW so it is seperate from AccountingDM and HealthDM. The DM databases live on the same server as the DW, and yes, some views are redundant so I just script the view as Create To, copy and paste the code, and adjust the filters to deploy.

    If I ran into a situation where I needed to change the query definition, I could script that via T-SQL with CREATE VIEW statements to push that out. Unfortunately, politics plays a much larger role in some decisions than we would like.

  • Maybe we need some clarification on what you mean by multiple data warehouses? In general, a data warehouse can consist of multiple fact tables, each of which relates to multiple dimensions. A fact table is usually modeled based on some business process.

    Agreed. Is this in reference to a single .sln that contains multiple cubes that have some overlap in sharing dimension objections? A cube that contains multiple fact tables to accommodate related items at a different granularity?

    In general, it would seem to me better to have all the cubes has part of one solution. Each cube contains fact tables that share dimensions to a high degree and are properly related.

  • RonKyle (12/12/2016)


    Maybe we need some clarification on what you mean by multiple data warehouses? In general, a data warehouse can consist of multiple fact tables, each of which relates to multiple dimensions. A fact table is usually modeled based on some business process.

    Agreed. Is this in reference to a single .sln that contains multiple cubes that have some overlap in sharing dimension objections? A cube that contains multiple fact tables to accommodate related items at a different granularity?

    In general, it would seem to me better to have all the cubes has part of one solution. Each cube contains fact tables that share dimensions to a high degree and are properly related.

    Sure.

    I think the best way to explain this is having two data warehouses that support different use cases, one that supports production ready applications on top of canned reporting where the other is more analytical used for data science and machine learning.

    The other big comparison would be Inmon versus Kimball. One is very normalized to support applications with reporting and the other is more denomalized to support analytical reporting.

    Coming together can happen, but it requires a great deal of work and investment. On one side, you have a very normalized less grainular data warehouse and the other side, you have a very large granular data warehouse 3x the size.

    Both are using the same or similar data sources, but different models and use cases. On top of that, both operate differently. One as you can imagine, it is very tied down to ensure performance for those applications where the other is more flexibility with ability to change data, add data and do more to support the custom ad-hoc analytical requests.

    So, it's not as easy as just adding one additional schema to an existing database, especially due to the size. It would almost result in a separate instance regardless of the approach. So why wouldn't two data warehouses make more sense, but maybe feeding each other?

  • Yeah, I've looked into this too. As this topic is pretty old, some more conclusions on my end have been drawn. I kind of started this on more of data model issues, but its beyond just the data model. It's also the engine (SMP versus MPP) as well the business process (time to add new data to a traditional data warehouse). All of these have to really align or be forced for certain approaches to be fully adapted. In a world that is constantly trying to treat SQL Server as a hammer where everything is a nail, this is what is often leading to more than one data warehouse or specifically, source of truth.

    Therefore, what I've concluded is not necessarily what I was thinking of another data warehouse, but actually a simple data lake that sits before the traditional data warehouse. Another piece of technology that allows for different tools for different data problems that allows the business to ingest data fast to where they can start exploring data as soon as it becomes available. Then once that data is realized, then it can be properly cleaned, conformed and dumped into a traditional data warehouse where that tool can make that data sing.

    I think trying to make one tool handle all of this kind of makes it all become well-- a hot mess.

  • After posting 'Data Vault methodology', I found a webpage that says Data Vault is not the ideal technology to use. 
    Rather stick with Kimball.

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply