Azure DWH Loading Architecture Dilemma

  • Designing Azure DWH loading infrastructure for client. Client's Data Architecture insists on Source (3NF) > STG(3NF) > ODS (3NF) > DWH (Facts/Dims) which is 4 distinct phases, ... so 3 phases using PaaS Azure SQL Database (Azure DWH when it get big)

    Initial POC, decided to co-hosted STG | ODS | DWH tables in a single Azure SQL DB, logically divided by schemas. Works fine for small data-set POC, but know will problematic in the future memory allocation, backup / restores, Possible disk contention at time of staging->DW tables. In addition there are 20+ more Mart type structures to come, and then the whole environment needs to be re-created for (dev, stg, test, uat).

    Question 1#: Would you (A) co-host STG | ODS | DWH tables in single DB for 20+ sources with multiple tables each using schema's ,.. or (B) break out into 3 separate Azure SQL DBs which all need to be managed, stopped/started, monitored. And then multiplied by the number of environments required totalling 15 Azure SQL DBs. Or (C) Dedicated DB accepting all sources daily incremental staging load (then truncated), dedicated ODS DB which retains on-going copy of all sources, finally individual Azure SQL DBS for each Data Mart which include stg, dim and fact tables.

    Question 2#: Would you create an dedicated cloud project for each environment (dev, test, stg, uat, prod) and insert insert into single Virtual network?

    Appreciate any advice you can give.

  • That's actually a pretty dense set of questions. Any advice I'm offering up here should be taken with a very large grain of salt.

    1) Heck no. Take Azure and everything else out of the picture. No, don't cram multiple functions into a single database. You will regret it. Yeah, you have a single thing to backup & restore, but, let's say staging goes horribly south. That means we're going to take production offline to fix staging. I suspect strongly that would be frowned on in most shops. I would absolutely break those out. It does add to your management overhead, but it makes recoverability very flexible and will absolutely increase the likelihood of keeping production online.

    2) Very hard question. In general, I would insist on separating Dev/QA/Staging from Production as domains, let alone machines. But to more directly answer your question you'd really need to evaluate the costs of managing these within Azure. I'd absolutely look into setting up pools for managing these servers. Beyond that, I'd really want a lot more information before I could make seriously concrete suggestions or even more vague guesses than I already have. Sorry I'm not being more helpful on this one.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Apologies about the loaded questions and appreciate you taking on the challenge Grant.

    With the advent of (MPP) capability within Azure SQL DWH was curious to field the question, is this a game changer due to the additional horse power, and can we host a Inmon enterprise model within a single DB albeit broken up into FGs and Schema's. But as you rightly noted, would raise new challenges about load phase management and backing out of a failed step.

    Another issue the PaaS databases option present is no cross database query. My ETL Developers still prefer the logic of multiple DBs, staging tables from one DB and MERGE them into the DIMs. another DB, hence the reason for puzzling whether combining all the DBs (STG, ODS, DWH) into a single database might work.

    In regarding to create multiple environments for (dev, test, stg, uat). I'm working through this TechEd for ideas. http://www.youtube.com/watch?v=bO3OU7BB8cY and factoring in you advice.

    Coincidentally, in addition to my Azure and warehousing projects, also implementing your DLM methodology for my client - so another big thanks!!!

    Appreciation from London (UK)

  • Martin Sherwood-301653 (1/11/2016)


    Apologies about the loaded questions and appreciate you taking on the challenge Grant.

    With the advent of (MPP) capability within Azure SQL DWH was curious to field the question, is this a game changer due to the additional horse power, and can we host a Inmon enterprise model within a single DB albeit broken up into FGs and Schema's. But as you rightly noted, would raise new challenges about load phase management and backing out of a failed step.

    Azure SQL DW is absolutely a game changer. You won't break it down the same way with FGs. It's much more a matter of just throwing the right sized machine at the problem. I don't have personal experience with this yet, but I'm working on getting it. I think it's one of the most exciting things coming out of Azure these days.

    Another issue the PaaS databases option present is no cross database query. My ETL Developers still prefer the logic of multiple DBs, staging tables from one DB and MERGE them into the DIMs. another DB, hence the reason for puzzling whether combining all the DBs (STG, ODS, DWH) into a single database might work.

    You can do cross database queries if you set your databases up within a data pool. Just got reminded of that today. So, if you really have to do that kind of querying (still not crazy about it), you can.

    In regarding to create multiple environments for (dev, test, stg, uat). I'm working through this TechEd for ideas. http://www.youtube.com/watch?v=bO3OU7BB8cY and factoring in you advice.

    Coincidentally, in addition to my Azure and warehousing projects, also implementing your DLM methodology for my client - so another big thanks!!!

    Appreciation from London (UK)

    Best of luck. Never hesitate to reach out.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hello Martin and Grant,

    Just to add to what Grant has stated with Azure SQL Data Warehouse there is no provisioning of files and filegroups. All that complexity has been taken care of during the provisioning process.

    The data warehouse is actually comprised of 60 storage distributions that contain all the user data. The Azure SQL DW engine sits on top of these storage distributions and basically interprets, cost optimizes and executes your query against the 60 storage distributions. By default data for a given table is spread using a round robin pattern. However, you can also optimize this distribution approach by using the HASH distribution method.

    As you rightly point out today the service doesn't support cross database joins. Therefore to get the logical separation it is best to use schemas. Environment separation is best handled using separate SQLDW databases. This way you can scale the environments appropriately. However, you can also consider holding source data in flat files under blob storage and simply create external tables for them. This is a great way to maintain history and provide you with opportunities to reload raw data at any time if needed. You may even find this preferable to running an d managing an ODS.

    I should also point out that the MERGE statement is not currently supported. Depending on the size of the operation you can perform an UPSERT on the data using a CTAS operation.

    CREATE TABLE dbo.DimProduct_upsert

    WITH

    (

    Distribution=HASH(ProductKey)

    , CLUSTERED INDEX (ProductKey)

    )

    AS

    SELECT

    CASE WHEN s.ProductAlternateKey IS NULL

    THEN p.ProductKey

    ELSE s.ProductKey

    END as ProductKey

    ,CASE WHEN s.ProductAlternateKey IS NULL

    THEN p.ProductAlternateKey

    ELSE s.ProductAlternateKey

    END as ProductAlternateKey

    ,CASE WHEN s.ProductAlternateKey IS NULL

    THEN p.EnglishProductName

    ELSE s.EnglishProductName

    END as EnglishProductName

    ,CASE WHEN s.ProductAlternateKey IS NULL

    THEN p.Color

    ELSE s.Color

    END as Color

    FROM DimProduct p

    FULL JOIN stg_DimProduct s ON p.ProductAlternateKey = s.ProductAlternateKey

    The resulting table can either be renamed to be the new table using the RENAME object syntax or you could opt for a partition switch instead.

    I hope that helps to answer your questions. If you have more please feel free to post them and I will try to do a better job of keeping an eye on the forum. 🙂

    KR, JRJ

  • Hi KR / Grant

    Thanks for more highly useful information which really helps people building DWH architectures within Azure.

    Common barriers encountered are the MERGE statement, cross database queries, IaaS vs PaaS from functionality requirements, Azure DB vs Azure DWH. Here is a potential workaround for the cross database queries using elastic database query.

    https://azure.microsoft.com/en-gb/blog/querying-remote-databases-in-azure-sql-db/

    Thanks once again

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply