Data lake architecture

I have had a lot of conversations with customers to help them understand how to design a data lake. I touched on this in my blog Data lake details, but that was written a long time ago, so I wanted to update it. I often find customers do not spend enough time designing their data lake and end up having to go back and redo the design and build-out because they did not think through all their use cases for data. So make sure you think through all the sources of data you will use now and in the future, and understand the size, type, and speed of that data. Then absorb all the information you can find on data lake architecture and choose the appropriate design for your situation.

A data lake should have layers such as:

  • Raw data layer – Raw events are stored for historical reference and are usually kept forever (immutable). Think of the raw layer as a reservoir that stores data in its natural and original state: unfiltered and unpurified. The advantages are auditability, discovery, and recovery. A typical example is if you need to rerun an ETL job because of a bug, you can get the data from the raw layer instead of going back to the source. Also called the bronze layer, staging layer, or landing area. Sometimes there is a separate conformed layer (or base layer) used after the raw layer to convert all the files to the same type, usually Parquet.
  • Cleansed data layer – Raw events are transformed (cleaned and mastered) into directly consumable data sets. Think of the cleansed layer as a filtration layer: it removes impurities and can also involve enrichment. The aim is to standardize how files are stored in terms of encoding, format, data types, and content (e.g. strings and integers). Also called the silver, transformed, integrated, or enriched layer.
  • Presentation data layer – Business logic is applied to the cleansed data to produce data ready to be consumed by applications (e.g. a data warehouse application or an advanced analytics process). The data is joined and/or aggregated, and can be stored in de-normalized data marts or star schemas. Also called the application, workspace, trusted, gold, secure, production-ready, governed, curated, or consumption layer.
  • Sandbox data layer – Optional layer to “play” in, usually for data scientists. It is usually a copy of the raw layer. Also called the exploration layer, development layer, or data science workspace.

Within each layer there will be a folder structure, which is designed based on factors such as subject matter, security, or performance (e.g. partitioning and incremental processing). Some good examples of this can be found in the doc Data lake zones and containers, and from one of my favorite bloggers, Melissa Coates (Coates Data Strategies): Zones in a Data Lake, Data Lake Use Cases and Planning Considerations, FAQs About Organizing a Data Lake, and the PowerPoint Architecting a Data Lake. Also, see my video Modern Data Warehouse explained for how data is moved between the layers.
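To make the layers and folder structure concrete, here is a minimal PySpark sketch of moving one day of orders from raw to cleansed to presentation, with date-based folders to support partitioning and incremental processing. The storage account name, container names, dataset, and columns are hypothetical placeholders, not a prescription for your design:

```python
# Minimal sketch of moving data through the layers (all names are placeholders).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("layer-flow").getOrCreate()

account = "mydatalake"
raw_path = f"abfss://raw@{account}.dfs.core.windows.net/sales/orders/2023/06/15/"
cleansed_path = f"abfss://cleansed@{account}.dfs.core.windows.net/sales/orders/"
presentation_path = f"abfss://presentation@{account}.dfs.core.windows.net/sales/daily_orders/"

# Raw layer: the source files were landed as-is (JSON in this example), in date folders.
raw_df = spark.read.json(raw_path)

# Cleansed layer: standardize types, remove bad rows, and store a uniform format (Parquet).
cleansed_df = (
    raw_df
    .withColumn("order_date", F.to_date("order_date"))
    .withColumn("sales_amount", F.col("sales_amount").cast("decimal(18,2)"))
    .dropna(subset=["order_id"])
    .dropDuplicates(["order_id"])
)
cleansed_df.write.mode("append").partitionBy("order_date").parquet(cleansed_path)

# Presentation layer: apply business logic (aggregate into a consumable, de-normalized table).
daily_orders = cleansed_df.groupBy("order_date").agg(
    F.count("order_id").alias("order_count"),
    F.sum("sales_amount").alias("sales_total"),
)
daily_orders.write.mode("overwrite").partitionBy("order_date").parquet(presentation_path)
```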

With rare exceptions, all layers use Azure Data Lake Storage (ADLS) Gen2. I have seen some customers use Azure Blob Storage for the raw layer because it is a bit cheaper, or, if there are huge demands on throughput, because it can be a good way to isolate ingestion workloads (using Blob Storage) from processing/analytics workloads (using ADLS Gen2).
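The two services are addressed with different URI schemes, which is how that isolation shows up in code. A small hypothetical example (the account and container names are placeholders):

```python
# Raw ingestion landing in Blob Storage uses the wasbs:// scheme and the blob endpoint.
blob_raw_path = "wasbs://raw@myingestaccount.blob.core.windows.net/sales/orders/"

# Processing/analytics on ADLS Gen2 uses the abfss:// scheme and the dfs endpoint.
adls_cleansed_path = "abfss://cleansed@mydatalake.dfs.core.windows.net/sales/orders/"
```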

Most times all these layers are under one Azure subscription. Exceptions are if you have specific requirements for billing, if you would hit a subscription limit, or if you want separate subscriptions for dev, test, and prod.

Most customers create an ADLS Gen2 storage account for each layer, all within a single resource group. This isolates the layers to help with performance predictability, allows different features and functionality at the storage account level (such as lifecycle management options or firewall rules), and avoids hitting storage account limits (e.g. the throughput limit).
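With a storage account per layer, a processing engine such as Spark needs to be able to reach each account. Here is a hypothetical sketch of wiring that up with account keys; the account names are placeholders, and in practice you would normally use a service principal or managed identity rather than keys:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multi-account-access").getOrCreate()

# One configuration entry per storage account (placeholder account names and keys).
spark.conf.set("fs.azure.account.key.rawaccount.dfs.core.windows.net", "<raw-account-key>")
spark.conf.set("fs.azure.account.key.cleansedaccount.dfs.core.windows.net", "<cleansed-account-key>")
spark.conf.set("fs.azure.account.key.presentationaccount.dfs.core.windows.net", "<presentation-account-key>")
```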

Most data lakes make use of Azure storage access tiers, with the raw layer using the archive tier, the cleansed layer using the cool tier, and the presentation and sandbox layers using the hot tier.
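Tiers are normally applied automatically through lifecycle management policies, but they can also be set directly on a blob. A hypothetical sketch with the azure-storage-blob SDK (the connection string, container, and blob names are placeholders):

```python
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-account-connection-string>")
blob = service.get_blob_client(container="raw", blob="sales/orders/2023/06/15/orders.json")

# Move an old raw file to the archive tier; note that archived blobs must be
# rehydrated back to hot or cool before they can be read again.
blob.set_standard_blob_tier("Archive")
```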

I recommend some type of auditing or integrity checks be put in place to make sure the data stays accurate as it moves through the layers. For example, if the data is finance data, create a query that sums up the day's orders (count and sales total) and verify the values are equal in all the data layers (and match the source data).
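A minimal sketch of such a check in PySpark, reusing the hypothetical paths and column names from the earlier example:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("layer-audit").getOrCreate()

cleansed = spark.read.parquet("abfss://cleansed@mydatalake.dfs.core.windows.net/sales/orders/")
presentation = spark.read.parquet("abfss://presentation@mydatalake.dfs.core.windows.net/sales/daily_orders/")

day = "2023-06-15"

# Recompute the day's totals from the cleansed layer...
cleansed_day = cleansed.filter(F.col("order_date") == day).agg(
    F.count("order_id").alias("order_count"),
    F.sum("sales_amount").alias("sales_total"),
).first()

# ...and compare them to what the presentation layer says.
presentation_day = (
    presentation.filter(F.col("order_date") == day)
    .select("order_count", "sales_total")
    .first()
)

assert cleansed_day["order_count"] == presentation_day["order_count"], "Order counts do not match"
assert cleansed_day["sales_total"] == presentation_day["sales_total"], "Sales totals do not match"
```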

A large majority of customers are using Delta Lake in their data lake architecture for the following reasons:

  • ACID transactions
  • Time travel (data versioning enables rollbacks, audit trail)
  • Streaming and batch unification
  • Schema enforcement
  • Support for the DELETE, UPDATE, and MERGE commands
  • Performance improvements
  • Solves the “small files” problem via the OPTIMIZE command (compact/merge)

Usually the raw data layer does not use Delta Lake, but the cleansed and presentation layers do. See Azure Synapse and Delta Lake for more info about Delta Lake.
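Here is a minimal sketch of a few of the Delta Lake features listed above, applied to the hypothetical cleansed-layer path from the earlier examples (a Spark session with the delta-spark package and Delta extensions enabled is assumed, and all paths and columns are placeholders):

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-features").getOrCreate()

cleansed_path = "abfss://cleansed@mydatalake.dfs.core.windows.net/sales/orders_delta/"

# MERGE: upsert an incremental batch into the cleansed Delta table as an ACID transaction
# (assumes the incoming files share the table's schema).
updates = spark.read.json("abfss://raw@mydatalake.dfs.core.windows.net/sales/orders/2023/06/15/")
target = DeltaTable.forPath(spark, cleansed_path)
(
    target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: read the table as of an earlier version (rollback and audit trail).
previous = spark.read.format("delta").option("versionAsOf", 0).load(cleansed_path)

# OPTIMIZE: compact small files (requires a Delta Lake version that supports OPTIMIZE).
spark.sql(f"OPTIMIZE delta.`{cleansed_path}`")
```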

More info:

Book “The Enterprise Big Data Lake” by Alex Gorelik

Building your Data Lake on Azure Data Lake Storage gen2 – Part 1

Building your Data Lake on Azure Data Lake Storage gen2 – Part 2

The Hitchhiker’s Guide to the Data Lake

Should I load structured data into my data lake?

What is a data lake?

Why use a data lake?

THE DATA LAKE RAW ZONE

Video Data Lake Zones, Topology, and Security
