Serving layers with a data lake

Data lakes typically have three layers: raw, cleaned, and presentation (also called bronze, silver, and gold if using the medallion architecture popularized by Databricks). I talk about this in my prior blog post on Data lake architecture. Many times, companies will create a fourth layer outside of the data lake that I call the relational serving layer. I’ve been having conversations recently with companies about the need for another type of fourth layer, which I will call the physical serving layer. In this blog post I’ll discuss both the relational serving layer and the physical serving layer.

Typically, once data is in the presentation layer, it is “ready to go”. From there the data can be shared via a tool like Azure Data Share, or an end-user can access the data in the presentation layer directly with a tool like Power BI.

Because a data lake is schema-on-read, the schema is applied to the data at the time of reading, rather than beforehand. A data lake is a folder-file system with no built-in context for what the data is. That is different from a relational database, where a metadata layer sits on top of the data and is tied directly to it. So, you might want to create a “serving layer” on top of the data in the data lake that ties metadata directly to the data. To make it easier for an end-user to find and understand data, you will likely want to present the data in the form of a relational data model, hence creating a relational serving layer on top of the data. If done correctly, the end-user will have no idea they are actually pulling data from a data lake; they will think it is from a relational data warehouse.
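To make schema-on-read a bit more concrete, here is a minimal Python sketch. The file contents, the column names, and the read_with_schema helper are all hypothetical and only for illustration; a real lake would use an engine such as Spark or a serverless SQL pool rather than the standard library.

```python
import csv
import io
from datetime import date

# Hypothetical raw file contents as they might sit in the lake: plain text, no types.
RAW_SALES_CSV = """order_id,order_date,amount
1001,2024-01-15,250.00
1002,2024-01-16,99.50
"""

# The schema lives with the reader, not with the file (schema-on-read).
SALES_SCHEMA = {
    "order_id": int,
    "order_date": date.fromisoformat,
    "amount": float,
}


def read_with_schema(text, schema):
    """Parse CSV text and cast each column using the supplied schema."""
    reader = csv.DictReader(io.StringIO(text))
    return [{col: cast(row[col]) for col, cast in schema.items()} for row in reader]


rows = read_with_schema(RAW_SALES_CSV, SALES_SCHEMA)
```

The file itself never changes. A different reader could apply a different schema to the same bytes, which is exactly why the end-user benefits from a serving layer that pins one agreed-upon schema to the data.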

The relational serving layer can be created in many ways, such as a SQL view, a dataset in a reporting tool like Power BI, an Apache Hive table, or an ad-hoc SQL query. With this layer you can also define the relationships between files if you need to join more than one together. This is a common practice because defined relationships do not exist within a data lake, as each file is its own isolated island.

As an example of using a relational serving layer, many companies will create SQL views on top of files in the data lake, then use a reporting tool to call those views, making it easy for an end-user to create reports or dashboards because it gives the end-user a “relational model” on top of the data instead of seeing everything in a folder-file format.
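As a rough sketch of that idea, the Python snippet below uses an in-memory SQLite database as a stand-in for whatever SQL engine sits over the lake. The two tables play the role of two unrelated files, and all table, column, and view names are hypothetical. The point is only that the view, not the files, carries the column names and the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-ins for two unrelated files in the presentation layer (hypothetical names).
conn.executescript("""
CREATE TABLE sales_file (order_id INTEGER, customer_id INTEGER, amount REAL);
CREATE TABLE customers_file (customer_id INTEGER, customer_name TEXT);

INSERT INTO sales_file VALUES (1001, 1, 250.0), (1002, 2, 99.5), (1003, 1, 40.0);
INSERT INTO customers_file VALUES (1, 'Contoso'), (2, 'Fabrikam');

-- The view is the relational serving layer: it names the columns and
-- defines the relationship that the files themselves do not carry.
CREATE VIEW v_sales_by_customer AS
SELECT c.customer_name, SUM(s.amount) AS total_amount
FROM sales_file AS s
JOIN customers_file AS c ON c.customer_id = s.customer_id
GROUP BY c.customer_name;
""")

# A reporting tool would issue a query like this against the view.
report = conn.execute(
    "SELECT customer_name, total_amount FROM v_sales_by_customer ORDER BY customer_name"
).fetchall()
```

The report author only ever sees v_sales_by_customer and never needs to know which folders or files sit underneath it.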

Another type of serving layer could be a physical serving layer. This is where data is copied from the presentation layer in the data lake to one or more products as a way to make it easier for end-users to access: products such as Azure Cosmos DB, Azure SQL Database, or a graph database, just to name a few. This is to help satisfy different lines of business that have different needs for the data. For example, say a department within your company is very familiar with Azure SQL Database and has built applications and uses reporting tools that go against data in an Azure SQL Database. Instead of having that department pull data from the data lake into an Azure SQL Database that resides within the department, IT does it for them, but puts the data into an Azure SQL Database that IT maintains in the physical serving layer (you could think of that data as a “data mart”). This approach eliminates the work needed by the department to get the data ready, and they can spend that extra time getting more insights into the data.
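Here is a minimal sketch of that copy step in Python, assuming a small curated CSV in the presentation layer and using an in-memory SQLite database as a stand-in for the IT-managed Azure SQL Database. The file contents, the sales table, and the load_into_serving_db helper are hypothetical; a real pipeline would more likely use a tool such as Azure Data Factory.

```python
import csv
import io
import sqlite3

# Hypothetical curated file from the presentation (gold) layer.
GOLD_SALES_CSV = """order_id,region,amount
1001,East,250.00
1002,West,99.50
1003,East,40.00
"""


def load_into_serving_db(csv_text, conn):
    """Copy presentation-layer rows into a table in the serving database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    rows = [
        (int(r["order_id"]), r["region"], float(r["amount"]))
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    # INSERT OR REPLACE keeps the load idempotent if the copy is rerun.
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


serving_db = sqlite3.connect(":memory:")  # stand-in for an IT-managed database
copied = load_into_serving_db(GOLD_SALES_CSV, serving_db)

# The department queries the serving database, never the lake itself.
east_total = serving_db.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'East'"
).fetchone()[0]
```

Keying the table on order_id means IT can rerun the copy on a schedule without creating duplicate rows in the serving database.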

So, what has to be decided is when IT should do the work to get data into the format that is needed, and when a department or end-user should do it. One consideration would be how many people would use the data in the physical serving layer: just a few people in one department in your company, or many people across multiple departments? Does IT have the resources and availability to do this work, or would they become a bottleneck because their resources are constrained?

You should also consider the benefits of having a physical serving layer:

  • It can help control costs, especially if multiple departments need the data in this new format, since one shared copy prevents each department from storing its own duplicate
  • It could help improve query and report performance since IT likely has better expertise and can tune the resulting data in the physical serving layer
  • It helps with data lineage since the physical serving layer is controlled by IT as opposed to a department taking the data and storing it in a place IT may not have access to
  • It helps with data governance and security since IT can incorporate the physical serving layer into their governance policies and security environment as opposed to hoping each department has the proper governance and security in place

Previously I created a short YouTube video (20 minutes) that is a whiteboarding session that describes the five stages (ingest, store, transform, model, visualize) that make up a modern data warehouse (MDW) and the Azure products that you can use for each stage. You can view the video here. The relational serving layer and physical serving layer change things at the model stage: instead of it being an RDBMS such as an Azure Synapse dedicated SQL pool that all departments and end-users will access, it will be a relational serving layer that removes the need to copy the data into an RDBMS, giving you a true data lakehouse. You still might utilize a physical serving layer that does have an RDBMS, but that would be for specific departmental use cases that IT is creating for them instead of the department having to build and maintain it themselves.

One thing to point out is that within the presentation or gold layer you could have multiple copies of the same data in different formats, such as third normal form or star schemas, Microsoft Fabric lakehouses or warehouses, or copies to help with performance or cost savings via features like object tiering. These all live within the data lake, as opposed to the physical serving layer, which is outside the data lake. Hopefully I’m not confusing things too much 🙂

I’d be remiss if I did not mention that you need to pay close attention to data governance, which will be even more important when you have serving layers. Using a product like Microsoft Purview is the easy part. It’s implementing best practices for data governance that will be challenging and time consuming.

I hope this helps to give you some ideas on how to make things better for the consumers of the data lake!

The post Serving layers with a data lake first appeared on James Serra's Blog.
