I am starting to see this relatively new phrase, “Data Lakehouse”, being used in the data platform world. It’s the combination of “Data Lake” and “Data Warehouse”. In this post I’ll give my thoughts on it, and how the next version of Azure Synapse Analytics that is in public preview fits right in with the Data Lakehouse.
First, I want to clear up a bit of confusion regarding Azure Synapse Analytics. If you login to the Azure portal and do a search, you will see:
Notice that Azure SQL Data Warehouse (SQL DW) is now called Azure Synapse Analytics. This product has been generally available (GA) for a while and is simply a renaming. It contains new features and soon-to-be new features that I blogged about here, but this is separate from other features in a public preview version that I will call “Synapse PP” (see the new docs). I’ll call the current GA version “Synapse GA”. I blogged about this here.
I like to think a major feature of a Data Lakehouse as the ability to use T-SQL to query data the same way no matter whether the data is in a data lake (storage such as ADLS Gen2) or a data warehouse (a relational database such as Synapse GA).
Here are some of the major new Synapse PP features that support a Data Lakehouse:
- Azure Synapse Studio
- Collaborative workspaces
- Apache Spark
- On-demand T-SQL, see my blog on this at SQL on-demand in Azure Synapse Analytics
- ADF integration
- Power BI integration
- Distributed T-SQL (over ADLS Gen2)
There are some great benefits to having a data lake that I blogged about here. And there are many reasons to also have a relational data warehouse along with your data lake that I blogged about here. Up until now, the data lake and the relational data warehouse where sort of on their own island, having to jump to a different product to interface with each. For example, your data lake could be in ADLS Gen2 and you can use Azure Databricks to query it with SparkSQL (first setting up a connection to ADLS Gen2), while your relational data warehouse could be in Synapse GA and you use SSMS to query it with T-SQL. Having to use Azure Data Factory or Power BI means opening up another tab in your browser, logging into those products, and connecting to ADLS Gen2. So it can be time consuming and costly to create an environment to query both, and also be confusing to query using two different versions of SQL.
You can now get the best of both worlds all under one roof via Azure Synapse Studio: using the relational data warehouse when you need fast query performance, high user concurrency, enhanced security, or just prefer working in a relational environment; or using the data lake when you have huge data volumes, need instant access to the data via schema-on-read, or need to deal with semi-structured data. You never have to leave your workspace no matter which tool you want to use, and can use T-SQL no matter if the data is in the data lake or in the relational data warehouse. And querying a file in the data lake is simply a matter of right-clicking the file (more on this in my next blog).
So at a high level you can view Azure Synapse Studio as supporting business intelligence, data science, T-SQL plus other languages (Python, Scala, C#, SparkSQL), schema flexibility, ACID compliance, any data type, petabytes of data, streaming and batch, with high performance and user concurrency.
So as you can see, Synapse PP combines the use of a data lake and a relational database to make it a Data Lakehouse. It is not really a “new” technology, but just a merging of existing technologies to make it easier to gain insights from data in order to make better business decisions.
With these new features, especially SQL on-demand and T-SQL against ADLS, I am starting to see a few use cases where you may not need a relational database anymore in your modern data warehouse, which goes against my previous way of thinking (see Is the traditional data warehouse dead?). I still feel you will want to have a relational database in your modern data warehouse architecture a large majority of the time, but there will be exceptions.
For example, I can land a bunch of parquet files into the data lake and create a T-SQL view on top of that data, where that view is stored in SQL on-demand. Then I call that view from Power BI making it appear to the end-user that they are using a relational database and at the same time they are only paying when the query is being run. So that leads to compute costs savings as well as not having to copy the data from the data lake to a relational database. This means less time is needed to build a solution and there is a reduction in the complexity of the solution and therefore additional cost savings. Below are some of the concerns that arise out of skipping the relational database and the new features that “counter” some of those concerns:
- Speed: Queries against a relational storage will always be faster than against a data lake because of the lack of statistics, query plans, and indexes in the data lake. Counter: DirectParquet, CSV 2.0, query acceleration, predict pushdown, and sql on-demand auto-scaling are some of the features that can make queries against ADLS be nearly as fast as a relational database. Then there are features like Delta lake and the ability to use statistics for external tables that can add even more performance. Plus you can also import the data into Power BI, use Power BI aggregation tables, or import the data into Azure Analysis Services to get even faster performance
- Security: Row-level security (RLS), column-level security, and dynamic data masking are security-related features that are not available in a data lake. Counter: User RLS in Power BI or RLS on external tables instead of RLS on a database table, which then allows you to use result set caching in Synapse
- Complexity: Schema-on-read (ADLS) is more complex to query than schema-on-write (relational database). Schema-on-read means the end-user must define the metadata, where with schema-on-write the metadata was stored along with the data. Then there is the difficulty in querying in a file-based world compared to a relational database world. Counter: Create a SQL relational view on top of files in the data lake so the end-user does not have to create the metadata, which will make it appear to the end-user that the data is in a relational database. But I still see it being very difficult to manage a solution with just a data lake when you have data from many sources
- Missing features: Auditing, referential integrity, ACID compliance, updating/deleting rows of data, data caching, full support of T-SQL are not available in a data lake. Counter: many of these features can be added when using a Delta Lake, but will not be as easy to implement as a relational database
Some additional reasons for using a relational database include helping to break up a large problem into smaller pieces (think of the relational database as an additional step in the data journey from a data lake to the end-user), and the need to handle slowly changing dimensions (SCD). Using time travel in the Delta Lake or creating daily folders for each table in the data lake are options for handling SCD but much more complex than in a relational database world. I also see some customers who have source systems that are relational databases and as they are in the early stages of building a modern data warehouse in Azure they will bypass the data lake and copy the relational data right into a relational data warehouse (especially if they have many existing SSIS packages that they are using for their on-prem data warehouse and they want to continue using those for the short-term). They do this for quick wins and eventually will land the data in the data lake.
A strong reason for skipping a relational database is because, for the short term, SQL on-demand is not able to access a relational database, so you can only use this feature against a data lake. So if you want to save costs by using SQL on-demand it would have to be against data in the data lake. This could be especially true if you have a “small” data warehouse.
I’m still thinking through this, but these are my initial thoughts and it will be interesting to have discussions with customers on this topic when Synapse PP goes GA. Please comment below on your thoughts!
I’ll be presenting and doing a demo of the public preview of Azure Synapse Analytics at my sessions at European Digital Week on 9/25 (session info), SQL Bits on 10/3 (session info), PASS Summit on 11/10 (session info), and Big Data Conference Europe on 11/25 (session info).