Blog Post

Azure Synapse and Delta Lake


Many companies are seeing the value in collecting data to help them make better business decisions. When building a solution in Azure to collect the data, nearly everyone is using a data lake. A majority of those are also using delta lake, which is basically a software layer over a data lake that gives additional features. I have yet to see anyone using competing technologies to delta lake in Azure, such as Apache Hudi or Apache Iceberg (see A Thorough Comparison of Delta Lake, Iceberg and Hudi and Open Source Data Lake Table Formats: Evaluating Current Interest and Rate of Adoption).

The reasons most are using delta lake is because of the following features that delta lake provides over just using a data lake (with supporting the MERGE statement the biggest one):

  • ACID transactions
  • Time travel (data versioning enables rollbacks, audit trail)
  • Streaming and batch unification
  • Schema enforcement
  • Supports commands DELETEUPDATE, and MERGE
  • Performance improvement

Fortunately most Azure products now support delta lake, such as:

However, some products or features do not support delta lake (at least not yet), so I wanted to make you aware of those:

Serverless SQL pools do not support updating delta lake files. Use Azure Databricks or Apache Spark pools in Azure Synapse Analytics to update Delta Lake.

Within Power BI, there is a connector for Synapse (called “Azure Synapse Analytics SQL”) that can connect to an Azure Synapse serverless SQL pool, which can have a view that queries a delta table. However, you are limited to the compute offered by the serverless pool, and if that does not give you the performance you need or if you want direct control on the ability to scale up, you might want to look to instead use the “Azure Databricks” connector which will give you more compute (see Connecting Power BI to Azure Databricks). Note there is a new “Azure Synapse Analytics workspace (Beta)” connector in Power BI that can also query a delta table (see Supercharge BI insights with the new Azure Synapse Analytics workspace connector for Power Query and Azure Synapse Analytics workspace (Beta)), but that is also using serverless SQL pool compute and not Spark pool compute.

Note that an Azure Synapse serverless SQL pool can access data in a data lake, delta lake, and data in a spark table, called Lake database (but only if in Parquet or CSV format and NOT in delta lake format – see Azure Synapse Analytics shared metadata tables). An Azure Synapse Spark pool can access data in a data lake, delta lake, and a Lake database (any format, including delta lake). So if you are using a Lake database that is built on the delta lake format, you would not be able to use an Azure Synapse serverless SQL pool to query it, only a Azure Synapse Spark pool. Which also means if you are using the “Azure Synapse Analytics workspace (Beta)” connector in Power BI, you won’t see it display Lake database tables built on the delta lake format to connect to.

More info:

Exploring Delta Lake in Azure Synapse Analytics

The post Azure Synapse and Delta Lake first appeared on James Serra's Blog.

Original post (opens in new tab)
View comments in original post (opens in new tab)