SQL on-demand in Azure Synapse Analytics


The new SQL on-demand (to be renamed SQL serverless) offering within Azure Synapse Analytics opens up a whole new world of getting insights into data immediately, at low-cost, using familiar tools/languages. There is no infrastructure to setup or clusters to maintain. It uses a pay-per-query model (you are only being charged per query for the data processed by the query – see pricing) and auto-scales depending on the amount of data being queried. See Quickstart: Use SQL on-demand

A SQL on-demand pool currently can access data from ADLS Gen2, Spark Tables, and Cosmos DB using T-SQL (click picture to enlarge):

Querying data in ADLS Gen2 storage using T-SQL is made easy because of the OPENROWSET function with additional capabilities (check out the T-SQL that is supported). The currently supported file types in ADLS Gen2 that SQL-on-demand can use are Parquet, CSV, and JSON. ParquetDirect and CSV 2.0 add performance improvements (see Benchmarking Azure Synapse Analytics – SQL Serverless, using .NET Interactive). You can also query folders and multiple files and use file metadata in queries.

Here are the three main scenarios that SQL on-demand is great for:

  • Basic discovery and exploration – Quickly view the data in various formats (Parquet, CSV, JSON) in your data lake, so you can plan how to extract insights from it
  • Logical data warehouse – Provide a relational abstraction on top of raw or disparate data without having to relocate or transform the data, allowing an always up-to-date view of your data. By putting T-SQL views on top of data in your data lake, this makes it appear to the end user that they are querying data in a relational database since they are using T-SQL, blurring the line between a relational database and a data lake
  • Data transformation – Simple, scalable, and performant way to transform data in the lake using T-SQL, so it can be fed to BI and other tools, or loaded into a relational data store (Synapse SQL databases, Azure SQL Database, etc.). For example, using the Copy activity in Azure Data Factory you can convert CSV files in the data lake (via T-SQL views in SQL on-demand) to Parquet files in the data lake.  See Azure Synapse Analytics: How serverless is replacing the data warehouse

Different professional roles can benefit from SQL on-demand:

  • Data Engineers can explore the lake, transform and prepare data using this service, and simplify their data transformation pipelines. For more information, check out this tutorial. You could even create a view over data in the data lake and use that to refresh your data in a tabular model
  • Data Scientists can quickly reason about the contents and structure of the data in the lake, thanks to features such as OPENROWSET and automatic schema inference
  • Data Analysts can explore data and Spark external tables created by Data Scientists or Data Engineers using familiar T-SQL language or their favorite tools, which can connect to SQL on-demand. So if you create a Spark Table, that table will be created as an external table in SQL On-Demand so you can query it without having to keep a Spark cluster up and running. Currently this is only supported with Spark Tables created using Parquet as the input format
  • BI Professionals can quickly create Power BI reports on top of data in the lake and Spark tables. You are connecting to a “database” that has no data, but rather views or external tables to data in the data lake. Not only can you create reports, but you can use Power BI to explore the data and do basic discovery

Here are the top FAQ’s about SQL on-demand that I have heard from customers:

  • Will SQL on-demand be able to query SQL Pool tables? It will be able to in the future
  • Is SQL serverless MPP? SQL serverless uses nodes to scale out processing, similar to MPP, although it is completely different architecture from existing MPP data warehouses.  It uses customized SQL engines
  • What are the warmup times when submitting a query? For 95 percentile it is less than 2s. There is a timeout defined after which there is scale-down, so if you don’t execute any query targeting storage during that timeout, your resources will be “taken from you”. Once you get back/connect and execute a query that targets storage, resources are granted back to your endpoint.  Maintained is a pool of warm clusters to get the quick warmup time
  • Is there a predicate pushdown concept for SQL on-demand in Azure Synapse? Yes, there is filter pushdown where SQL on-demand will push down queries from the front-end to back-end nodes. For example, if you query parquet files parquet metadata is used to target only column groups that contain values you are looking for. Microsoft is expanding the range of cases in which filter pushdown is supported
  • What are the best practices for SQL on-demand?  Check out https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-sql-on-demand
  • Why should I choose serverless and have a penalty on first (warmup) query, instead of using provisioned?  It depends on your workload. For constant or high workloads provisioned might be a better choice, while for sporadic or ad-hoc or exploratory workloads serverless might be better fit. It also brings a difference in charges: provisioned means you are paying for resources, while in serverless you are paying for what you use
  • So when would you use provisioned SQL (aka SQL Pool, which is exactly what was in SQL Data Warehouse)? When you need consistent performance, high performance, or have a large number of queries being consistently run. Provisioned SQL may give you better and more predictable performance due to resource reservation. But the good thing is because both use T-SQL, it is easy to transition back-and-forth between SQL Serverless and a SQL pool
  • Most customers are starting to use Delta Lake to store their data. When will SQL OD support this file format?  We are working on the support for Delta Lake, in the following months there will be more details regarding it and when will it be available
  • What is the ADLS Gen2 security? There are different methods to access storage account files using SQL serverless. You can use Synapse workspace MSI, SAS, or control ACL on storage level for a particular AAD account
  • Can we read from files stored in Common Data Model (CDM) format within the Data Lake (ADLS Gen2)?  Yes, at this moment we have a way to read CDM data using the SQL script or provisioned script, which gets metadata from the model.json and creates views over CDM data (native support in the query currently doesn’t exist)
  • Why use external tables over views?  Views are faster and have more features, so avoid external tables if possible. Particularly, virtual functions (filepath and filename are not supported in external tables at this moment which means users cannot do partition elimination on external tables)
  • Can you use PBI and AAS with SQL on-demand?  Yes, you can create views on top of your data lake, use wildcards and filepath/filename functions to expose partitioning column values in a view. That way, for PBI and AAS it is just another object (please make sure that you do proper data type casting of virtual function results as stated in best practices document)
  • Why use a logical data warehouse instead of just using AAS or PBI instead? Hot analytics is one reason as well as a logical data warehouse that can be used to speed up user request to delivery time.  Ideally, there would be no ETLs created (it might not be feasible in all cases, so providing another perspective here)
  • Is there any cost control? Yes, cost control will be available per workspace and time period daily/weekly/monthly. Also, cost estimation for later
  • What about CI/CD? When Synapse goes GA there will be improvements overall in CI/CD area, where customers can create group/department and scope the permissions of particular teams to the artifacts and capabilities
  • Can I use other tools with SQL on-demand? Yes, there is a default endpoint (i.e. workspacejs1966-ondemand.sql.azuresynapse.net) for this service that is provided within every Azure Synapse workspace. Any tool capable of establishing TDS connection SQL offerings can connect to and query Synapse SQL on-demand endpoint. You can connect with Azure Data Studio or SSMS and run ad-hoc queries or connect with Power BI to gain insights. You can also use Private Link to bring your SQL on-demand endpoint into your managed workspace VNet

More info:

Talking about Azure Synapse on Microsoft Mechanics!

Azure Synapse Analytics – Getting Started with SQL On Demand



Power BI and SQL on-demand

Original post (opens in new tab)
View comments in original post (opens in new tab)