Query Acceleration for ADLS


Just announced is Query Acceleration for Azure Data Lake Storage Gen2 (ADLS) as well as Blob Storage. This is a new capability for ADLS that enables applications and analytics frameworks to dramatically optimize data processing by retrieving only the data that they require to perform a given operation from storage. This reduces the time and processing power that is required to query stored data.

For example, if an application will execute a SELECT statement that filters columns and rows from a csv file, instead of all pulling the entire csv file over the network into the application and then filtering the data, it will instead do the filtering at the time the data is read from the disk, so that only the filtered data is transferred over the network to the application. So if you have a csv file with 50 columns and 1 million rows, but the filters limit the data to 5 columns and 1000 rows, then only the 5 columns and 1000 rows will be retrieved from the disk and sent over the network to the application.

It accomplishes this by pushing-down predicates and column projections, so they may be applied at the time data is first read, enabling applications to filter rows and columns at the time that data is read from disk so that all downstream data handling is saved from the cost of filtering and processing unrequired data. This improves network latency and compute cost (an analysis showed that 80% of data is needlessly transferred across the network, parsed, and filtered by applications). Also, the CPU load that is required to parse and filter unneeded data requires your application to provision a greater number and larger VMs in order to do it’s work. By transferring this compute load to query acceleration, applications can realize significant cost savings.

You can use SQL to specify the row filter predicates and column projections in a Query Acceleration request. A request processes only one file. Therefore, advanced relational features of SQL, such as joins and the GROUP BY aggregate, aren’t supported. Query acceleration supports CSV and JSON formatted data as input to each request.

The following diagram illustrates how a typical application uses Query Acceleration to process data:

How a typical application uses Query Acceleration to process data:
  1. The client application requests file data by specifying predicates and column projections
  2. Query Acceleration parses the specified query and distributes work to parse and filter data
  3. Processors read the data from the disk, parse the data by using the appropriate format, and then filter data by applying the specified predicates and column projections. Azure Storage already consists of a non-trivial amount of compute to implement all of the storage functionality (eg. serve requests, encrypt/decrypt, attach JBODs, etc.). Query Acceleration simply is allocated a quota of this resource to do its varied jobs
  4. Query Acceleration combines the response shards to stream back to client application
  5. The client application receives and parses the streamed response. The application doesn’t need to filter any additional data and can apply the desired calculation or transformation directly

Query acceleration supports an ANSI SQL-like language for expressing queries over blob contents. The query acceleration SQL dialect is a subset of ANSI SQL, with a limited set of supported data types, operators, etc., but it also expands on ANSI SQL to support queries over hierarchical semi-structured data formats such as JSON.

Right now this works via Java and .NET, and in the future it will work for such tools as Python and Azure Synapse Analytics (Microsoft is actively working with OSS and commercial partners to integrate Query Acceleration into these frameworks, as appropriate).

Due to the increased compute load within the ADLS service, the pricing model for using query acceleration differs from the normal ADLS transaction model. Query acceleration charges a cost for the amount of data scanned as well as a cost for the amount of data returned to the caller.

Despite the change to the billing model, Query acceleration’s pricing model is designed to lower the total cost of ownership for a workload, given the reduction in the much more expensive VM costs.

To find out more about Query Acceleration for Azure Data Lake Storage you can:

Original post (opens in new tab)
View comments in original post (opens in new tab)