http://www.sqlservercentral.com/blogs/salvoz-sql/2012/12/11/optimizing-your-bi-semantic-model-for-performance-and-scale/ Printed 2014/09/18 12:16AM
Optimizing Your BI Semantic Model for Performance and Scale
Here’s my notes from the ‘Optimizing Your BI Semantic Model for Performance and Scale (BIA-402-M)’ session at PASS Summit, presented by Akshai Mirchandani and Allan Folting of Microsoft.
Tabular Design Principles
- Speed and performance
- Optimize query performance over processing performance, focus on end-user experience.
- Accommodate changes without forcing reload, if possible
Tabular Model Overview
- Vertipaq is a column based data store.
- Column based data store is good for analytical queries, can ignore columns that aren’t needed for analysis.
- Column based data store provides better compression = faster performance.
- Each column has multiple data structures associated to it (calculated columns are treated like regular columns).
- Data Structures
- Dictionary: Stores unique/distinct values found in a column.
- Segments: Store integer values associated with the dictionary values. A segment is the data size per read or unit of Vertipaq operation.
- Hierarchies: Have separate data storage for efficiency, provide quicker access for querying.
- Relationships: Accelerate lookups across tables.
- Partitions: Contain a group of segments. Can be created on any table, no distinction between dimension and fact tables. Partitioning in tabular is for data management purposes (ability to process partial data, expire old data, etc.).
Storage on Disk
Map column values to data ID. Advantage of data ID is compression.
Two types of encoding:
Arithmetic operation to calculate data ID from column value.
Great for dense value distribution.
Allows computation on data IDs.
Stored in metadata.
Hash table to map values to data ID
Great for sparse value distribution.
Requires hash lookup for computation.
Persisted on disk.
Unique Identifier columns: don’t get much benefit from encoding. Max number of distinct values allowed is 2 billion.
System chooses the encoding type for each column, this is not configurable.
Encoding selection (selecting between value or hash)
Sample rows and look at values (Sparse?, Dense?, String?)
Re-encoding may occur during processing. May change encoding to hash type if it is determined that value encoding won’t work. Can be VERY expensive, requires all processed segments to be re-encoded.
- System Views specific to Tabular (documented on MSDN)
Read and encode data for segment N. Segment data is loaded sequentially in tabular.
Compress segment N and read and encode data for segment N + 1 (happen in parallel). Compression is a greedy-algorithm.
Continue for all remaining segments…
Other structures built at end (calculated columns, hierarchies, etc.).
Create new partitions for new data
ProcessData on new partition or to reload into existing tables
ProcessRecalc to rebuild calculated columns, hierarchies, relationships.
- Two formula engines: MDX and DAX
- Formula engine (FE) calls into Vertipaq to retrieve data
- Query logic pushed to Vertipaq where possible
- Vertipaq query executes in parallel, one core per segment
- Optimized for execution on compressed format
- VertiPaq level caching for chatty FE/Vertipaq communication
- DAX/MDX vs. Vertipaq Query
- Single-threaded per query
- Designed for expressivity
- One core per segment
- Heavily optimized for speed
- Simpler queries pushed to Vertipaq from formula engine.
- Will become more powerful over time
- Inspecting Query Performance
- Capacity Planning
- Ballpark of 3/10th of source data
- Fully reprocessing a model will require double that (old version that queries are still working against and new version being built during processing)
- Need to ensure enough memory available for queries, depends on query patterns (granularity, filters)
- High level estimation: 1/2 of source data, to be safe. Paging is also possible, but not recommended
- Lower end hardware is often better. Newer/latest cores often take longer to reach high-end
- Look for faster cores, better cache systems