When loading tables in Microsoft’s Parallel Data Warehouse (PDW) via T-SQL, one will soon encounter the need to load data into a new table. Whether the developer needs to perform an initial data load of table or even go through the PDW-specific steps to perform a ‘trunc and load’ of a table, there are a two table load methods to consider. Before we start you may be asking yourself – “What about SELECT INTO?” – SELECT INTO is not supported as of the AU3 realease of PDW
PDW T-SQL Load Methods:
- INSERT SELECT
- CREATE TABLE AS SELECT(CTAS)
To make a direct comparison between the two methods, I have provided the actual PDW query and execution plans generated by each method using identical SELECTs
Load Method: INSERT SELECT
Usage
Appends one or more new rows to a table. Add one row by listing the row values or add multiple rows by inserting the results of a SELECT statement.
Example
Insert Into TABLEA SELECT only, use, this, method, to, append FROM TABLEB
Generated Query Plan
Generated Execution Plan
Load Method: Create Table as Select (CTAS)
Usage
Creates a new table in SQL Server PDW that is populated with the results from a SELECT statement.
Example
CREATE TABLE myTable (
yearId int NOT NULL,
some varchar(50),
stuff varchar(50))
WITH
( CLUSTERED INDEX (yearId) );
Generated Query Plan
Generated Execution Plan
*When comparing to INSERT SELECT, notice the absence of the Index Scan and Parallelism steps above. CTAS results in an Index Seek for this particular SELECT and also does away with the Parallellism steps.
The Verdict – Use CREATE TABLE AS SELECT (CTAS) when loading empty tables!
Performance Results
CTAS loaded the test table more that twice as fast as INSERT SELECT.
- CTAS runtime: 1min 44sec
- INSERT SELECT runtime: 4min 2sec
Simple Explanation
On the PDW, INSERT SELECT is a fully logged transaction and occurs serially per each distribution. CTAS, however, is minimally logged and happens parellelly across all nodes.
NOTE: When performing CTAS or re-creating tables in any way, you will need to create statistics on those tables upon loading.
Ideally, you should create stats on all the join columns, group by, order by and restriction. SQL Server PDW does not automatically create and update statistics on the Control node for every occasion when SQL Server creates or updates statistics on the Compute nodes:
– This will create stat for all columns on all objects
select ‘create statistics ‘ + b.name + ‘ on dbo.’ + a.name + ‘ (‘ + b.name + ‘)’
from sys.tables a, sys.columns b
where a.object_id = b.object_id and not exists (
select null from sys.stats_columns where object_id in (select object_id from sys.stats_columns group by object_id having count(*)>=1)
and object_id = b.object_id and column_id = b.column_id)
order by a.name, b.column_id;



Subscribe to this blog
Briefcase
Print




Loading comments...