Blog Post

Parallel Data Warehouse (PDW) Basics: CREATE TABLE and CTAS Syntax

,

After forgetting the syntax for temp tables a couple of times, I decided to write a brief overview of the table creation syntax for Microsoft’s Parallel Data Warehouse (PDW) architecture.

One thing worth noting up-front, this post does not include considerations that should be made in determinining whether a table should be distributed vs replicated.

TEMPORARY TABLE BASICS

-Temp tables will reside in the tempdb (no surprise there), however, the ## global prefix as we see it on SQL Server SMP is not available.

-Temp tables can be either distributed or replicated

-Temp tables cannot have partitions, views, or indexes

-Table permissions cannot be changed

-Visible in only the current session

- CONTROL, INSERT, SELECT and UPDATE permissions are granted to the temp table creator.

Example: Distributed temporary table

CREATE TABLE testDB.dbo.#TableA (
   here int NOT NULL,
   we varchar(50),
   go varchar(50))
WITH
   ( LOCATION = USER_DB,
    DISTRIBUTION = HASH (here) );

NON-TEMP TABLE BASICS

-Indexes cannot be added (or dropped) after a table has been created (must use ‘CREATE TABLE AS SELECT’ aka ‘CTAS’ syntax if the developer wants to preserve the underlying data).

-All tables are created with page compression, there is no configurable option for this.

-When creating a distributed table, keep in mind that partitions will exist in the 8 distributions per compute node. So if you want the math on that, an db may have 10 year partitions on a server - in PDW, this translates to 10(partitions) * 8 (distributions) = 80 partitions on each compute node.  If you have a Parallel Data Warehouse (PDW) ‘Full-Rack’, there will be a total of 800 partitions (80 partitions per node * 10 nodes) …  interesting nonetheless!

-The default collation for the PDW appliance is Latin1_General_100_CI_AS_KS_WS – this is configurable, however, use caution if you plan on modifying a column collation as it can lead to collation errors when comparing strings.

Example: Replicated Table with Index

CREATE TABLE myTable (
   yearId int NOT NULL,
   some varchar(50),
   stuff varchar(50))
WITH 
    ( CLUSTERED INDEX (yearId) );

Example: Distributed Table with Multiple Partitions

CREATE TABLE TableA(
   yearId int NOT NULL,
   some varchar(50),
   stuff varchar(50)
WITH     ( DISTRIBUTION = HASH (yearId),     
PARTITION ( YearPurchased     
RANGE RIGHT FOR VALUES       
( 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012 )));

Example:  Adding a new distribution column and clustered index to an existing table using ‘Create Table as Select’ aka ‘CTAS’

CREATE TABLE TableTemp WITH ( 
    DISTRIBUTION = HASH (newDistributionColumn),
    CLUSTERED INDEX (id ASC) )
AS SELECT * FROM TableOriginal;
DROP TABLE TableOriginal
RENAME OBJECT [TableTemp] TO [TableOriginal]

Below is the base syntax for creating tables as provided by Microsoft:

  CREATE TABLE [ database_name . [ dbo ] . | dbo. ] table_name
(
{ column_name
[ COLLATE Windows_collation_name ]
[ NULL | NOT NULL ] }
[ ,...n ]
)[ WITH ( <table_option> [ ,...n ] ) ]

Create a new temporary table.

CREATE TABLE [ database_name . [ dbo ] . | dbo. ] #table_name
(
{ column_name
[ COLLATE Windows_collation_name ]
[ NULL | NOT NULL ] }
[ ,...n ]
)
WITH ( LOCATION = USER_DB [, <table_option>
[ ,...n ] ] ) [;]
::=
datetimeoffset [ ( n ) ]
| datetime2 [ ( n ) ]
| datetime
| smalldatetime
| date
| time [ ( n ) ]
| float [ ( n ) ]
| real [ ( n ) ]
| decimal [ ( precision [ , scale ] ) ]
| money
| smallmoney
| bigint
| int
| smallint
| tinyint
| bit
| nvarchar [ ( n ) ]
| nchar [ ( n ) ]
| varchar [ ( n ) ]
| char [ ( n ) ]
| varbinary [ ( n ) ]
| binary [ ( n ) ]
::={[ CLUSTERED INDEX ( { index_column_name [ ASC | DESC ] } [ ,...n ] ) ] | [ DISTRIBUTION = { HASH( distribution_column_name ) | REPLICATE } ] | [ PARTITION( partition_column_name RANGE [ LEFT|RIGHT ] FOR VALUES ( [ boundary_value [,...n] ] ) ) ] }

NOTE: When performing CTAS or re-creating tables in any way, you will need to create statistics on those tables upon loading.

Ideally, you should create stats on all the join columns, group by, order by and restriction. SQL Server PDW does not automatically create and update statistics on the Control node for every occasion when SQL Server creates or updates statistics on the Compute nodes:

– This will create stat for all columns on all objects

select ‘create statistics ‘ + b.name + ‘ on dbo.’ + a.name + ‘ (‘ + b.name + ‘)’ from sys.tables a, sys.columns b where a.object_id = b.object_id and not exists ( select null from sys.stats_columns where object_id in (select object_id from sys.stats_columns group by object_id having count(*)>=1) and object_id = b.object_id and column_id = b.column_id) order by a.name, b.column_id;

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating