Temp Table's Primary Key Default

Question

Temp Table's Primary Key Default

Viewing 15 posts - 16 through 30 (of 31 total)

You must be logged in to reply to this topic. Login to reply

rchantler SSCrazy Points: 2128 More actions · Answer 1

First, thanks to all who have taken an interest in this.

I've played around with this code. One observation - any form of UNIQUE constraint on the temp table results in the poor execution plan. Leaving the temp table as a heap gives the quickest performance. I've boiled the case down to essentials; here is the SQL code in two parts. The first part generates the test data; the second part runs the query. You'll need a tally table to help generate the test data.

-- Note: You will need a tally table to create the test data

-- Early data is monthly; later data is daily

-- Define a mapping table [DateMap] that maps early monthly data to daily dates

-- and later daily data to daily dates so data can be pulled for daily

-- dates over the entire date range

IF OBJECT_ID('dbo.DateMap') IS NOT NULL drop table dbo.DateMap

CREATE TABLE [dbo].[DateMap]

(

[adate] [smalldatetime] NOT NULL,

[calcdate] [smalldatetime] NULL)

INSERT INTO dbo.DateMap(adate)

SELECT CAST(N AS SMALLDATETIME)

FROM dbo.tally2 -- First use of Tally Table

WHERE N BETWEEN 29249 AND 42275

UPDATE dbo.DateMap

SET calcdate = adate

WHERE adate >= '31-aug-2008'

-- Define two CTEs

;WITH datecalc AS(

SELECT *, y= Year(adate), m= Month(adate), rn = ROW_NUMBER()OVER(ORDER BY adate)

FROM dbo.DateMap)

, monthends AS(

SELECT DC1.*

FROM datecalc DC1

JOIN datecalc DC2

ON DC1.rn = DC2.rn-1

WHERE DC1.m <> DC2.m)

UPDATE DM

SET calcDate = X.adate

FROM dbo.dateMap DM

OUTER APPLY( SELECT TOP 1 adate

FROM monthends M

WHERE DM.adate >= M.adate

ORDER BY M.adate DESC

) X

WHERE DM.calcDate IS NULL

ALTER TABLE dbo.DateMap ADD CONSTRAINT PK_DateMap PRIMARY KEY CLUSTERED(Adate ASC)

CREATE NONCLUSTERED INDEX IX_DateMap_calcdate_adate ON dbo.DateMap(CalcDate ASC, Adate ASC)

-- Create the test data

IF OBJECT_ID('dbo.TestCalcs') IS NOT NULL drop table dbo.TestCalcs

CREATE TABLE dbo.TestCalcs(adate SMALLDATETIME NOT NULL,

hid INT NOT NULL,

value FLOAT NULL,

PRIMARY KEY (adate ASC, hid ASC))

INSERT INTO dbo.testCalcs

SELECT X.adate,Y.hid, RAND(Y.hid)

FROM (select distinct calcdate as adate from dbo.dateMap) X

CROSS JOIN (select top 4000 N as hid from dbo.tally2 order by N) Y -- second use of tally table

CREATE NONCLUSTERED INDEX IX_testCalcs ON dbo.testCalcs(hid ASC, adate ASC)

-- Create the view that joins the test data to the mapping table

IF OBJECT_ID('dbo.TestCalcsView') IS NOT NULL drop view dbo.TestCalcsView

GO

CREATE VIEW dbo.testCalcsView AS

SELECT

DM.adate,

CL.hid,

CL.value

FROM

rc_temp.dbo.DateMap DM

JOIN

rc_temp.dbo.testCalcs CL

ON DM.CalcDate = CL.Adate

GO

Once the test data is generated, here is the test run code.

-- Test Run

DBCC FREEPROCCACHE

DBCC DROPCLEANBUFFERS

-- Temp tables

IF OBJECT_ID('tempdb..#hidvals') IS NOT NULL DROP TABLE #hidvals

CREATE TABLE #hidvals(hid INT NOT NULL,

-- hid INT NOT NULL UNIQUE,

extel2code CHAR(2),

start SMALLDATETIME,

finish SMALLDATETIME,

value1 FLOAT,

value2 FLOAT)

-- CONSTRAINT hidvals_unique_hid UNIQUE (hid))

-- PRIMARY KEY CLUSTERED(hid))

--CREATE UNIQUE CLUSTERED INDEX IDX_hidvals ON #hidvals(hid)

INSERT INTO #hidvals(hid, start, finish)

SELECT hid ,'17-sep-2015', '18-sep-2015'

FROM testCalcs

WHERE adate = '18-sep-2015'

-- UPDATE STATISTICS #hidvals -- this makes no differen

-- Here's the statement with the slow/fast plan (depending)

UPDATE H

SET value1 = X1.[value], value2 = X2.[value]

FROM #hidvals H

LEFT JOIN rc_temp.dbo.testCalcsView X1

ON H.hid = X1.hid AND H.start = X1.adate

LEFT JOIN rc_temp.dbo.testCalcsView X2

ON H.hid = X2.hid AND H.finish = X2.adate

Kim Crosser SSCommitted Points: 1763 More actions · Answer 2

Kristen-173977 (9/29/2015)
rchantler (9/28/2015)
CREATE CLUSTERED INDEX IDX_hidvals ON #hidvals(hid)
That is not declared as UNIQUE, so not quite the same thing as a PRIMARY KEY's Clustered Index (SQL will add a tie-break value which will make the key size wider).
Also, Primary Key will not allow NULLs which might effect how the optimiser treats the index - although you have declared [hid] as NOT NULL so maybe that is as broad as it is long ...

Maybe I am missing something, but I think Kristen hit the real issue here. A Clustered Index does NOT imply uniqueness, while a Primary Key does. So when you are trying to update the table with a Primary Key definition, SQL Server must test each new record for non-existence in the table.

Simply defining an index as Clustered just causes efficient organization of the key/data. SQL Server doesn't need to test each new value to see if it is already in the table.

Unless I missed a test in there somewhere, did you try defining that index as a UNIQUE CLUSTERED index? I would expect that to have the same performance as the Primary Key.

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 3

Kim Crosser (10/8/2015)
Kristen-173977 (9/29/2015)
rchantler (9/28/2015)
CREATE CLUSTERED INDEX IDX_hidvals ON #hidvals(hid)
That is not declared as UNIQUE, so not quite the same thing as a PRIMARY KEY's Clustered Index (SQL will add a tie-break value which will make the key size wider).
Also, Primary Key will not allow NULLs which might effect how the optimiser treats the index - although you have declared [hid] as NOT NULL so maybe that is as broad as it is long ...
Maybe I am missing something, but I think Kristen hit the real issue here. A Clustered Index does NOT imply uniqueness, while a Primary Key does. So when you are trying to update the table with a Primary Key definition, SQL Server must test each new record for non-existence in the table.
Simply defining an index as Clustered just causes efficient organization of the key/data. SQL Server doesn't need to test each new value to see if it is already in the table.
Unless I missed a test in there somewhere, did you try defining that index as a UNIQUE CLUSTERED index? I would expect that to have the same performance as the Primary Key.

The script posted by rchantler (thanks, very handy) demonstrates the effect nicely. I've made a number of changes to it to eliminate some chaff and get a better idea of the problem domain. Both queries now use nonclustered indexes except of course for the index on the temp table, and the problem remains. The index on the temp table is now just a clustered index. If you comment out either one of the two left joins, or change either of them to inner joins, the problem goes. If you add [start] or [finish] (or both) to the key list of the clustered index on the temp table as the second key, the problem changes but it's not fixed.

Note that #hidvals.start has the same value for every row, as does #hidvals.finish.

Creating statistics on [start] / [finish] doesn't change anything, the problem remains. Here's the modified script:

-- Note: You will need a tally table to create the test data

-- Early data is monthly; later data is daily

-- Define a mapping table [DateMap] that maps early monthly data to daily dates

-- and later daily data to daily dates so data can be pulled for daily

-- dates over the entire date range

-- create tally table

IF OBJECT_ID('TempDB..#tally2') IS NOT NULL drop table #tally2

;WITH Ten AS (SELECT * FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) d (n))

SELECT n = ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) INTO #tally2 FROM Ten a, Ten b, Ten c, Ten d, Ten e

CREATE UNIQUE CLUSTERED INDEX cx_tally ON #tally2 (n)

GO

IF OBJECT_ID('dbo.DateMap') IS NOT NULL drop table dbo.DateMap

CREATE TABLE [dbo].[DateMap]

(

[adate] [smalldatetime] NOT NULL,

[calcdate] [smalldatetime] NULL,

Guid1 UNIQUEIDENTIFIER,

Guid2 UNIQUEIDENTIFIER,

Guid3 UNIQUEIDENTIFIER

)

INSERT INTO dbo.DateMap(adate, Guid1, Guid2, Guid3 )

SELECT CAST(N AS SMALLDATETIME), NEWID(), NEWID(), NEWID() -- make table wider to encourage use of index CJM

FROM #tally2 -- First use of Tally Table

WHERE N BETWEEN 29249 AND 42275

UPDATE dbo.DateMap

SET calcdate = adate

WHERE adate >= '31-aug-2008'

-- Define two CTEs

;WITH datecalc AS(

SELECT *, y= Year(adate), m= Month(adate), rn = ROW_NUMBER()OVER(ORDER BY adate)

FROM dbo.DateMap)

, monthends AS(

SELECT DC1.*

FROM datecalc DC1

JOIN datecalc DC2

ON DC1.rn = DC2.rn-1

WHERE DC1.m <> DC2.m)

UPDATE DM

SET calcDate = X.adate

FROM dbo.dateMap DM

OUTER APPLY( SELECT TOP 1 adate

FROM monthends M

WHERE DM.adate >= M.adate

ORDER BY M.adate DESC

) X

WHERE DM.calcDate IS NULL

--ALTER TABLE dbo.DateMap ADD CONSTRAINT PK_DateMap PRIMARY KEY CLUSTERED(Adate ASC)

CREATE UNIQUE CLUSTERED INDEX UCX_DateMap ON dbo.DateMap (Guid1)

CREATE UNIQUE NONCLUSTERED INDEX IX_DateMap_calcdate_adate ON dbo.DateMap (CalcDate ASC, Adate ASC)

CREATE UNIQUE NONCLUSTERED INDEX IX_DateMap_adate ON dbo.DateMap (Adate ASC) INCLUDE (CalcDate)

GO

-- Create the test data

IF OBJECT_ID('dbo.TestCalcs') IS NOT NULL drop table dbo.TestCalcs

CREATE TABLE dbo.TestCalcs(adate SMALLDATETIME NOT NULL,

hid INT NOT NULL,

value FLOAT NULL)

INSERT INTO dbo.testCalcs

SELECT X.adate,Y.hid, RAND(Y.hid)

FROM (select distinct calcdate as adate from dbo.dateMap) X

CROSS JOIN (select top 4000 N as hid from #tally2 order by N) Y -- second use of tally table

CREATE UNIQUE CLUSTERED INDEX UCX_testCalcs_adate_hid ON dbo.testCalcs (adate, hid)

CREATE UNIQUE NONCLUSTERED INDEX IX_testCalcs_hid_adate ON dbo.testCalcs(hid ASC, adate ASC) INCLUDE (value)

CREATE UNIQUE NONCLUSTERED INDEX IX_testCalcs_adate_value ON dbo.testCalcs(adate ASC, value) INCLUDE (hid)

-- Create the view that joins the test data to the mapping table

IF OBJECT_ID('dbo.TestCalcsView') IS NOT NULL drop view dbo.TestCalcsView

GO

CREATE VIEW dbo.testCalcsView AS

SELECT

DM.adate,

CL.hid,

CL.value

FROM dbo.DateMap DM

JOIN dbo.testCalcs CL

ON DM.CalcDate = CL.Adate

GO

SELECT COUNT(*) FROM dbo.testCalcsView -- 52,108,000 rows

--=============================================================================================================================================

-- Temp table

IF OBJECT_ID('tempdb..#hidvals') IS NOT NULL DROP TABLE #hidvals

CREATE TABLE #hidvals(hid INT NOT NULL,

extel2code CHAR(2),

start SMALLDATETIME,

finish SMALLDATETIME,

value1 FLOAT,

value2 FLOAT)

INSERT INTO #hidvals(hid, start, finish)

SELECT hid ,'17-sep-2015', '18-sep-2015'

FROM testCalcs

WHERE adate = '18-sep-2015'

CREATE UNIQUE CLUSTERED INDEX ucx_hidvals ON #hidvals(hid)

CREATE STATISTICS stx_finish ON #hidvals (finish)

CREATE STATISTICS stx_start ON #hidvals (start)

CREATE STATISTICS stx_startfinish ON #hidvals (start,finish)

UPDATE STATISTICS testCalcs

UPDATE STATISTICS DateMap

--=======================================================================================================

DROP INDEX ucx_hidvals ON #hidvals

SELECT -- Q1, no index

value1 = X1.[value],

value2 = X2.[value]

FROM #hidvals H

LEFT JOIN dbo.testCalcsView X1

ON H.hid = X1.hid AND H.start = X1.adate

left JOIN dbo.testCalcsView X2

ON H.hid = X2.hid AND H.finish = X2.adate

--CREATE UNIQUE CLUSTERED INDEX ucx_hidvals ON #hidvals(hid)

CREATE UNIQUE CLUSTERED INDEX ucx_hidvals ON #hidvals(hid,finish)

SELECT -- Q2, unique clustered index

value1 = X1.[value],

value2 = X2.[value]

FROM #hidvals H

LEFT JOIN dbo.testCalcsView X1

ON H.hid = X1.hid AND H.start = X1.adate

left JOIN dbo.testCalcsView X2

ON H.hid = X2.hid AND H.finish = X2.adate

--=======================================================================================================

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

rchantler SSCrazy Points: 2128 More actions · Answer 4

Yes, there are at least 2 questions - 1. why the poor performance with adding a UNIQUE constraint and 2. why does switching from left join to inner join result in good performance even with the UNIQUE constraint.

Seems to me 'the answer' would explain both questions.

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 5

rchantler (10/9/2015)
Yes, there are at least 2 questions - 1. why the poor performance with adding a UNIQUE constraint and 2. why does switching from left join to inner join result in good performance even with the UNIQUE constraint.
Seems to me 'the answer' would explain both questions.

It doesn't have to be a constraint. It doesn't have to be a clustered index either - an ordinary index (with no clustered index on the table) like this reproduces the effect too:

CREATE UNIQUE INDEX ucx_hidvals ON #hidvals(hid) INCLUDE (start, finish)

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

rchantler SSCrazy Points: 2128 More actions · Answer 6

You can even define the field to be UNIQUE in the table definition to get the effect.

rchantler SSCrazy Points: 2128 More actions · Answer 7

rchantler

SSCrazy

Points: 2128

October 9, 2015 at 8:03 am

#1832546

Anything that results in uniqueness, it seems.

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 8

rchantler (10/9/2015)
You can even define the field to be UNIQUE in the table definition to get the effect.

That's enforced by a UNIQUE index, so it's the same as we have now. A unique index, clustered or not.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 9

ChrisM@Work (10/9/2015)
rchantler (10/9/2015)
You can even define the field to be UNIQUE in the table definition to get the effect.
That's enforced by a UNIQUE index, so it's the same as we have now. A unique index, clustered or not.

Also, you don't need an index on the temp table (or a temp table, it can be permanent) to switch the effect on and off - you can use statistics objects on a heap.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Kim Crosser SSCommitted Points: 1763 More actions · Answer 10

It looks to me like the following is occurring.

Where there is no Unique index (or constraint) on the HID column (and/or the statistics don't show that the HID column is a nice flat histogram), the query optimizer decides to do a table scan on the table, which appears to be pretty efficient. (i.e., read a row, compare all three fields and update in one pass.) The table shown is pretty compact, and without knowing the probable percentage of rows to be updated in advance, a table scan may be the best solution.

However, when the Uniqueness constraint is defined, or the statistics show the flat distribution of HID, the optimizer decides that the index is worth using. Unfortunately, this causes lots of index seeks, followed by row fetches of a lot of rows in order to compare the datetime columns (it may even wind up doing two data row fetches per actual data row as a result of the two separate left joins on the different datetime columns.

I would bet that either of the following would perform very well:

create unique index IDX_Hidvals_hid_start on #hidvals(hid,start);

create unique index IDX_Hidvals_hid_finish on #hidvals(hid,finish);

or

create unique index IDX_Hidvals_hid_start on #hidvals(hid) include (start);

create unique index IDX_Hidvals_hid_finish on #hidvals(hid) include (finish);

Either of these would eliminate the need to do the data row fetches to validate the "start" or "finish" criteria, and should be at least as fast as the non-unique clustered index (probably a lot faster - assuming that the update isn't updating a large percentage of the table rows).

Clustering one of the two indexes might or might not make a difference - what is ordered well by HID+Start may not be ordered well for a HID+Finish query, although if the dates "tend" to be similarly ordered, then I would cluster the HID+Start index.

rchantler SSCrazy Points: 2128 More actions · Answer 11

Thanks Kim,

With either of both indexes in your first set of 2 I see quick execution i.e. the good plan.

With either of both indexes in your second set of 2 I get slow execution i.e. the poor plan.

Based on the estimated subtree cost the heap is still the fastest.

rchantler SSCrazy Points: 2128 More actions · Answer 12

That last reply should say 'with either or both" not "of both"

Kim Crosser SSCommitted Points: 1763 More actions · Answer 13

I should have been clearer - I would recommend creating BOTH indexes in either of the two sets, not just one or the other. Because the query has two outer joins, one on HID+Start and one on HID+Finish, I would create the pair of indexes (in either style - although the first ones with the composite keys are probably always going to be better in this case).

(edited)

BTW - the (possibly?) optimum configuration is probably:

create unique clustered index IDX_Hidvals_hid_start on #hidvals(hid,start);

create unique index IDX_Hidvals_hid_finish on #hidvals(hid,finish);

This will organize the data by HID+Start as well as creating efficient indexes.

rchantler SSCrazy Points: 2128 More actions · Answer 14

I think I have at least a partial explanation re the slow/fast query.

The query produces two different execution plans. If table #hidvals is a heap, the optimizer generates plan A. If table #hidvals has any form of UNIQUE constraint on the hid field the optimizer generates plan B.

Using the original test data plan A runs far quicker than plan B. The question is, given the optimizer has enough scope to choose either plan A or B, why does it choose the much slower plan B when #hidvals is not a heap?

The explanation is the original data in table #hidvals is not random. I.e. #hidvals has 2 date fields, start and finish. In the original formulation finish has the same single value for each row in the table and start has the same single value (the date before finish) for each row.

The optimizer does not seem to use the fact that each date field contains only one value.

Once the optimizer creates the ‘best heap plan’, when it runs against the special case dates it finishes in an instant. However, the heap plan does not perform so well against randomized dates.

The ‘best non-heap plan’ has comparable performance whether the dates are special case or random.

In fact the ‘best non-heap plan’ outperforms the ‘best heap plan’ (by some measures) running against randomized dates.

ChrisM@Work SSC Guru Points: 186127 More actions · Answer 15

rchantler (10/23/2015)
I think I have at least a partial explanation re the slow/fast query.
The query produces two different execution plans. If table #hidvals is a heap, the optimizer generates plan A. If table #hidvals has any form of UNIQUE constraint on the hid field the optimizer generates plan B.
Using the original test data plan A runs far quicker than plan B. The question is, given the optimizer has enough scope to choose either plan A or B, why does it choose the much slower plan B when #hidvals is not a heap?
The explanation is the original data in table #hidvals is not random. I.e. #hidvals has 2 date fields, start and finish. In the original formulation finish has the same single value for each row in the table and start has the same single value (the date before finish) for each row.
The optimizer does not seem to use the fact that each date field contains only one value.
Once the optimizer creates the ‘best heap plan’, when it runs against the special case dates it finishes in an instant. However, the heap plan does not perform so well against randomized dates.
The ‘best non-heap plan’ has comparable performance whether the dates are special case or random.
In fact the ‘best non-heap plan’ outperforms the ‘best heap plan’ (by some measures) running against randomized dates.

Eirikur and I have spent more than a little time investigating this and we haven't finished yet. It's almost certainly a cardinality estimator bug since a) the estimates appear to be off by three orders of magnitude when using the unique index vs the non-unique index (or equivalent statistics objects) and b) the same effect isn't observed in SQL Server 2014, which has a new cardinality estimator.

We have a simplified pair of plans (queries) which show the effect more clearly than the original but need a little more time to reduce noise from the sample data set.

^{“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw}

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

Temp Table's Primary Key Default

Cookies on SQLServerCentral