RE: Primary Key creation - before and after bulk insert

SSC Guru

Points: 150467

August 13, 2010 at 8:21 pm

#1206983

Hey Arjun,

There are a few important things here that determine the fastest method:

First, you must avoid an explicit sort operation if possible. In your first script (where the clustered index exists before the load) you are loading the result of a ROW_NUMBER ranking function into an INTEGER column. ROW_NUMBER returns BIGINT; the type mismatch is enough to confuse the optimiser in thinking it needs to sort the results before inserting into the table.

Second, try to achieve a minimally-logged load. In SQL Server 2005, an INSERT...SELECT statement is always fully logged. The minimally-logged alternative is SELECT...INTO. This does mean creating the clustered index as a second operation, but the benefit of minimal logging more than makes up for this overhead. Just to confuse matters, SQL Server 2008 can do a minimally-logged INSERT...SELECT in many circumstances.

Finally, try not to use master.dbo.spt_values - use a documented system table instead. The reason is that the internals of spt_values changed significantly in SQL Server 2008, making it a very slow way to grab rows.

The fastest method (for me) depends on which version of SQL Server I use. The lack of a minimally-logged INSERT in SQL Server 2005 means that SELECT...INTO followed by a clustered index build is faster. The improvements in SQL Server 2008 make INSERT...SELECT (with an existing clustered index) faster.

So, my 'improved' version of the first query (pre-existing index) is:

--

-- Load into an existing index

-- (fastest on 2008)

--

IF OBJECT_ID(N'tempdb..#Tally', N'U')

IS NOT NULL

BEGIN

DROP TABLE #Tally;

END;

GO

DECLARE @StartTime DATETIME;

SET @StartTime = GETDATE();

CREATE TABLE #Tally

(

N BIGINT NOT NULL PRIMARY KEY CLUSTERED

);

INSERT #Tally

(N)

SELECT TOP (10000)

ROW_NUMBER() OVER (ORDER BY (SELECT 1))

FROM master.sys.allocation_units A,

master.sys.allocation_units B;

SELECT [INSERT...SELECT] = DATEDIFF(MILLISECOND, @StartTime, GETDATE());

GO

DROP TABLE #Tally;

'Improved' version of the second query (load then index) is:

--

-- Load then create index

-- (fastest on 2005)

--

IF OBJECT_ID(N'tempdb..#Tally', N'U')

IS NOT NULL

BEGIN

DROP TABLE #Tally;

END;

GO

DECLARE @StartTime DATETIME;

SET @StartTime = GETDATE();

SELECT TOP (10000)

ISNULL(ROW_NUMBER() OVER(ORDER BY (SELECT 1)), -1) AS N

INTO #Tally

FROM master.sys.allocation_units A,

master.sys.allocation_units B;

ALTER TABLE #Tally

ADD PRIMARY KEY CLUSTERED (N)

WITH

FILLFACTOR = 100;

SELECT [SELECT...INTO] = DATEDIFF(MILLISECOND, @StartTime, GETDATE());

GO

DROP TABLE #Tally;

Typical run times on 2008:

Query 1: 13ms

Query 2: 23ms

Typical run times on 2005:

Query 1: 96ms

Query 2: 20ms

Paul

Paul White
SQLPerformance.com
SQLkiwi blog
@SQL_Kiwi