August 30, 2010 at 11:52 am
I have a Deposits table which has approximately 270,000 rows in it. This table has a clustered index on DepositEntryId, which is an identity field (integer). I also have a TempDeposits table which has approximately 66,000 rows in it. It also has a DepositEntryId column (a subset of the DepositEntryIds from the Deposits table), but it has no indexes on it. (This table is actually a table variable in a stored procedure; I just persisted it into a table to troubleshoot.)
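For reference, the two structures look roughly like this (simplified sketch; columns and indexes not relevant here are omitted):
create table dbo.Deposits
(
DepositEntryId int identity(1,1) not null,
-- ItemCount, ReceivedItemCount and the other columns are omitted here
constraint PK_Deposits primary key clustered (DepositEntryId)
)
create table dbo.TempDepositsTemp
(
DepositEntryId int not null -- heap, no indexes
)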
I ran the two queries below, hoping that the clustered index would be used. It wasn't. When I used an index hint (PK_Deposits, which is a clustered index), the performance got much worse. Below the queries, the execution plans are included.
select COUNT(*) from Deposits where DepositEntryId in (select DepositEntryId from TempDepositsTemp)
***********************************************************************
|--Compute Scalar(DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[Expr1010],0)))
|--Stream Aggregate(DEFINE:([Expr1010]=Count(*)))
|--Hash Match(Right Semi Join, HASH:([AfsDepositGatewayRefactored].[dbo].[TempDepositsTemp].[DepositEntryId])=([AfsDepositGatewayRefactored].[dbo].[Deposits].[DepositEntryId]), RESIDUAL:([AfsDepositGatewayRefactored].[dbo].[Deposits].[DepositEntryId]=[AfsDepositGatewayRefactored].[dbo].[TempDepositsTemp].[DepositEntryId]))
|--Table Scan(OBJECT:([AfsDepositGatewayRefactored].[dbo].[TempDepositsTemp]))
|--Index Scan(OBJECT:([AfsDepositGatewayRefactored].[dbo].[Deposits].[IX_Deposits_Item_n_ReceivedItemCount]))
************************************************************************
select COUNT(*) from Deposits with(index (PK_Deposits)) where DepositEntryId in (select DepositEntryId from TempDepositsTemp)
********************************************************************
|--Compute Scalar(DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[globalagg1009],0)))
|--Stream Aggregate(DEFINE:([globalagg1009]=SUM([partialagg1008])))
|--Parallelism(Gather Streams)
|--Stream Aggregate(DEFINE:([partialagg1008]=Count(*)))
|--Hash Match(Right Semi Join, HASH:([AfsDepositGatewayRefactored].[dbo].[TempDepositsTemp].[DepositEntryId])=([AfsDepositGatewayRefactored].[dbo].[Deposits].[DepositEntryId]), RESIDUAL:([AfsDepositGatewayRefactored].[dbo].[Deposits].[DepositEntryId]=[AfsDepositGatewayRefactored].[dbo].[TempDepositsTemp].[DepositEntryId]))
|--Bitmap(HASH:([AfsDepositGatewayRefactored].[dbo].[TempDepositsTemp].[DepositEntryId]), DEFINE:([Bitmap1010]))
| |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([AfsDepositGatewayRefactored].[dbo].[TempDepositsTemp].[DepositEntryId]))
| |--Table Scan(OBJECT:([AfsDepositGatewayRefactored].[dbo].[TempDepositsTemp]))
|--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([AfsDepositGatewayRefactored].[dbo].[Deposits].[DepositEntryId]))
|--Clustered Index Scan(OBJECT:([AfsDepositGatewayRefactored].[dbo].[Deposits].[PK_Deposits]), WHERE:(PROBE([Bitmap1010],[AfsDepositGatewayRefactored].[dbo].[Deposits].[DepositEntryId],N'[IN ROW]')))
************************************************************************
In the first query, the optimizer didn't use the clustered index; instead, it used an index that was created on two integer columns (ItemCount and ItemReceivedCount). I don't understand how the optimizer decided to use the IX_Deposits_Item_n_ReceivedItemCount index, which has nothing to do with the DepositEntryId column (and I don't see any bookmark lookups). When I forced the use of the clustered index with a hint, the logical reads went through the roof.
If someone could explain this behavior, I would appreciate it. If you need more information, please let me know. Thanks in advance.
August 30, 2010 at 4:19 pm
Three issues:
1) Never use SELECT COUNT(*)...
Besides being lazy coding, SQL assumes you actually want to count all the columns. Since you only actually need DepositEntryId rather use SELECT COUNT(DepositEntryId) from...
2) Then you have a sub-query; these are very inefficient.
Rather write this as a join.
select COUNT(t.DepositEntryId) from Deposits d
join TempDepositsTemp t on d.DepositEntryId = t.DepositEntryId
3) Rather use a temp table than a table variable. Since SQL has no idea how many rows you will insert into a table variable, it treats it a bit like a variable and assumes 1 row only. The parser is happy with this if you only load a few hundred rows into the table variable, but if you add 66,000 rows the query plan is often not appropriate (i.e. useless). I'd also add a primary key to the temp table (see the sketch below).
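Putting points 2 and 3 together, the sketch would look something like this (table and column names taken from your post; the filter that populates the temp table is left out):
create table #TempDeposits
(
DepositEntryId int not null primary key clustered -- primary key gives an index plus statistics
)
insert into #TempDeposits (DepositEntryId)
select DepositEntryId from dbo.Deposits -- plus whatever conditions populate the table variable in your real code
select COUNT(d.DepositEntryId)
from dbo.Deposits d
join #TempDeposits t on d.DepositEntryId = t.DepositEntryId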
By adding the index hint you have forced SQL to do a clustered index scan, i.e. scan the whole table, whereas in the first query the parser had worked out that going via the IX_Deposits_Item_n_ReceivedItemCount index was more efficient. I'd need to see the details of this index to explain it, e.g. does it have any INCLUDE columns?
Cheers
Leo
Leo
Nothing in life is ever so complicated that with a little work it can't be made more complicated.
August 30, 2010 at 4:33 pm
Leo.Miller (8/30/2010)
Three issues: 1) Never use SELECT COUNT(*)...
Besides being lazy coding, SQL assumes you actually want to count all the columns. Since you only actually need DepositEntryId rather use SELECT COUNT(DepositEntryId) from...
Where the heck did you read that? Using count(*) allows the optimizer to pick whichever columns it wants to perform the count as fast as possible, so at worst it's a tie for first position. That's also why the first query beats the crap out of the second: a covering index is used rather than reading the whole table.
2) Then you have a sub-query; these are very inefficient.
Rather write this as a join.
select COUNT(t.DepositEntryId) from Deposits d
join TempDepositsTemp t on d.DepositEntryId = t.DepositEntryId
I smell a rat here. Most of the time what the programmer really needs to do is check for existence rather than getting the actual count. Can we see the real code rather than what you sent, so we can figure out what you are trying to do?
3) Rather use a temp table than a table variable. Since SQL has no idea how many rows you will insert into a table variable, it treats it a bit like a variable and assumes 1 row only. The parser is happy with this if you only load a few hundred rows into the table variable, but if you add 66,000 rows the query plan is often not appropriate (i.e. useless). I'd also add a primary key to the temp table.
By adding the index hint you have forced SQL to do a clustered index scan, i.e. scan the whole table, whereas in the first query the parser had worked out that going via the IX_Deposits_Item_n_ReceivedItemCount index was more efficient. I'd need to see the details of this index to explain it, e.g. does it have any INCLUDE columns?
Cheers
Leo
It's also why temp tables are usually preferred to table variables. You can add more indexes, and the server can create and use statistics, which gives a much better shot at the most optimized plan.
August 31, 2010 at 1:53 am
Leo.Miller (8/30/2010)
Three issues: 1) Never use SELECT COUNT(*)...
Besides being lazy coding, SQL assumes you actually want to count all the columns. Since you only actually need DepositEntryId rather use SELECT COUNT(DepositEntryId) from...
Um, no. Not at all.
SELECT * is lazy coding and returns all columns.
SELECT COUNT(*) tells SQL that you want to count all the ROWS, not all the columns. It will use the smallest index to scan so that it can get the row count as fast as possible.
Additionally, COUNT(*) and COUNT(DepositEntryId) have different meanings and are not equivalent (unless DepositEntryId is not nullable).
http://sqlinthewild.co.za/index.php/2009/04/14/on-counts/
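A quick way to see the difference (throwaway example, nothing to do with your tables):
declare @t table (SomeId int null)
insert into @t (SomeId) values (1)
insert into @t (SomeId) values (null)
insert into @t (SomeId) values (3)
select COUNT(*) from @t      -- 3: counts rows
select COUNT(SomeId) from @t -- 2: counts non-NULL values of SomeId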
2) Then you have a sub-query; these are very inefficient.
Rather write this as a join.
select COUNT(t.DepositEntryId) from Deposits d
join TempDepositsTemp t on d.DepositEntryId = t.DepositEntryId
Subqueries for an IN are not inefficient. Maybe that was true several versions ago, but not now. In fact, for a simple IN like that, the IN is likely faster than the join.
http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/
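If you want to compare, here's the IN from the original post and its EXISTS equivalent; for a simple semi-join like this they typically get the same plan:
select COUNT(*)
from dbo.Deposits
where DepositEntryId in (select DepositEntryId from dbo.TempDepositsTemp)
select COUNT(*)
from dbo.Deposits d
where exists (select 1 from dbo.TempDepositsTemp t
              where t.DepositEntryId = d.DepositEntryId)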
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
August 31, 2010 at 1:58 am
onurcivi (8/30/2010)
In the first query, the optimizer didn't use the clustered index; instead, it used an index that was created on two integer columns (ItemCount and ItemReceivedCount). I don't understand how the optimizer decided to use the IX_Deposits_Item_n_ReceivedItemCount index, which has nothing to do with the DepositEntryId column.
To do a count(*), SQL looks for the smallest index that it can use. The clustered index, by definition, is the largest index on a table (because it has all the columns at the leaf level).
The clustering key is included in all nonclustered indexes, so the index it chose to scan was likely the smallest one on the table and, because the clustering key was present, it could do both the count and the filter as it scanned the index.
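You can check which index is the smallest with something like this (standard DMVs; adjust the table name as needed):
select i.name as index_name,
       i.type_desc,
       SUM(ps.used_page_count) as used_pages
from sys.indexes i
join sys.dm_db_partition_stats ps
  on ps.object_id = i.object_id
 and ps.index_id = i.index_id
where i.object_id = OBJECT_ID(N'dbo.Deposits')
group by i.name, i.type_desc
order by used_pages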
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
August 31, 2010 at 9:14 am
Thanks for the replies. I tried using a join instead of a sub-query, but I got the same execution plan. I am posting the statements from the actual stored proc; the statements I posted yesterday were attempts to understand/troubleshoot what I was seeing.
Here is what we are trying to do:
ALTER proc [dbo].[MarkDepositsForExportJob]
@ExportJobId uniqueidentifier,
@ExportConfigId uniqueidentifier,
@MarkedDepositCount int output
as
set @MarkedDepositCount = 0
declare @tempDeposits table
(
DepositEntryId int,
DepositId uniqueidentifier,
ItemCount int
)
insert into @tempDeposits (DepositEntryId, DepositId, ItemCount)
select DepositEntryId, DepositId, ItemCount from Deposits
where --- insert business rules conditions here, not having any problems with this select statement
-- If nothing to export then we are finished
select @MarkedDepositCount = count(DepositEntryId) from @tempDeposits
if (@MarkedDepositCount = 0) return
-- Update actual Deposits table now for all exportable deposits (reset FailedExport Status).
update Deposits set ExportJobId = @ExportJobId, FailedExport = null
where DepositEntryId in (select DepositEntryId from @tempDeposits)
*****************************************************
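For context, the proc is called roughly like this (the GUIDs here are just placeholders):
declare @MarkedCount int
exec dbo.MarkDepositsForExportJob
    @ExportJobId = '00000000-0000-0000-0000-000000000000',    -- placeholder
    @ExportConfigId = '00000000-0000-0000-0000-000000000000', -- placeholder
    @MarkedDepositCount = @MarkedCount output
select @MarkedCount as MarkedDepositCount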
Having a clustered index on DepositEntryId is hurting the performance of the update statement. I was also told by the implementation guys that the max number of items in the temp table should be less than 5,000 in the production environment.
Again, thanks for the feedback
September 1, 2010 at 3:25 pm
onurcivi (8/31/2010)
..declare @tempDeposits table(
DepositEntryId int,
DepositId uniqueidentifier,
ItemCount int
)
insert into @tempDeposits (DepositEntryId, DepositId, ItemCount)
select DepositEntryId, DepositId, ItemCount from Deposits
where --- insert business rules conditions here, not having any problems with this select statement
-- If nothing to export then we are finished
select @MarkedDepositCount = count(DepositEntryId) from @tempDeposits
if (@MarkedDepositCount = 0) return
-- Update actual Deposits table now for all exportable deposits (reset FailedExport Status).
update Deposits set ExportJobId = @ExportJobId, FailedExport = null
where DepositEntryId in (select DepositEntryId from @tempDeposits)
*****************************************************
Having a clustered index on DepositEntryId is hurting the performance of the update statement. I was also told by the implementation guys that the max number of items in the temp table should be less than 5,000 in the production environment.
This is very different from what you originally posted.
Is there a particular reason for doing this in 2 steps?
Why couldn't you do the update based directly on the business rules?
In theory clustered indexes improve performance, but you need to pick the correct column to cluster on. Without knowing more about the table, the data and the types of queries, it's difficult to comment on whether this index is appropriate.
I've never heard that temp tables should be less than 5000 rows.
If you really want to use the temp table, why don't you just use @@rowcount to see if it has any rows?
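A rough sketch of the single-step version, combining the two ideas above (the business-rule conditions are left as a placeholder, as in your post):
update Deposits
set ExportJobId = @ExportJobId, FailedExport = null
where 1 = 1 -- replace with the business-rule conditions from your original select
set @MarkedDepositCount = @@ROWCOUNT
if (@MarkedDepositCount = 0) return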
Cheers
Leo
Leo
Nothing in life is ever so complicated that with a little work it can't be made more complicated.