different estimated rows on same index operation?

  • hello sql-friends,

    the question is about index statistics, this is why I chose the performance forum, not the SQL-forum.

    introduction and background

    I had to optimize a simple query (example below). After rewriting it several times I recognized that the estimated row count on the one and same index operation differs depending on the way the query is written.

    Originally the query did a clustered index scan, as the table in production contains a binary column the table is quite large (about 100 GB) and the full table scan takes too much time to execute.

    the question

    Why is the estimated row count different on the same index operation (example will show)? What is the optimizer doing here?

    the example database

    I tried to create a very simplyfied version of my production tables (thanks to my colleague) that show the behaviour.

    -- CREATE THE SAMPLE TABLES

    ----------------------------

    CREATE TABLE dbo.MasterTable(

    MasterId smallint NOT NULL,

    Name varchar(5) NOT NULL,

    CONSTRAINT PK_MasterTable PRIMARY KEY CLUSTERED (MasterId ASC)

    ) ON [PRIMARY]

    GO

    CREATE TABLE dbo.DetailTable(

    DetailId bigint IDENTITY(1,1) NOT NULL,

    MasterId smallint NOT NULL,

    Name nvarchar(50) NOT NULL,

    CreateDate datetime NOT NULL,

    CONSTRAINT PK_DetailTable PRIMARY KEY CLUSTERED (DetailId ASC)

    ) ON [PRIMARY]

    GO

    ALTER TABLE dbo.DetailTable

    ADD CONSTRAINT FK1

    FOREIGN KEY(MasterId) REFERENCES dbo.MasterTable (MasterId)

    GO

    CREATE NONCLUSTERED INDEX IX_DetailTable

    ON dbo.DetailTable( MasterId ASC, Name ASC )

    GO

    -- INSERT SOME SAMPLE DATA

    ----------------------------

    SET NOCOUNT ON

    go

    -- These are some Codes. In our system we always use these codes tosearch for "types" of data.

    INSERT INTO dbo.MasterTable (MasterId, Name)

    VALUES (1, 'N1'), (2, 'N2'), (3, 'N3'), (4, 'N4'), (5, 'N5'), (6, 'N6'), (7, 'N7'), (8, 'N8')

    go

    -- ADD ROWS TO THE DETAIL TABLE

    -- Takes about 1 minute to run

    -- Don't care about the logic, it's just to get a distribution similar to production system

    ----------------------------

    declare @x int = 1

    DECLARE @MasterID INT

    while (@x <= 400000)

    begin

    SET @MasterID = ABS(CHECKSUM(NEWID())) % 8 + 1

    INSERT INTO dbo.DetailTable

    (

    MasterId,

    Name,

    CreateDate)

    VALUES

    (

    CASE WHEN @MasterID IN (1, 3, 4) AND @x % 20 != 0 THEN 2

    WHEN @MasterID IN (5, 6) AND @x % 20 != 0 THEN 7

    WHEN @MasterID = 8 AND @x % 100 != 0 THEN 7

    ELSE @MasterID

    END,

    NEWID(),

    DATEADD(DAY, - ABS(CHECKSUM(NEWID())) % 1000, GETDATE())

    )

    SET @x = @x + 1

    end

    go

    -- DO THE INDEX AND STATISTIC MAINTENANCE

    ----------------------------

    alter index all on dbo.DetailTable reorganize

    alter index all on dbo.MasterTable reorganize

    update statistics dbo.DetailTable WITH FULLSCAN

    update statistics dbo.MasterTable WITH FULLSCAN

    go

    Preparation is done, let's start with the query

    Let's have a look at the statistics first, look at RANGE_HI_KEY=8, there are 489 EQ_ROWS

    -- CHECK THE STATISTICS

    ----------------------------

    dbcc show_statistics ('dbo.DetailTable', IX_DetailTable)

    go

    No we do the query. The first one is the original query I had to optimize.

    Please activate the current execution plan when executing.

    Have a look at the operation "index seek (nonclustered) [DetailTable].[IX_DetailTable]"

    -- ORIGINAL QUERY

    ----------------------------

    SELECT

    d.DetailId

    FROM dbo.DetailTable d

    INNER JOIN dbo.MasterTable m ON d.MasterId = m.MasterId

    WHERE m.Name = 'N8'

    AND d.CreateDate > '20150312 11:00:00'

    go

    -- FORCESEEK

    ----------------------------

    SELECT

    d.DetailId

    FROM dbo.DetailTable d WITH (FORCESEEK)

    INNER JOIN dbo.MasterTable m ON d.MasterId = m.MasterId

    WHERE m.Name = 'N8'

    AND d.CreateDate > '20150312 11:00:00'

    go

    -- Actual: 495, Estimated 50.000

    -- TABLE VARIABLE

    ----------------------------

    DECLARE @MasterId AS TABLE(MasterId SMALLINT )

    INSERT INTO @MasterId (MasterId)

    SELECT MasterID FROM dbo.MasterTable WHERE Name = 'N8'

    SELECT

    d.DetailId

    FROM dbo.DetailTable d WITH (FORCESEEK)

    INNER JOIN @MasterId m ON d.MasterId = m.MasterId

    WHERE d.CreateDate > '20150312 11:00:00'

    go

    -- Actual: 495, Estimated 40.000

    -- TEMP TABLE

    ----------------------------

    CREATE TABLE #MasterId( MasterId SMALLINT )

    INSERT INTO #MasterId (MasterId) SELECT MasterID FROM dbo.MasterTable WHERE Name = 'N8'

    SELECT

    d.DetailId

    FROM dbo.DetailTable d --WITH (FORCESEEK)

    INNER JOIN #MasterId m ON d.MasterId = m.MasterId

    WHERE d.CreateDate > '20150312 11:00:00'

    -- Actual 489, Estimated 489

    DROP TABLE #MasterId

    go

    And of corse the cleanup πŸ™‚

    -- CLEANUP

    ----------------------------

    drop table dbo.DetailTable;

    drop table dbo.MasterTable;

    go

    Analyse and final question(s)

    Please have a look at the operation "index seek (nonclustered) [DetailTable].[IX_DetailTable]"

    The comments in the skript above show you the values I got for estimated and actual row count.

    In our production environment this table has 33 million rows, the estimated rows in the queries above differ from 3 million to 16 million.

    The question is why do these values differ? The statistics are up to date and making an estimation should really be easy, no?

    What is optimizer doing here and why?

    I just would like to understand this.

    And finally: what would your suggestion be to optimize this query if you must avoid the clustered index scan?

    Thank you for reading this πŸ™‚ i'm looking forward for some interesting replies πŸ™‚

    Have a nice week, Wolfgang

  • Try these indexes:

    CREATE INDEX ix_MasterId_Etc ON DetailTable (MasterId, CreateDate) INCLUDE (DetailId)

    CREATE INDEX ix_MasterId_Etc ON MasterTable (Name, MasterId)

    These are a first guess based on the join columns, filters and output columns in your sample query.

    Edit: DetailId is in the clustered index.

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • hi,

    thank you for the answer.

    I know I could optimize the index, but that's not the point. I'd like to understand why optimizer is estimating the way it does πŸ™‚

  • WolfgangE (3/16/2015)


    hi,

    thank you for the answer.

    I know I could optimize the index, but that's not the point. I'd like to understand why optimizer is estimating the way it does πŸ™‚

    This is how I understand it - I'm sure there will be better explanations to follow.

    -- Drop the CreateDate column for simplicity, the estimates are higher

    -- than the actuals by the same factor

    -- 512 is the exact number of rows in DetailTable where MasterId = 8

    SELECT

    d.DetailId

    FROM dbo.DetailTable d -- Est: 512 Act: 512

    INNER JOIN dbo.MasterTable m

    ON d.MasterId = m.MasterId

    WHERE m.MasterId = 8

    -- 50,000 is the average number of rows for each MasterId in DetailTable.

    -- The optimiser can't know what MasterId it will be at run time.

    SELECT

    d.DetailId

    FROM dbo.DetailTable d -- Est: 50,000 Act: 512

    INNER JOIN dbo.MasterTable m

    ON d.MasterId = m.MasterId

    WHERE m.Name = 'N8'

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Take into account that table variable stats is hard coded to be 1 row where temp table has actual row count.

    This would be something to consider as Table variable might make nested loop joins occur and/or carry forward an underestimated row count..

    Catch-all queries done right [/url]
    Gail Shaw's Performance Blog[/url]

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply