Join-cardinality estimation BUG (up to 50% error) with evenly distributed data in histogram?

  • Hi All!

    Can anyone (who is not drunk after the holidays 🙂) run the attached script on a test server and post the result here?

    The script will create 2 simple tables with an FK relation and populate them evenly with 100 and 500,000 rows. So the join should return 500,000 rows, but it looks like the estimates generated by the QP can be off by up to 50%. In my case the estimated row count is 255,000, which is half of what's expected. It can be fixed by inserting ONE record into the smaller table, and that is weird. Or maybe someone can point me to where I made a mistake.
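
    Roughly, the script does the following (a minimal sketch; the round-robin assignment below is just one way to get the even spread, and the actual attachment may differ in details):

    -- Sketch: 100 branches, 500,000 customers spread evenly (5,000 per branch),
    -- FK from Customer to Branch, statistics refreshed with FULLSCAN.
    CREATE TABLE dbo.Branch
    (
        BranchID   INT         NOT NULL PRIMARY KEY,
        BranchName VARCHAR(50) NOT NULL
    );

    CREATE TABLE dbo.Customer
    (
        CustomerID INT NOT NULL PRIMARY KEY,
        BranchID   INT NOT NULL REFERENCES dbo.Branch (BranchID)
    );

    -- Branches 1..100
    INSERT INTO dbo.Branch (BranchID, BranchName)
    SELECT n, 'Branch ' + CAST(n AS VARCHAR(10))
    FROM (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
          FROM sys.all_columns) AS x
    WHERE n <= 100;

    -- 500,000 customers, BranchID assigned round-robin so the distribution is perfectly even
    INSERT INTO dbo.Customer (CustomerID, BranchID)
    SELECT n, (n - 1) % 100 + 1
    FROM (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
          FROM sys.all_columns AS a CROSS JOIN sys.all_columns AS b) AS x
    WHERE n <= 500000;

    UPDATE STATISTICS dbo.Branch   WITH FULLSCAN;
    UPDATE STATISTICS dbo.Customer WITH FULLSCAN;

    -- Run with "Include Actual Execution Plan" and compare Rows vs EstimateRows on the join
    SELECT C.CustomerID, B.BranchID
    FROM dbo.Customer C
    INNER JOIN dbo.Branch B ON B.BranchId = C.BranchId;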

    So I'm interested in the Rows and EstimateRows values from the first record of both results, plus the version of your server. If this bug is easily reproducible on servers other than mine, I will consider reporting it to MS.

    Thanks!


    Alex Suprun

  • Ok, I have a question then...

    How does SQL Server estimate how many records there will be after joining 2 tables?

    And why, for such a simple select on relatively small tables (100 and 500,000 rows, with an FK and PKs), are the estimates so bad?

    SELECT C.CustomerID, B.BranchID
    FROM dbo.Customer C
    INNER JOIN dbo.Branch B ON B.BranchId = C.BranchId
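
    Just to spell out what I expect: every customer's BranchId matches a branch and there are 5,000 customers per branch, so even the simplest textbook equi-join model (selectivity = 1 / MAX(distinct values on either side), which may or may not be exactly what the QP uses) lands on the full table size:

    -- Back-of-the-envelope estimate under the classic uniformity assumption
    -- (an assumption about the model, not a statement of what the optimizer actually computes):
    --   rows(Customer) * rows(Branch) * selectivity
    --   = 500,000 * 100 * (1/100) = 500,000 rows, not 255,000
    SELECT 500000 * 100 / 100 AS expected_join_rows;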

    But the estimated number of rows becomes perfect after inserting just one record into one of the tables?

    INSERT INTO dbo.Branch (BranchID, BranchName)
    SELECT 0, 'test'

    UPDATE STATISTICS dbo.Branch WITH FULLSCAN

    SELECT C.CustomerID, B.BranchID
    FROM dbo.Customer C
    INNER JOIN dbo.Branch B ON B.BranchId = C.BranchId

    I can understand that statistics and estimates are not perfect, but why are they off so drastically?


    Alex Suprun

  • I don't think the estimates are that far off. A difference of 50% (or two to one) does not generally produce a bad execution plan. That's all you need from the estimates: a valid execution plan.

    I've heard a few presentations by Grant Fritchey in which he said that he really doesn't worry about the estimates unless they are off by ten to one (or even 100 to one).

    The manner in which the statistics get updated is quite interesting. If you have more than 1,500 records in a table, the statistics are not updated unless 20% or more of the records change. So, it is quite possible to add a single record and get new statistics for a table. In this case for instance, you could add another 99,999 records to the table with 500,000 records and the statistics would not update. Add one more record, and voila the statistics are updated! It's a cool party trick. 🙂
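
    If you want to watch the counters, something along these lines shows the last statistics update time and the modification counter (sys.sysindexes is a deprecated compatibility view, and matching indid to stats_id is an approximation, but it's good enough for a demo):

    -- Last update time and approximate modification counter for each statistics object
    SELECT  OBJECT_NAME(s.object_id)            AS table_name,
            s.name                              AS stats_name,
            STATS_DATE(s.object_id, s.stats_id) AS stats_last_updated,
            si.rowmodctr                        AS rows_modified_since_update
    FROM    sys.stats AS s
    JOIN    sys.sysindexes AS si
            ON si.id = s.object_id AND si.indid = s.stats_id
    WHERE   s.object_id IN (OBJECT_ID('dbo.Branch'), OBJECT_ID('dbo.Customer'));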

  • David Moutray (12/28/2011)


    I don't think the estimates are that far off. A difference of 50% (or two to one) does not generally produce a bad execution plan.

    I cannot agree with you here.

    A difference of 50% is what? If it's an OLTP database and queries are accessing hundreds of rows, then I assume a 50% mistake is acceptable. But what about a huge DW? Instead of the real 10 million records, a 50% mistake will give us an estimate of 5 million.

    That's all you need from the estimates: a valid execution plan.

    ... and the amount of memory granted for this query would be, let's say, 1 gigabyte instead of the 2 required, and voilà, 2 GB are spilled to tempdb for the hash/sort operation with expensive physical reads/writes. But the plan is completely valid! In this case I'm not sure what a valid plan is: the one which does all the correct operations in perfect order, or the one which performs fast? With tempdb operations you may not get good performance even if the plan looks valid.
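
    And it's easy enough to see the effect of a low estimate on the grant while such a query runs; the requested/ideal numbers below are driven by the cardinality estimates:

    -- Memory requested, granted and ideally needed for currently executing queries
    SELECT  session_id,
            requested_memory_kb,
            granted_memory_kb,
            ideal_memory_kb,
            used_memory_kb
    FROM    sys.dm_exec_query_memory_grants;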

    I've heard a few presentations by Grant Fritchey when he has said that he really doesn't worry about the estimates unless they are off by ten to one (or even 100 to one).

    I would like to hear that too. Again, for hundreds of records it may be OK, but for millions or billions? Estimates off 100 to 1 and valid plans generated? I can hardly believe it. What is the purpose of such estimates then? It would be better to disable statistics altogether and put hints in every query.

    I can understand some discrepancy when I hit a limitation of their math model, like putting filters on correlated columns or running some complicated queries, but in my case it's just 2 simple tables where the estimate must equal the size of the bigger table, yet for some reason it doesn't. I will try to do more investigation on this.

    The manner in which the statistics get updated is quite interesting. If you have more than 1,500 records in a table, the statistics are not updated unless 20% or more of the records change. So, it is quite possible to add a single record and get new statistics for a table. In this case for instance, you could add another 99,999 records to the table with 500,000 records and the statistics would not update. Add one more record, and voila the statistics are updated! It's a cool party trick. 🙂

    Not my case: I do UPDATE STATISTICS WITH FULLSCAN each time, so it's not about whether autostats runs or not.


    Alex Suprun

  • Per BOL:

    When a query is first compiled, if the optimizer needs a particular statistics object, and that statistics object exists, the statistics object is updated if it is out of date. When a query is executed and its plan is in the cache, the statistics the plan depends on are checked to see if they are out of date. If so, the plan is removed from the cache, and during recompilation of the query, the statistics are updated. The plan also is removed from the cache if any of the statistics it depends on have changed.

    SQL Server 2008 determines whether to update statistics based on changes to column modification counters (colmodctrs).

    A statistics object is considered out of date in the following cases:

    · If the statistics is defined on a regular table, it is out of date if:

    o The table size has gone from 0 to >0 rows (test 1).

    o The number of rows in the table when the statistics were gathered was 500 or less, and the colmodctr of the leading column of the statistics object has changed by more than 500 since then (test 2).

    o The table had more than 500 rows when the statistics were gathered, and the colmodctr of the leading column of the statistics object has changed by more than 500 + 20% of the number of rows in the table when the statistics were gathered (test 3).
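
    Plugging in the row counts from this thread, the thresholds work out to:

    -- Out-of-date thresholds for the two tables in this thread:
    --   Branch (100 rows)       -> test 2: more than 500 modifications
    --   Customer (500,000 rows) -> test 3: more than 500 + 20% of 500,000 modifications
    SELECT 500                  AS branch_threshold_rows_modified,
           500 + 0.20 * 500000  AS customer_threshold_rows_modified;  -- 100,500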

    I hope it explains your scenario.

    For More: http://msdn.microsoft.com/en-us/library/dd535534(SQL.100).aspx

  • Dev (12/29/2011)


    I hope it explains your scenario.

    Unfortunately not. BOL is the first place where I search for explanations, but it usually doesn't provide much detail.

    But again, this is not my scenario: the statistics are up to date with a full 100% scan. My question was more about how the server utilizes statistics rather than when it decides to update them.
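
    The statistics themselves are freshly rebuilt with FULLSCAN; I'm checking them with something like this (the statistics name below is a placeholder; pick the real ones from sys.stats, and on Customer.BranchId it may well be an auto-created _WA_Sys_... statistic):

    -- List the statistics objects on the two tables
    SELECT OBJECT_NAME(object_id) AS table_name, name AS stats_name, stats_id
    FROM   sys.stats
    WHERE  object_id IN (OBJECT_ID('dbo.Branch'), OBJECT_ID('dbo.Customer'));

    -- Header, density vector and histogram for one of them ('PK_Branch' is a placeholder name)
    DBCC SHOW_STATISTICS ('dbo.Branch', 'PK_Branch')
        WITH STAT_HEADER, DENSITY_VECTOR, HISTOGRAM;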


    Alex Suprun

    well, first, the two queries are not the same; one is a straight insert of static values for every row in the table, or 500K.

    the other is a select based on an inner join.

    it's pretty clear that, based on statistics, for the second query the system is estimating that 245K rows will not be matched in the inner join, hence the 255K rows, right?

    I'm not sure why you think they should have the same number of rows. the inner join is definitely going to affect the number of rows;

    i'm betting if you look at the actual execution plan, the # of rows will show the same differences; 255K on the join, and 500K on the insert.

    if you change the join to a left outer join, i'd expect that the # of rows would be 500K or pretty close to that.
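
    i.e. something along these lines:

    SELECT C.CustomerID, B.BranchID
    FROM dbo.Customer C
    LEFT OUTER JOIN dbo.Branch B ON B.BranchId = C.BranchId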

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • ok, now i am confused; i thought the insert was SELECT 0, 'test'

    from Customer....but it's just SELECT 0, 'test'

    for that, i'd expect an estimate of one row....

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell (12/29/2011)


    ok, now i am confused; i thought the insert was SELECT 0, 'test'

    from Customer....but it's just SELECT 0, 'test'

    for that, i'd expect an estimate of one row....

    Sorry, I didn't make it clear from the beginning. Both plans are for the same SELECT query. The only difference is the extra new record. I updated my previous post with the plans.

    And yes, I know that if I change the INNER JOIN to a LEFT JOIN it will fix the estimates. But in some cases that's not a good solution, because these joins are not equivalent when you have a filter on the second table.


    Alex Suprun

    Actually, INNER JOIN and LEFT JOIN will produce the same result with a WHERE clause on the second table, but in this case internally SQL will do an INNER JOIN instead of a LEFT one, so the estimates become wrong again, as they were with the INNER JOIN.
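
    For example (BranchName = 'test' here is just for illustration):

    -- The WHERE predicate on B rejects the NULL-extended rows, so logically this is
    -- an inner join again; the optimizer simplifies it to one and the estimate drops back.
    SELECT C.CustomerID, B.BranchID
    FROM dbo.Customer C
    LEFT JOIN dbo.Branch B ON B.BranchId = C.BranchId
    WHERE B.BranchName = 'test'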


    Alex Suprun

  • Alexander Suprun (12/29/2011)


    Dev (12/29/2011)


    I hope it explains your scenario.

    Unfortunately not. BOL is the first place where I search for explanations, but it usually doesn't provide much detail.

    But again, this is not my scenario: the statistics are up to date with a full 100% scan. My question was more about how the server utilizes statistics rather than when it decides to update them.

    I thought you were stumped on the following...

    Alexander Suprun (12/28/2011)


    ...

    But the estimated number of rows becomes perfect after inserting just one record into one of the tables?

    INSERT INTO dbo.Branch (BranchID, BranchName)
    SELECT 0, 'test'

    UPDATE STATISTICS dbo.Branch WITH FULLSCAN

    SELECT C.CustomerID, B.BranchID
    FROM dbo.Customer C
    INNER JOIN dbo.Branch B ON B.BranchId = C.BranchId

    ...
