What's the best way to count?

  • vk-kirov (10/19/2010)


    Hugo Kornelis (10/19/2010)


    With this additional information, I'd argue that options 2 and 4 are both correct.

    And what do you think about counting uncommitted records, is it such a reliable thing? 🙂

    Depends.

    If you want to count order detail lines, none of the queries is good enough. (See below)

    If you want to count rows, I'd argue that you WANT uncommitted rows as well. Counting rows is soemthing a DBA does, to estimate space needed. Uncommitted rows take space as well. Counting order detail lines is more a business function; uncommitted rows should not be counted there.

    If you want total accuracy, you either have to use one of the snapshot isolation level, or you need to take an explicit table lock. Under RAD_COMMITTED, you can still get incorrect results. For example:

    In query window #1, run:

    CREATE TABLE dbo.QOTD_TEST (ID INT PRIMARY KEY);

    INSERT dbo.QOTD_TEST VALUES(1);

    INSERT dbo.QOTD_TEST VALUES(2);

    INSERT dbo.QOTD_TEST VALUES(3);

    BEGIN TRAN;

    INSERT dbo.QOTD_TEST VALUES(4);

    INSERT dbo.QOTD_TEST VALUES(5);

    In query window #2, run:

    SELECT COUNT(*) FROM dbo.QOTD_TEST;

    Execution will start, but not finish as it waits for locks from window #1 to be released.

    Now go back to window #1 and run:

    DELETE dbo.QOTD_TEST WHERE ID IN (2, 3, 4);

    COMMIT TRAN;

    The query in window #2 will now finish and show an incorrect result: 4. When the SELECT COUNT query started, there were 3 committed and 2 uncommitted rows; when it ended, there were 2 rows. This answer is arguably even less correct than the numbers returned by the other queries!


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Thanks for this one - lots of stuff I didn't know there.

    Does anyone know if there's a similar fast count method for views? I have a health check procedure which counts the number of records in a variety of tables and views (to check data load success and ensure data is there for the next stage in the process). The table counts are fine (biggest table has around a million records), but the views are terribly complex and counting them with SELECT COUNT(*)... takes ages.

    Cheers,

    Dave.

  • dave.farmer (10/19/2010)


    Does anyone know if there's a similar fast count method for views?

    Only if they are indexed. Or if the relationships between the underlying tables are such that you can derive the number of rows in the view from the number of rows in the tables.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo Kornelis (10/19/2010)


    Only if they are indexed. Or if the relationships between the underlying tables are such that you can derive the number of rows in the view from the number of rows in the tables.

    Sadly no, on both counts :crying:

    Oh well... thanks anyway.

  • I tried the four queries on a table that has 36k rows, using as parameter the amount of rows returned by the count(*) query, the fourth option is not accurate at all, actually it displays about 400k rows (this table have no modifications at all), and compared to the count(*) one, it less efficient as well.

    The only one that gave the most reliable result and was fast, was the count(*) option, followed by the third query. That's what I've find for this table, it might change for bigger tables 🙂

    Frank.

  • Shirley the fastest query to enter at a console is #1. Also, being the shortest, it would be the most reliable to run the first time without some sort of unintended syntax error.

  • I have been really getting beat up by the QoTD lately, but I am learning a lot. The code from #4 went right into my code library.

    The follow up discussion here is also very good. Yes, Hugo you have developed a reputation here. Whenever I get a question incorrect, it is usually your follow on explanation that helps me find some understanding. Thanks.

  • Hugo Kornelis (10/19/2010)


    The query in window #2 will now finish and show an incorrect result: 4. When the SELECT COUNT query started, there were 3 committed and 2 uncommitted rows; when it ended, there were 2 rows.

    Ah! Indeed, I forgot about this behavior. Thanks to Microsoft for introducing snapshot isolation in SQL Server 2005 🙂

  • I ran query 1 and query 4 on two tables. One table had 183 rows and the other had 9,886,862 rows. On the first table, query one had a cost of 52% and query 4 had a cost of 48%. Not much of a difference. However, when I ran it on a much larger table, query 1 had a cost of 100% and query 2 had a cost of 0%. Although I my initial answer was query 1, I am in agreement with others that query 4 is the way to go. Despite the difference in cost, both queries took less than a second to run on both tables.

  • I went with the obvious and straightforward answer; I expected to be wrong.

    "Most reliable", "best way to count", etc. are somewhat subjective.

    If volumes of code are deployed using cool-counting-trick #4 and the engineers further optimize the implementation of less-cool-but-semantically-correct method #1, the cool trick becomes the underperforming cruft of legacy code.

    I would rather exercise the discipline to use count(*) when I want a count even if it costs more than to use alternative methods that return "something like count" that isn't a true count. ex: We don't put [ChildrenRows] on a parent record and read from it with confidence that the mechanisms that keep it up to date are infallible (at least I hope we know that).

    If I am interested in knowing the approximate number of rows in a table that is too large to count() I would use sp_spaceused because it's much shorter to type than #4.

    btw, if you do deploy #4 in production code, please leave me a comment such as

    -- this performs better than: select count(*) ....

    so when I inherit the maintenance of your cleverness I don't have to spend time tracking down how that query works. thanks.

  • Hugo Kornelis (10/19/2010)


    I liked the question - allthough with the exception of option 1 on really large (or busy) tables, all options will be blindingly fast, and you won't run this type of query often enough that a few microseconds more or less really matters.

    I gave the right answer, but not for the reasons listed. The question specifically asked for "fastest and most reliable", so I focused on those aspects of the queries.

    Option 1 scans the complete table (or one of the indexes). Absolutely reliable, but not fast. This answer can only be correct if all others are not.

    Option 3 is not reliable. The Books Online entry for sys.partitions specifically describes the "rows" column as an "approximate number of rows in this partition". (The cast to float of the result and the cast to int of object_id are not wrong, though they are of course weird).

    Options 2 and 4 are both reliable. (The explanation of the question says sysindexes can be unreliable. This is not supported by the referenced sources. The sysindexes system table was unreliable in SQL 2000 and before, but this has changed in SQL Server 2005 - here is a reference. Also, Books Online does not contain the words "approximate", "inaccurate", or anything similar for these views, as it does for sys.partitions.

    Both the question explanation and the description in Books Online of sys.sysindexes mention that the sys.sysindexes compatibility view will be removed in future versions of SQL Server. This is very relevant for real work - but it was not a deciding factor in this question, so this should not be used to decide between options 2 and 4.

    The same page in Books Online also contains this warning: "Partitioned tables and indexes are not fully supported in this view; use the sys.indexes catalog view instead." The table used in the example code is not partitioned, but I interpreted the code as examples for a randomly chosen sample table, and the question about the method in general. Since method 2 does not work for partitioned tables, it's not reliable.

    Method 4 and method 1 are both reliable. Method 1 scans the table (or an index); method 4 uses systemm views - this one is much faster. That's why I chose method 4.

    Hi Hugo,

    Did you verify how method 4 works under high concurrency? Can we get somewhat incorrect totals?

  • Really nice question & good knowledge sharing conversation.

    Thanks

  • Hello!

    First of all - EXCELLENT QUESTION!

    I got it right, however I followed the following reasoning:

    1. As most have highlighted - the Query# 1 is obviously "accurate", but does a table scan and is therefore slower

    2. Query# 4 uses system views and per BOL: "The counts on which the output is based are cached in memory or stored on disk in various system tables."

    Hugo: Thank-you very much for your sound insight on the internals of SQL Server.

    You simply blasted the foundations of my reasoning# 1 by proving that SELECT COUNT(*) is indeed inaccurate.

    Hats off to you as well.

    At the end of the day, all I can say is - it's been fun learning today!

    Thanks & Regards,
    Nakul Vachhrajani.
    http://nakulvachhrajani.com

    Follow me on
    Twitter: @sqltwins

  • Great discussion. I think I've learned more from reading the discussion than I did when researching the question.

    For those concerned with maintenance nightmares or the difficulty in writing query 4 versus Select Count(*) I would suggest putting query for in a stored procedure or a function. Something like

    sproc_RecordCount '[schema].TableName'

    Not much more difficult to write than Select Count(*).

    For maintenance just change the procedure or function as needed and everywhere your code is using it continues working (assuming change does not add any required parameters and maintains backward compatibility).

    Dave

  • Terrific question, thanks.

    It's simple tasks like this that I think MS needs to work on. Judging by this question, most of us (yours truly included) can't even get an accurate count of rows in a table. That's just nuts! MS can't simply write us a function so we don't have to sweat the small stuff like this?

    I think I know why they can get away with it. It's because Oracle's even harder to use. :hehe:

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

Viewing 15 posts - 16 through 30 (of 57 total)

You must be logged in to reply to this topic. Login to reply