What's the best way to count?

  • Dave62

    SSCertifiable

    Points: 6377

    Comments posted to this topic are about the item What's the best way to count?

  • Oleg Netchaev

    SSCertifiable

    Points: 5268

    This is a very good question, thank you Dave. The key here is the index_id: if the table has a clustered index then there will be a record in the stats view with index_id = 1. If the table is a heap then it will have index_id = 0. There is no way to have both index_id 0 and 1, so index_id < 2 or like you have it index_id = 0 or index_id = 1 should suffice. Grouping is important because it possible to have partitioned table.

    Oleg

  • malleswarareddy_m

    SSCertifiable

    Points: 5847

    Good question.keep on posting this type of questions.

    Malleswarareddy
    I.T.Analyst
    MCITP(70-451)

  • David.Poole

    SSC Guru

    Points: 75109

    I can't see anything that says that the figure returned will be reliable. Fast yes, blindingly so. Reliable.....not so sure.

  • vk-kirov

    SSCertifiable

    Points: 7686

    which query is the fastest and most reliable for returning an accurate record count?

    (One of) the fastest? Yes. The most reliable? I don't think so, because that query counts both committed and uncommitted records, while the "SELECT COUNT(*)" statement counts only committed records.

    Here is an example.

    1) Create a table:

    CREATE TABLE dbo.QOTD_TEST (ID INT);

    2) Insert several records into the table, don't commit them:

    BEGIN TRANSACTION;

    INSERT dbo.QOTD_TEST VALUES(1);

    INSERT dbo.QOTD_TEST VALUES(2);

    INSERT dbo.QOTD_TEST VALUES(3);

    3) In another query window (e.g. in another connection) run queries 2-4:

    SELECT CONVERT(bigint, [rows])

    FROM sys.sysindexes

    WHERE id = OBJECT_ID('dbo.QOTD_TEST') AND indid < 2

    SELECT CAST(p.[rows] AS float)

    FROM sys.tables AS tbl

    INNER JOIN sys.indexes AS idx ON idx.[object_id] = tbl.[object_id] And idx.index_id < 2

    INNER JOIN sys.partitions AS p ON p.[object_id] = CAST(tbl.[object_id] AS int) And p.index_id=idx.index_id

    WHERE ((tbl.name = N'QOTD_TEST' AND SCHEMA_NAME(tbl.schema_id) = 'dbo'));

    SELECT SUM (row_count)

    FROM sys.dm_db_partition_stats

    WHERE object_id = OBJECT_ID('dbo.QOTD_TEST') AND (index_id=0 or index_id=1);

    All these queries return the value 3.

    4) After that, run the 1st query:

    SELECT COUNT(*) FROM dbo.QOTD_TEST;

    If the database READ_COMMITTED_SNAPSHOT option is set to ON, this query will immediately return the value 0. If READ_COMMITTED_SNAPSHOT = OFF, this query will wait until the locks are removed.

    5) In 1st query window, roll back the transaction and drop the table:

    ROLLBACK TRANSACTION;

    DROP TABLE dbo.QOTD_TEST;

    I think that the 1st query is the most reliable, while others are the fastest.

  • Bob JH Cullen

    SSCrazy

    Points: 2082

    I chose #1 because all the others appeared to be referencing indexes, and there was nothing in the question that stated the table had an index.

    Maybe I'm missing something here...?

  • Hugo Kornelis

    SSC Guru

    Points: 64645

    I liked the question - allthough with the exception of option 1 on really large (or busy) tables, all options will be blindingly fast, and you won't run this type of query often enough that a few microseconds more or less really matters.

    I gave the right answer, but not for the reasons listed. The question specifically asked for "fastest and most reliable", so I focused on those aspects of the queries.

    Option 1 scans the complete table (or one of the indexes). Absolutely reliable, but not fast. This answer can only be correct if all others are not.

    Option 3 is not reliable. The Books Online entry for sys.partitions specifically describes the "rows" column as an "approximate number of rows in this partition". (The cast to float of the result and the cast to int of object_id are not wrong, though they are of course weird).

    Options 2 and 4 are both reliable. (The explanation of the question says sysindexes can be unreliable. This is not supported by the referenced sources. The sysindexes system table was unreliable in SQL 2000 and before, but this has changed in SQL Server 2005 - here is a reference. Also, Books Online does not contain the words "approximate", "inaccurate", or anything similar for these views, as it does for sys.partitions.

    Both the question explanation and the description in Books Online of sys.sysindexes mention that the sys.sysindexes compatibility view will be removed in future versions of SQL Server. This is very relevant for real work - but it was not a deciding factor in this question, so this should not be used to decide between options 2 and 4.

    The same page in Books Online also contains this warning: "Partitioned tables and indexes are not fully supported in this view; use the sys.indexes catalog view instead." The table used in the example code is not partitioned, but I interpreted the code as examples for a randomly chosen sample table, and the question about the method in general. Since method 2 does not work for partitioned tables, it's not reliable.

    Method 4 and method 1 are both reliable. Method 1 scans the table (or an index); method 4 uses systemm views - this one is much faster. That's why I chose method 4.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo Kornelis

    SSC Guru

    Points: 64645

    Bob Cullen-434885 (10/19/2010)


    I chose #1 because all the others appeared to be referencing indexes, and there was nothing in the question that stated the table had an index.

    Maybe I'm missing something here...?

    If the table has a clustered index, information is stored in the system views as the index with index id 1.

    If the table has no clustered index, information is stored in the system views as the index with index id 0. Not a real index, but apprently Microsoft decided to take a practical approach and not use specific different system views for heaps (tables without clustered index).


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Bob JH Cullen

    SSCrazy

    Points: 2082

    Well, at least I got something right - I had missed something!!

    Thanks, Hugo 🙂

  • vk-kirov

    SSCertifiable

    Points: 7686

    Hugo Kornelis (10/19/2010)


    The same page in Books Online also contains this warning: "Partitioned tables and indexes are not fully supported in this view; use the sys.indexes catalog view instead." ... Since method 2 does not work for partitioned tables, it's not reliable.

    The 2nd query works just fine for partitioned tables 🙂

    That BOL page states what columns are not supported in case of partitioned tables: first, root, groupid, dpages, reserved, used, rowcnt, rowmodctr, and FirstAIM. The column 'rows' is not in this list, and it's supported.

  • Hugo Kornelis

    SSC Guru

    Points: 64645

    vk-kirov (10/19/2010)


    Hugo Kornelis (10/19/2010)


    The same page in Books Online also contains this warning: "Partitioned tables and indexes are not fully supported in this view; use the sys.indexes catalog view instead." ... Since method 2 does not work for partitioned tables, it's not reliable.

    The 2nd query works just fine for partitioned tables 🙂

    That BOL page states what columns are not supported in case of partitioned tables: first, root, groupid, dpages, reserved, used, rowcnt, rowmodctr, and FirstAIM. The column 'rows' is not in this list, and it's supported.

    I missed that - I stopped reading after seeing that quote. Thanks, vk-kirov!

    With this additional information, I'd argue that options 2 and 4 are both correct. The unreliability of option 2 stopped after SQL 2000 (and the question specifically mentions SQL 2005 and later), and being deprecated was nog mentioned as a deciding factor in the question.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Koen Verbeeck

    SSC Guru

    Points: 258907

    Option 1 isn't that reliable. For really really large tables (e.g. number of records > 2^31-1) it will produce an error. To be 100% sure, COUNT_BIG should have been used.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck

    SSC Guru

    Points: 258907

    Oh yeah I forgot: great question! 😀

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • free_mascot

    One Orange Chip

    Points: 27168

    One more challenging question, thank you Dave.

    ---------------------------------------------------
    "Thare are only 10 types of people in the world:
    Those who understand binary, and those who don't."

  • vk-kirov

    SSCertifiable

    Points: 7686

    Hugo Kornelis (10/19/2010)


    With this additional information, I'd argue that options 2 and 4 are both correct.

    And what do you think about counting uncommitted records, is it such a reliable thing? 🙂

Viewing 15 posts - 1 through 15 (of 58 total)

You must be logged in to reply to this topic. Login to reply