Quicker Way to Get Accurate Row Count

  • We are currently using SQL Server 2014 (enterprise) for our source system and data warehouse.  Our ETL has a row count validation that runs at the end of each ETL execution.  It ties out the number of records in our warehouse, staging environment, and source system for each fact table.  For many fact tables, we can achieve this with a simple row count from source/staging/warehouse (i.e. the fact table grain is the same as a source table grain).  The source system has a constant flow of data being inserted/updated/deleted, so we wrap the rowcount statement and the extraction step in a single transaction and use snapshot isolation.

    When we started this process, we got the row counts with a simple SELECT COUNT(*) FROM XXX.  However, some of these tables are quite large (350 million + rows), so the row count queries were starting to take a long time to complete.  In searching for a quicker way to get accurate row counts, I came across the following blog post: https://blogs.msdn.microsoft.com/martijnh/2010/07/15/sql-serverhow-to-quickly-retrieve-accurate-row-count-for-table/.  I thought using method #4 would be my magic bullet - the query looks like this:
    SELECT SUM (row_count)
    FROM sys.dm_db_partition_stats
    WHERE object_id=OBJECT_ID(‘Transactions’)   
    AND (index_id=0 or index_id=1);

    However, I discovered that the sys tables don’t honor isolation levels... As a result, this query doesn't obey snapshot isolation and the resulting row count will reflect I/U/D operations that occurred after the snapshot isolated transaction began.  Does anyone have a different suggestion to get accurate row counts, within snapshot isolation, that would be more performant than SELECT COUNT(*) FROM XXX?

  • The first thing that came to my mind was sys.indexes, but if the system tables don't honor isolation levels (which I didn't know) then it wouldn't work.

    Do you have a CI or a heap?  If it's a CI that's narrow, you might get better performance if you create an NCI with the same key columns as the CI and no included columns.  This may sound counter-intuitive and may sounds like a duplicate index, but it really isn't.  Give it a try and see what it does to your reads.  You'll end up reading a lot less pages.

  • Hi Ed,
    Thanks for the response.  The source tables all have clustered indexes.  I don't have a lot of flexibility when it comes to index tweaking on the source system; those tables have been optimized by our front end team.  But, I can try out your suggestion on our staging tables - that would at least solve part of the equation.  Let me know if any other ideas come to mind!

  • Here's a method from Pinal Dave that I've found useful:
    https://blog.sqlauthority.com/2017/05/25/sql-server-simple-query-list-size-table-row-counts/

  • Thanks caffeinated - that's approach #3 in the original link I sent.  Unfortunately, it returns an approximate row count - it will not be exact, particularly for tables that have a lot of activity.  As a result, it doesn't meet my requirements.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply