Data Collector Performance Issue

  • I am running Data Collector from my SQL 2008 R2 and SQL 2012 servers to a SQL 2012 Data Collector Warehouse database.

    I noticed that it's getting unbearably slow when I try to drill down to get query detail information (after you look at the summary report and it lists the top 10 queries, then click on a particular query).

    It seems that one particular part of the execution plan of the snapshots.rpt_query_stats procedure seems to be the culprit. (see screenshot at Query Plan)

    The collector warehouse database is currently at about 60 GB and the table for the query stats has about 30 GB of data.

    I am only keeping about 7 days worth of data. I am hesitant to reduce the amount of data so I would like to know if there is a fix for this issue before reducing the data retention.

    Thank you,

    John

  • Ok, no one got back to me on this and I was finally able to pinpoint the problem. It's definitely an issue with SQL Server 2012's version of the data collector warehouse DB schema (as of version 11.0.2325 at least).

    I compared the schema of the 2008 R2 and the 2012 warehouse databases. I found out that the snapshots.rpt_query_stats procedure definition has changed, for the worse.

    If you look at around line 73, you'll see that the FROM definition changed

    SQL 2008 R2

    FROM snapshots.query_stats AS stat

    SQL 2012

    FROM

    (

    SELECT *, DENSE_RANK() OVER (ORDER BY plan_handle, creation_time) AS plan_number

    FROM snapshots.query_stats

    ) AS stat

    Cost of query relative to batch in SQL 2008R2 is 57%.

    Cost of query relative to batch in SQL 2012 is 100%.

    The results of this is that when I click on a query in the query stats report to get the detail information, the query that took only a few seconds on my 60 GB SQL 2008 R2 database now takes about 8 minutes on my SQL 2012 database. Even when I reduce the data retention to 3 days dropping the database size to about 10 GB, the query is still taking 1.5 minutes.

    Can someone please confirm my results?

    Can someone also please tell me how to get this reported as a bug for 2012? (or if it's already reported as a bug)

  • Could be an indexing issue as well. Not using MDW so only a guess.

  • Yeah, it is doing a scan. I don't want to touch it because it's a Microsoft database schema. I'd like to know how to report this in if someone can confirm it's actually an issue and I'm not mis-diagnosing this.

    Thanks,

    John

  • May be MS, but that shouldn't stop you from adding an index if it helps. I worked ith PeopleSoft developers and added several indexes to PeopleSoft systems to support system performance. We didn't wait for Oracle to make those changes (and sometimes the did and others they didn't). We simply supported them as enhancements if they didn't.

  • I captured the query that is causing the issue and ran it individually so I could get the statistics IO and the actual execution plan.

    Here is the query I ran (with the server name replaced):

    SET STATISTICS IO ON

    exec snapshots.rpt_query_stats

    @instance_name=N'MyServer'

    ,@end_time='2012-09-19 18:20:02'

    ,@time_window_size=240

    ,@sql_handle_str=N'0x030009007c46a5375dfbc80082a000000100000000000000'

    ,@statement_start_offset=4852

    ,@statement_end_offset=9230

    Here is what I'm seeing. I don't see any quick indexing fix for this. Can you take a look and see what you think?

    I/O Stats:

    Table 'snapshots_internal'. Scan count 1, logical reads 712, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'snapshot_timetable_internal'. Scan count 1, logical reads 28, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'snapshots_internal'. Scan count 35, logical reads 110, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'snapshot_timetable_internal'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#A6609559'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'source_info_internal'. Scan count 0, logical reads 32, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'snapshot_timetable_internal'. Scan count 0, logical reads 32, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'snapshots_internal'. Scan count 0, logical reads 678, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'query_stats'. Scan count 1, logical reads 326201, physical reads 2, read-ahead reads 3690, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Execution Plan Screenshot:

    Sort Warning Screenshot from Execution Plan:

    Clustered Index Scan from Execution Plan:

  • My fix to this was just to replace the FROM statement that has the ranking function with the original FROM statement. I know it was probably a bug fix, but it was a matter of potentially having the plan count wrong and being able to generate the report in 5 seconds vs having the correct plan count right and waiting 8 minutes.

    Or, here is the 2008 R2 version of the procedure that you could just run on your 2012 Management Data Warehouse database (make sure you make a copy of the original just in case):

    ALTER PROCEDURE [snapshots].[rpt_query_stats]

    @instance_name sysname,

    @start_time datetime = NULL,

    @end_time datetime = NULL,

    @time_window_size smallint,

    @time_interval_min smallint = 1,

    @sql_handle_str varchar(130),

    @statement_start_offset int,

    @statement_end_offset int

    AS

    BEGIN

    SET NOCOUNT ON;

    -- @end_time should never be NULL when we are called from the Query Stats report

    -- Convert snapshot_time (datetimeoffset) to a UTC datetime

    IF (@end_time IS NULL)

    SET @end_time = CONVERT (datetime, SWITCHOFFSET (CAST ((SELECT MAX(snapshot_time) FROM core.snapshots) AS datetimeoffset(7)), '+00:00'));

    IF (@start_time IS NULL)

    BEGIN

    -- If time_window_size and time_interval_min are set use them

    -- to determine the start time

    -- Otherwise use the earliest available snapshot_time

    IF @time_window_size IS NOT NULL AND @time_interval_min IS NOT NULL

    BEGIN

    SET @start_time = DATEADD(minute, @time_window_size * @time_interval_min * -1.0, @end_time);

    END

    ELSE

    BEGIN

    -- Convert min snapshot_time (datetimeoffset) to a UTC datetime

    SET @start_time = CONVERT (datetime, SWITCHOFFSET (CAST ((SELECT MIN(snapshot_time) FROM core.snapshots) AS datetimeoffset(7)), '+00:00'));

    END

    END

    DECLARE @end_snapshot_time_id int;

    SELECT @end_snapshot_time_id = MAX(snapshot_time_id) FROM core.snapshots WHERE snapshot_time <= @end_time;

    DECLARE @start_snapshot_time_id int;

    SELECT @start_snapshot_time_id = MIN(snapshot_time_id) FROM core.snapshots WHERE snapshot_time >= @start_time;

    DECLARE @interval_sec int;

    SET @interval_sec = DATEDIFF (s, @start_time, @end_time);

    DECLARE @sql_handle-2 varbinary(64)

    SET @sql_handle-2 = snapshots.fn_hexstrtovarbin (@sql_handle_str)

    SELECT

    REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (REPLACE (

    LEFT (LTRIM (stmtsql.query_text), 100)

    , CHAR(9), ' '), CHAR(10), ' '), CHAR(13), ' '), ' ', ' '), ' ', ' '), ' ', ' ') AS flat_query_text,

    t.*,

    master.dbo.fn_varbintohexstr (t.sql_handle) AS sql_handle_str,

    stmtsql.*

    FROM

    (

    SELECT

    stat.sql_handle, stat.statement_start_offset, stat.statement_end_offset, snap.source_id,

    SUM (stat.snapshot_execution_count) AS execution_count,

    SUM (stat.snapshot_execution_count) / (@interval_sec / 60) AS executions_per_min,

    SUM (stat.snapshot_worker_time / 1000) AS total_cpu,

    SUM (stat.snapshot_worker_time / 1000) / @interval_sec AS avg_cpu_per_sec,

    SUM (stat.snapshot_worker_time / 1000.0) / CASE SUM (stat.snapshot_execution_count) WHEN 0 THEN 1 ELSE SUM (stat.snapshot_execution_count) END AS avg_cpu_per_exec,

    SUM (stat.snapshot_physical_reads) AS total_physical_reads,

    SUM (stat.snapshot_physical_reads) / @interval_sec AS avg_physical_reads_per_sec,

    SUM (stat.snapshot_physical_reads) / CASE SUM (stat.snapshot_execution_count) WHEN 0 THEN 1 ELSE SUM (stat.snapshot_execution_count) END AS avg_physical_reads_per_exec,

    SUM (stat.snapshot_logical_writes) AS total_logical_writes,

    SUM (stat.snapshot_logical_writes) / @interval_sec AS avg_logical_writes_per_sec,

    SUM (stat.snapshot_logical_writes) / CASE SUM (stat.snapshot_execution_count) WHEN 0 THEN 1 ELSE SUM (stat.snapshot_execution_count) END AS avg_logical_writes_per_exec,

    SUM (stat.snapshot_elapsed_time / 1000) AS total_elapsed_time,

    SUM (stat.snapshot_elapsed_time / 1000) / @interval_sec AS avg_elapsed_time_per_sec,

    SUM (stat.snapshot_elapsed_time / 1000.0) / CASE SUM (stat.snapshot_execution_count) WHEN 0 THEN 1 ELSE SUM (stat.snapshot_execution_count) END AS avg_elapsed_time_per_exec,

    COUNT(*) AS row_count, COUNT (DISTINCT stat.creation_time) AS plan_count

    FROM snapshots.query_stats AS stat

    INNER JOIN core.snapshots snap ON stat.snapshot_id = snap.snapshot_id

    WHERE

    snap.instance_name = @instance_name

    AND stat.sql_handle = @sql_handle-2

    AND stat.statement_start_offset = @statement_start_offset

    AND stat.statement_end_offset = @statement_end_offset

    AND snap.snapshot_time_id BETWEEN @start_snapshot_time_id AND @end_snapshot_time_id

    GROUP BY stat.sql_handle, stat.statement_start_offset, stat.statement_end_offset, snap.source_id

    ) t

    LEFT OUTER JOIN snapshots.notable_query_text sql ON t.sql_handle = sql.sql_handle and sql.source_id = t.source_id

    OUTER APPLY snapshots.fn_get_query_text (t.source_id, t.sql_handle, t.statement_start_offset, t.statement_end_offset) AS stmtsql

    -- These trace flags are necessary for a good plan, due to the join on ascending PK w/range filter

    OPTION (QUERYTRACEON 2389, QUERYTRACEON 2390)

    END

    GO

    Hopefully this will get fixed in an upcoming patch of 2012.

  • Thank you for providing us this solution. I was looking to this issue for several weeks. My query_stats table is also very big: 30,000,000 rows for only 1 managed instance with 7 days data retention.

    I put page compression on the table, and changed the frequency of the data collections to 120 sec.

    Tried to add indexes from the DMV missing indexes, but nothing helped. I faced the same problem as you, having to wait more than 10 minutes to get the query details report. Now with the 2008R2 version of rpt_query_stats, I get it in 5 seconds.

    You make me very very happy today. FYI, my SQL 2012 version is 11.0.3393.

  • You also made my day a bg thnaks... :o)..., I had actuallly made a MS Support case, but they couldn't help me...

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply