MDW Database bloats; Purge job runs continuously

  • Hi All,

    I have a SQL 2008 R2 Enterprise instance (named SQLUTILITY) which is monitoring six other SQL 2008 R2 instances. This instance has a single database: Sysutility_MDW. The database was created when I ran the wizard to make it a Utility Control Point. I then ran the UCP wizard on each monitored instance to point to SQLUTILITY.

    I then also enabled Data Collection on each of the monitored instances to gather additional statistics (the Server Activity, Query Statistics, and Disk Usage collection sets). The collectors and the UCP have been running fine on each instance since I set them up on January 18th, 2011.

    However, today is February 1st, 2011, and the Sysutility_MDW database is now 300GB in size. In SSMS I can see that the data file is nearly full, too; there is only 1% free space. I find this a bit shocking. The monitored instances are not particularly busy, and all the collection schedules are at their default settings. I had expected a monitoring database to be a few gigabytes in size.
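
    (For anyone wanting to check the same thing, below is a rough sketch of how to see the file sizes and free space from T-SQL using the standard catalog views; run it in the MDW database.)

        USE sysutility_mdw;
        SELECT
            name AS logical_file_name,
            type_desc,
            size / 128 AS size_mb,                                    -- size is reported in 8KB pages
            FILEPROPERTY(name, 'SpaceUsed') / 128 AS used_mb,         -- pages actually in use
            (size - FILEPROPERTY(name, 'SpaceUsed')) / 128 AS free_mb
        FROM sys.database_files;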

    I see that there is a daily job to purge the database, but it runs and runs and the database never shrinks.

    This is not my first implementation of MDW. We had previously set it up on our Development cluster as a clustered instance. The Purge job ran continuously and actually hammered the network because the database was on an iSCSI drive, and it saturated the switch with iSCSI disk traffic.

    We shut that down and created a new UCP/MDW solution on a dedicated instance, using local storage. Now it’s not interfering with anything else, and the reports it generates are great, but the database has not stopped growing. And the Purge job is constantly running.

    Things I’ve tried:

    · I Googled the issue and saw that there was a database bloat issue in SQL 2008 (R1), caused by orphaned records, that was supposedly corrected in SQL 2008 R2, which is what I am running (http://support.microsoft.com/kb/971823).

    · I found one person who said he improved performance by adding a nonclustered index, which I went ahead and created today (http://blogs.msdn.com/b/petersad/archive/2009/04/23/sql-server-data-collector-nightly-purge-can-leave-orphaned-rows.aspx). No idea how effective it will be. (A sketch of the index shape follows this list.)

    · I found a blog post (http://www.toddbaker.org/blog/2010/12/17/sql-2008-mdw-fixing-long-running-purges/) by Todd Baker who also had my problem. He solved it by modifying the stored procedure that handles purges ([Core].[SP_Purge_Data]) to use a temp table, allowing it to process records much more quickly. I copied my SP_Purge_Data to SP_Purge_Data2, cancelled my Purge_MDW SQL Agent job, and executed the sproc directly. It is still running.
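
    For reference, the index I created was along these lines. This is only a sketch of the general shape (the index name is made up, and the linked post has the exact definition); the key columns mirror the join the purge proc uses against snapshots.query_stats:

        CREATE NONCLUSTERED INDEX IX_query_stats_purge_helper  -- illustrative name
        ON snapshots.query_stats
        (
            [sql_handle],
            plan_handle,
            plan_generation_num,
            statement_start_offset,
            statement_end_offset,
            creation_time
        );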

    I have several questions:

    1. Is it supported, or advisable, to use a single MDW database for both UCP and the other data collectors? Could this be introducing orphaned records?

    2. Is there a recommended maximum number of instances to monitor per MDW solution? It occurs to me that the data collectors may be uploading data faster than the purge can keep up with.

    3. Has anyone tried the solution suggested by Todd Baker? Does it work? Is it supportable?

    4. Is there something else I am missing?

    (Cross posted to TechNet forum at: http://social.msdn.microsoft.com/Forums/en-US/sqltools/thread/9c21d232-d1ea-439e-a28b-919711d706bb)

  • Have you checked which table is using most of the space? I bet it is snapshots.notable_query_plan.
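
    If not, a quick sketch along these lines (run inside the MDW database, using sys.dm_db_partition_stats) will show where the space is going:

        -- Largest tables in the current database by reserved space
        SELECT TOP (10)
            s.name + '.' + t.name AS table_name,
            SUM(ps.reserved_page_count) * 8 / 1024 AS reserved_mb,
            SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.row_count ELSE 0 END) AS row_count
        FROM sys.dm_db_partition_stats AS ps
        JOIN sys.tables AS t ON t.object_id = ps.object_id
        JOIN sys.schemas AS s ON s.schema_id = t.schema_id
        GROUP BY s.name, t.name
        ORDER BY reserved_mb DESC;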

    I had the same issue in the past with 2008 (R1), and all I could do was delete the rows older than one month from that table and then rebuild the indexes regularly. I added the delete as a step in the purge job, and that kept the MDW database at a controlled size.
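
    The delete I added was roughly the following shape (a sketch only; I'm assuming creation_time is a fair stand-in for row age, and the batch size is arbitrary):

        DECLARE @rows int = 1;
        WHILE (@rows > 0)
        BEGIN
            -- Delete month-old plan rows in small batches to limit log growth and lock escalation
            DELETE TOP (500) FROM snapshots.notable_query_plan
            WHERE creation_time < DATEADD(MONTH, -1, GETUTCDATE());
            SET @rows = @@ROWCOUNT;
        END
        -- Then rebuild the indexes on the table, e.g.:
        -- ALTER INDEX ALL ON snapshots.notable_query_plan REBUILD;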

    I haven't used it on R2, but I was hoping the issue was fixed. Please keep us posted if you find a proper fix.

  • No response from Microsoft yet, so I don't know what the official fix for this would be.

    But I think I have a workaround in place. The modified stored procedure suggested in Todd Baker's blog post has worked for me. I manually ran it last night. It ran for 12 hours with the following effect:

    1. The primary data file is still sized at 212GB, but it now has 34% free space (vs. 1% before). I decided to leave it at its current size.

    2. The log file, which was 90GB and wouldn't let me shrink it (possibly due to pending transactions), went to 98% free and let me shrink it back to its original size of 10MB. (The shrink command I used appears after this list.)

    3. Apparently none of the statistical history is gone; I can still browse back to January 18th, when I first set it up.

    4. Reports are rendering much, much, much faster. For some reports, it used to take 30-60 seconds (or more) for the page to load. Now it's nearly instantaneous.

    5. UCP also seems to be working fine: it contains current data and it is performing faster, as well.
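
    (For item 2, the shrink itself was just the standard DBCC SHRINKFILE call. The logical log file name below is a guess, so check sys.database_files for the real one.)

        USE sysutility_mdw;
        -- Logical file name is an assumption; verify with: SELECT name, type_desc FROM sys.database_files;
        DBCC SHRINKFILE (N'sysutility_mdw_log', 10);  -- target size in MB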

    So I think I am ready to recommend "The Baker Plan" 🙂

    http://www.toddbaker.org/blog/2010/12/17/sql-2008-mdw-fixing-long-running-purges/

    I was a little bit uncertain as to where to paste in his modified code, so I took a guess. It works for me. Note that anything you change in the business logic of the MDW database could (and probably will) be overwritten, or rendered inoperable, when Microsoft releases any future service packs. So, caveat emptor.

    Below is my modified "sp_purge_data2" sproc:

    USE [sysutility_mdw]
    GO

    /****** Object: StoredProcedure [core].[sp_purge_data2] Script Date: 02/02/2011 08:29:15 ******/
    SET ANSI_NULLS ON
    GO
    SET QUOTED_IDENTIFIER ON
    GO

    CREATE PROCEDURE [core].[sp_purge_data2]
        @retention_days smallint = NULL,
        @instance_name sysname = NULL,
        @collection_set_uid uniqueidentifier = NULL,
        @duration smallint = NULL
    AS
    BEGIN
        -- Security check (role membership)
        IF (NOT (ISNULL(IS_MEMBER(N'mdw_admin'), 0) = 1) AND NOT (ISNULL(IS_SRVROLEMEMBER(N'sysadmin'), 0) = 1))
        BEGIN
            RAISERROR(14677, 16, -1, 'mdw_admin')
            RETURN(1) -- Failure
        END

        -- Validate parameters
        IF ((@retention_days IS NOT NULL) AND (@retention_days < 0))
        BEGIN
            RAISERROR(14200, -1, -1, '@retention_days')
            RETURN(1) -- Failure
        END

        IF ((@duration IS NOT NULL) AND (@duration < 0))
        BEGIN
            RAISERROR(14200, -1, -1, '@duration')
            RETURN(1) -- Failure
        END

        -- This table will contain a record if somebody requests the purge to stop.
        -- If the user requested us to purge data, we reset its contents and proceed with the purge.
        -- If somebody in a different session wants purge operations to stop, they add a record
        -- that we will discover while the purge is in progress.
        --
        -- We don't clear this flag when we exit, since multiple purge operations with different
        -- filters may proceed, and we want all of them to stop.
        DELETE FROM [core].[purge_info_internal]

        SET @instance_name = NULLIF(LTRIM(RTRIM(@instance_name)), N'')

        -- Calculate the time when the operation should stop (NULL otherwise)
        DECLARE @end_time datetime
        IF (@duration IS NOT NULL)
        BEGIN
            SET @end_time = DATEADD(minute, @duration, GETUTCDATE())
        END

        -- Declare the table that will be used to find the valid
        -- candidate snapshots that could be selected for purge
        DECLARE @purge_candidates table
        (
            snapshot_id int NOT NULL,
            snapshot_time datetime NOT NULL,
            instance_name sysname NOT NULL,
            collection_set_uid uniqueidentifier NOT NULL
        )

        -- Find candidates that match the retention_days criteria (if specified)
        IF (@retention_days IS NULL)
        BEGIN
            -- User did not specify a value for @retention_days, therefore we
            -- will use the default expiration day as marked in the source info
            INSERT INTO @purge_candidates
            SELECT s.snapshot_id, s.snapshot_time, s.instance_name, s.collection_set_uid
            FROM core.snapshots s
            WHERE (GETUTCDATE() >= s.valid_through)
        END
        ELSE
        BEGIN
            -- User specified a value for @retention_days; we will use this overridden value
            -- when deciding what is old enough to qualify for purge. This overrides
            -- the days_until_expiration value specified in the source_info_internal table
            INSERT INTO @purge_candidates
            SELECT s.snapshot_id, s.snapshot_time, s.instance_name, s.collection_set_uid
            FROM core.snapshots s
            WHERE GETUTCDATE() >= DATEADD(DAY, @retention_days, s.snapshot_time)
        END

        -- Walk the candidate snapshots from oldest to newest
        DECLARE oldest_snapshot_cursor CURSOR FORWARD_ONLY READ_ONLY FOR
            SELECT p.snapshot_id, p.instance_name, p.collection_set_uid
            FROM @purge_candidates p
            WHERE
                ((@instance_name IS NULL) OR (p.instance_name = @instance_name)) AND
                ((@collection_set_uid IS NULL) OR (p.collection_set_uid = @collection_set_uid))
            ORDER BY p.snapshot_time ASC

        OPEN oldest_snapshot_cursor

        DECLARE @stop_purge int
        DECLARE @oldest_snapshot_id int
        DECLARE @oldest_instance_name sysname
        DECLARE @oldest_collection_set_uid uniqueidentifier

        FETCH NEXT FROM oldest_snapshot_cursor
        INTO @oldest_snapshot_id, @oldest_instance_name, @oldest_collection_set_uid

        -- As long as there are snapshots that matched the time criteria
        WHILE @@FETCH_STATUS = 0
        BEGIN
            -- Filter out records that do not match the other filter criteria
            IF ((@instance_name IS NULL) OR (@oldest_instance_name = @instance_name))
            BEGIN
                -- No filter specified for instance_name, or the instance matches the filter
                IF ((@collection_set_uid IS NULL) OR (@oldest_collection_set_uid = @collection_set_uid))
                BEGIN
                    -- No filter specified for collection_set_uid, or the collection_set_uid matches the filter
                    BEGIN TRANSACTION tran_sp_purge_data

                    -- Purge data associated with this snapshot. Note: deleting this snapshot
                    -- triggers a cascade delete in all warehouse tables based on the foreign key
                    -- relationship to the snapshots table.
                    DELETE core.snapshots_internal
                    FROM core.snapshots_internal s
                    WHERE s.snapshot_id = @oldest_snapshot_id

                    COMMIT TRANSACTION tran_sp_purge_data

                    PRINT 'Snapshot #' + CONVERT(NVARCHAR(MAX), @oldest_snapshot_id) + ' purged.';
                END
            END

            -- Check if the execution of the stored proc has exceeded the @duration specified
            IF (@duration IS NOT NULL)
            BEGIN
                IF (GETUTCDATE() >= @end_time)
                BEGIN
                    PRINT 'Stopping purge. More than ' + CONVERT(NVARCHAR(MAX), @duration) + ' minutes passed since the start of operation.';
                    BREAK
                END
            END

            -- Check if somebody wanted to stop the purge operation
            SELECT @stop_purge = COUNT(stop_purge) FROM [core].[purge_info_internal]
            IF (@stop_purge > 0)
            BEGIN
                PRINT 'Stopping purge. Detected a user request to stop purge.';
                BREAK
            END

            -- Move to the next oldest snapshot
            FETCH NEXT FROM oldest_snapshot_cursor
            INTO @oldest_snapshot_id, @oldest_instance_name, @oldest_collection_set_uid
        END

        CLOSE oldest_snapshot_cursor
        DEALLOCATE oldest_snapshot_cursor

        PRINT 'Deleting orphaned rows from snapshots.notable_query_plan...'

        -- Delete orphaned rows from snapshots.notable_query_plan. Query plans are not deleted by the generic purge
        -- process that deletes other data (above) because query plan rows are not tied to a particular snapshot ID.
        -- Purge the query plans table and the smaller query text table as a special case, by looking for plans that
        -- are no longer referenced by any of the rows in the snapshots.query_stats table. We need to delete these
        -- rows in small chunks, since deleting many GB in a single delete statement would cause lock escalation and
        -- an explosion in the size of the transaction log (individual query plans can be 10-50MB).
        DECLARE @delete_batch_size bigint;
        DECLARE @rows_affected int;
        SET @delete_batch_size = 500;
        SET @rows_affected = 500;

        -- Wasn't sure if this WHILE clause was still needed
        WHILE (@rows_affected = @delete_batch_size)
        BEGIN
            -- Original correlated delete from the shipped proc, left commented out in favor of the temp-table approach below:
            -- DELETE TOP (@delete_batch_size) snapshots.notable_query_plan
            -- FROM snapshots.notable_query_plan AS qp
            -- WHERE NOT EXISTS (
            --     SELECT snapshot_id
            --     FROM snapshots.query_stats AS qs
            --     WHERE qs.[sql_handle] = qp.[sql_handle] AND qs.plan_handle = qp.plan_handle
            --         AND qs.plan_generation_num = qp.plan_generation_num
            --         AND qs.statement_start_offset = qp.statement_start_offset
            --         AND qs.statement_end_offset = qp.statement_end_offset
            --         AND qs.creation_time = qp.creation_time);
            -- SET @rows_affected = @@ROWCOUNT;
            -- IF (@rows_affected > 0)
            -- BEGIN
            --     RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_plan', 0, 1, @rows_affected) WITH NOWAIT;
            -- END

            ------------------------------------------------------
            ---------------- BEGIN NEW CODE
            -- Pasted from http://www.toddbaker.org/blog/2010/12/17/sql-2008-mdw-fixing-long-running-purges/
            ------------------------------------------------------
            -- Materialize the keys of all orphaned plans once, then delete against the temp table in batches
            SELECT
                sql_handle,
                plan_handle,
                plan_generation_num,
                statement_start_offset,
                statement_end_offset,
                creation_time
            INTO #nqp
            FROM snapshots.notable_query_plan qp
            WHERE NOT EXISTS (
                SELECT *
                FROM snapshots.query_stats AS qs
                WHERE qs.[sql_handle] = qp.[sql_handle]
                    AND qs.plan_handle = qp.plan_handle
                    AND qs.plan_generation_num = qp.plan_generation_num
                    AND qs.statement_start_offset = qp.statement_start_offset
                    AND qs.statement_end_offset = qp.statement_end_offset
                    AND qs.creation_time = qp.creation_time)

            SET @rows_affected = 1

            WHILE (@rows_affected > 0)
            BEGIN
                DELETE TOP (@delete_batch_size)
                FROM snapshots.notable_query_plan
                FROM #nqp n
                WHERE
                    n.sql_handle = notable_query_plan.sql_handle
                    AND notable_query_plan.plan_handle = n.plan_handle
                    AND notable_query_plan.plan_generation_num = n.plan_generation_num
                    AND notable_query_plan.statement_end_offset = n.statement_end_offset
                    AND notable_query_plan.statement_start_offset = n.statement_start_offset
                    AND notable_query_plan.creation_time = n.creation_time

                SET @rows_affected = @@ROWCOUNT;
                IF (@rows_affected > 0)
                BEGIN
                    RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_plan', 0, 1, @rows_affected) WITH NOWAIT;
                END
            END

            DROP TABLE #nqp
            ------------------------------------------------------
            ---------------- END NEW CODE
            ------------------------------------------------------

            -- Check if the execution of the stored proc has exceeded the @duration specified
            IF (@duration IS NOT NULL)
            BEGIN
                IF (GETUTCDATE() >= @end_time)
                BEGIN
                    PRINT 'Stopping purge. More than ' + CONVERT(NVARCHAR(MAX), @duration) + ' minutes passed since the start of operation.';
                    BREAK
                END
            END

            -- Check if somebody wanted to stop the purge operation
            SELECT @stop_purge = COUNT(stop_purge) FROM [core].[purge_info_internal]
            IF (@stop_purge > 0)
            BEGIN
                PRINT 'Stopping purge. Detected a user request to stop purge.';
                BREAK
            END
        END;

        -- Do the same purge process for query text rows in the snapshots.notable_query_text table.
        SET @rows_affected = 500;
        WHILE (@rows_affected = @delete_batch_size)
        BEGIN
            DELETE TOP (@delete_batch_size) snapshots.notable_query_text
            FROM snapshots.notable_query_text AS qt
            WHERE NOT EXISTS (
                SELECT snapshot_id
                FROM snapshots.query_stats AS qs
                WHERE qs.[sql_handle] = qt.[sql_handle]);

            SET @rows_affected = @@ROWCOUNT;
            IF (@rows_affected > 0)
            BEGIN
                RAISERROR ('Deleted %d orphaned rows from snapshots.notable_query_text', 0, 1, @rows_affected) WITH NOWAIT;
            END

            -- Check if the execution of the stored proc has exceeded the @duration specified
            IF (@duration IS NOT NULL)
            BEGIN
                IF (GETUTCDATE() >= @end_time)
                BEGIN
                    PRINT 'Stopping purge. More than ' + CONVERT(NVARCHAR(MAX), @duration) + ' minutes passed since the start of operation.';
                    BREAK
                END
            END

            -- Check if somebody wanted to stop the purge operation
            SELECT @stop_purge = COUNT(stop_purge) FROM [core].[purge_info_internal]
            IF (@stop_purge > 0)
            BEGIN
                PRINT 'Stopping purge. Detected a user request to stop purge.';
                BREAK
            END
        END;
    END
    GO

  • Microsoft responded to my post:

    Hello,

    We are working on fixing this issue in next cumulative update. Please let us know if you would be willing to test this fix on your test machine. I can send you the fixed T-SQL code.

    You can email me at sethu.srinivasan@microsoft.com

    Thanks

    Sethu Srinivasan

    SQL Server team

    http://blogs.msdn.com/sqlagent

    So, apparently a fix is in the works, and it's up for grabs.

  • I STRONGLY STRONGLY encourage you to stop using the MDW and purchase a third party product for your server monitoring. MDW is NOT ready for prime time and will not be so for years (probably not ever to be honest with you) - the proposed "fixes" notwithstanding.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Hi Kevin,

    Can you please explain why you believe it is such a bad idea to use MDW to collect data for monitoring reports/tools?

    And do you have any recommendations on which third party tools I should use instead?

  • Saw your post about MDW and I must agree about it not being ready.

    Meanwhile, I want a tool that will persist DMV data and allow me to identify which procs/queries were running during a given timeslot (frequency, IO costs, etc. for that period), so I can drill down for real-time problem analysis and also go back to a period in time later.

    I assume they all gather the straight metrics well, but which one does this best?

    Sorry for semi-hijacking the thread. Maybe I should start another?

    ...Ray

  • I noticed issues with MDW while testing 2008; I personally wouldn't touch it at present.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Update:

    I've been using MDW and UCP together for months now, and they have been working fine. Apparently the fix I used has made a big difference. I have eight instances reporting to a single utility server (using Data Collection agents and Utility Control Point agents) and the reports have been very useful for trending and baselining performance, identifying causes of waits, and tracking disk usage. No performance issues, database bloating, disk thrashing, or slow reports. My customers have been very happy with this solution.

    For what you get for free (or at least, with the SQL license), it serves me very well. So I don't know why I'm being advised to abandon it and invest thousands of dollars in a solution that may or may not achieve the same result.

    The point of this solution is to give my customers some metrics they can use, and to "dumb it down" a little so they can see the impact of bad indexes, expensive queries, and pressure from memory or CPU. It gives the database designers feedback on how the changes they make have impacted performance. True, it has required me to fiddle around and make changes to an imperfect monitoring solution, but since when has any of us shied away from that?

    The point is, I can never seem to convince my customers to buy tools from Idera or Redgate, even after demos and evaluation. But every time I've shown them MDW and UCP, they immediately liked it, especially now that I've stabilized it. Make of that what you will. I'm going to keep installing it.

  • I have been using MDW for 3 days, and my ManagementDW database is averaging 1GB/day growth.

    I am manually running the purge, but I don't seem to be getting the same space reduction, and I only have 11870 rows in notable_query_plan:

    Size (MB)   Row count   Table name
    433         11870       notable_query_plan

    I am going to continue using MDW, since we have the Microsoft stack of products, and see where it goes.

    I have copied the above stored procedure and executed it in my SQL editor window. The proc has been created in the database. What is the next step? Do we need to execute the proc to make free space available in this MDW database?

    Please help me

  • Yes, I would execute it and see if it helps. What version of SQL are you using?
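
    For reference, running it is just a direct call; you can cap how long it runs with @duration (the parameter values below are only examples):

        USE sysutility_mdw;
        -- Purge anything older than 14 days, but stop after 120 minutes (both values are illustrative)
        EXEC [core].[sp_purge_data2] @retention_days = 14, @duration = 120;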

    I would also recommend posting your question on the TechNet forums, because those are monitored by Microsoft employees and they generally respond within 24 hours. I suggest that because this is a very old thread and not many people will see your question. A new thread on TechNet is likely to get a better response.

    I never did figure out why the purge job wasn't working. The fix I posted is something I saw on a blog. I tried it and it worked for me. You might see some improvement but it sounds like your situation is much worse than mine. Consider opening a ticket with Microsoft support.

    Keep in mind that the change I made to Microsoft's stored procedure is not supported.
