SQL Server 2012 Enterprise, in an OLTP environment.
We're seeing the following in the error log several times per hour. Has anyone experienced this?
FlushCache: cleaned up 513008 bufs with 448519 writes in 75641 ms (avoided 92621 new dirty bufs) for db 10:0
average throughput: 52.99 MB/sec, I/O saturation: 104, context switches 5758
last target outstanding: 80800, avgWriteLatency 0
The system is generating these FlushCache messages and we're seeing severe performance degradation at the same time. DBCC FREEPROCCACHE isn't the answer to this issue.
FlushCache == Checkpoint. This message is logged when a checkpoint takes longer than the recovery interval. It's more of a warning than an actual error.
In your example above, SQL Server wrote about 4 GB (513,008 × 8 KB pages) of dirty pages to disk, taking 75.6 seconds in the process. Step backwards to right before the checkpoint started: you had 4 GB of dirty data in the buffer pool that hadn't been flushed to the data file(s) (of course it was already in the log file on disk). If a crash occurred right then, when SQL Server attempted to recover the database it would have to roll forward/roll back the transactions associated with that 4 GB of dirty data to bring the database current. Just writing the 4 GB of data to disk takes 75 seconds, so the total time to run recovery will very likely be at least 75 seconds. If the recovery interval is 60 seconds, then we're going to take longer to recover the database than we expected.
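The numbers in the FlushCache line check out against the arithmetic above. A quick sanity-check query (plain arithmetic, no tables needed):

```sql
-- Back-of-envelope math from the FlushCache message:
-- 513008 pages * 8 KB/page, written in 75641 ms.
SELECT
    513008 * 8 / 1024.0 / 1024.0                 AS gb_written,       -- ~3.91 GB
    75641 / 1000.0                               AS seconds_elapsed,  -- ~75.6 s
    (513008 * 8 / 1024.0) / (75641 / 1000.0)     AS mb_per_sec;       -- ~52.99, matches the logged throughput
```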
And that's the problem: You're dirtying data in the buffer pool faster than the checkpoint process can handle within your recovery interval.
Extreme Example: Consider a 2 TB database on a server with 512 GB of RAM. The logs could be on SSDs, but the data files are on a slow RAID 6 array. You update 20 GB of data and it takes 20 seconds (fast log writes to the SSDs, yay!). Now there are 20 GB of dirty pages in the buffer pool that need to be written to the slow RAID 6 array. They are likely to be mostly 8 KB random writes. If you were able to get 1000 write IOPS out of your RAID 6 array (not likely, but I'm being overly optimistic), then you'd be looking at 40+ minutes to write that data to disk. In other words, if your cluster failed over right after your "20 second" update, it would take over 40 minutes to recover.
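Same back-of-envelope approach for the extreme example, assuming one I/O per dirty 8 KB page:

```sql
-- 20 GB of dirty 8 KB pages at ~1000 random write IOPS:
SELECT
    20 * 1024 * 1024 / 8                  AS dirty_pages,         -- 2,621,440 pages
    20 * 1024 * 1024 / 8 / 1000           AS seconds_at_1k_iops,  -- ~2,621 s
    20 * 1024 * 1024 / 8 / 1000 / 60.0    AS minutes;             -- ~43.7 minutes
```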
If your recovery interval is 60 seconds (the default), are you okay with recovery taking 75 seconds? 120 seconds? 300 seconds? If so, then you can safely ignore these messages.
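To see what your instance is currently configured for (0 means the default, roughly a one-minute target), you can check the `recovery interval (min)` option; it's an advanced option, so it has to be made visible first:

```sql
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'recovery interval (min)';  -- run_value 0 = default (~1 minute target)
```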
If you're getting server performance issues at the same time, then keep reading. Checkpoints aren't supposed to cause a negative performance impact to running queries. If they are, then your problem is IO.
Start capturing perfmon counters related to disk IO (LogicalDisk(*) and/or PhysicalDisk(*)), plus any SQL Server counters related to disk IO (Page Reads/sec, Checkpoint Pages/sec, Log Flushes/sec, etc.). A 1-second sample interval is best; use no more than 5 seconds.
Watch how Current Disk Queue Length, Disk Reads/sec, Disk Writes/sec, Avg Disk Sec/Read, Avg Disk Sec/Write, etc. correlate with the SQL Server counters.
What are your baselines?
What happens to your current disk queue length and latency (Avg Disk Sec/Read, Avg Disk sec/write) when a checkpoint occurs?
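Alongside perfmon, you can watch write latency from inside SQL Server with `sys.dm_io_virtual_file_stats`. The stalls are cumulative since startup, so sample it before and during a checkpoint and diff the numbers:

```sql
-- Per-file write counts and cumulative write stalls since startup.
-- Re-run during a checkpoint and compare to see latency spike.
SELECT DB_NAME(vfs.database_id) AS db,
       mf.physical_name,
       vfs.num_of_writes,
       vfs.io_stall_write_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY vfs.io_stall_write_ms DESC;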
Using SAN storage? Involve your SAN admin.
To resolve this issue, you have several options:
1. Dirty fewer pages (drop extra indexes, use compression, tune queries, etc.). Fewer dirty pages means less work for each checkpoint.
2. Reduce overall IO load (add memory to reduce reads/sec, move a busy tempdb to a different drive, tune queries, etc.).
3. Increase IO write capacity (extra spindles in the SAN, add SSDs, switch from RAID 5 to RAID 10, etc.).
4. Smooth out the checkpoint's IO load (set a really high recovery interval and perform manual checkpoints on a schedule). Don't go here until you've got a really good handle on the perfmon counters above and can prove that this helps.
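A sketch of option 4, assuming a hypothetical database name. Raising the recovery interval makes automatic checkpoints back off, and the optional `checkpoint_duration` argument asks SQL Server to spread the checkpoint's writes over roughly that many seconds instead of flushing as fast as possible. The trade-off is a longer crash-recovery time, which is exactly why you need the perfmon evidence first:

```sql
-- Raise the recovery interval so automatic checkpoints back off:
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'recovery interval (min)', 5;  -- example value; pick based on your recovery-time tolerance
RECONFIGURE;

-- Then, in a scheduled job, run your own checkpoints:
USE YourDatabase;  -- hypothetical name
CHECKPOINT 60;     -- checkpoint_duration: target ~60 seconds of smoothed writes
```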