Checkpoints and Performance

  • I am an application developer, not a DBA but my company doesn't have one so pardon my ignorance....

    In a 5 minute sampling, the following activity occurs in my DB

    7900 Selects

    112 Inserts

    913 Updates

    Usually about every hour or so, but closer to every 20 minutes during busy times, SQL performs its automatic checkpoints. I can see checkpoints/sec climb (through Perfmon), and the avg disk queue spikes to 300-500. During that time, performance for my end users is hideous, with long wait times.

    Are the checkpoints only flushing the insert and update commands to the hard disk, or do they process the SELECT statements also? Is the frequency determined by how many statements run, how many records are affected, or the amount of data being updated?

    I tried changing the recovery interval to 10 minutes because I thought that would make the checkpoints happen more often and not last so long, but I never saw any checkpoints/sec after that... and my users started complaining of slowness.

    Any insight, tips and tricks, or reference would be extremely helpful. Thanks in advance....

  • SQL Server creates a checkpoint in order to keep the recovery time to a minimum (as you've already read if you're getting into changing the recovery interval). When a checkpoint is issued, SQL Server writes all dirty pages to disk, so any data modifications that have been made to the data pages in the memory buffers are written to the data file. SELECT statements should not have an effect on this.

    Changing the recovery interval should be a last resort. Also, by lengthening the recovery interval to 10 minutes (the default is 1), you are not increasing the frequency of the checkpoints; you are increasing the time between them. When a checkpoint occurs, there will be more dirty pages to flush than when it is set at 1 minute.
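
    For reference, the recovery interval is a server-wide advanced option. A quick sketch of how to inspect and reset it (run as sysadmin; the option name below is the standard `sp_configure` name):

    ```sql
    -- Inspect and reset the server-wide recovery interval.
    -- It is an advanced option, so 'show advanced options' must be on first.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    EXEC sp_configure 'recovery interval (min)';     -- show the current setting
    EXEC sp_configure 'recovery interval (min)', 0;  -- 0 = default (roughly 1 minute)
    RECONFIGURE;
    ```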

    Is this a production system? Can you tell us a bit about the storage subsystem? If checkpoints are causing disk I/O bottlenecks, the checkpoint is not the problem; the disk throughput is.

    John Rowan

    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • What do you want to know about the system?

    The DB server is live. It has 2 SATA II hard drives set up with mirroring: disk 1 is mirrored to disk 2, so technically we have one hard drive to work with. It is partitioned to separate the OS/applications, paging, and data.

    We are looking at swapping to a 6-hard-drive server that is currently being used as a backup for something else....

    In your post, you said that setting a recovery interval of 10 minutes just increases the time between checkpoints?

    Not thinking about recovery... would a long time between checkpoints hurt performance? My users were complaining about slowness, but I never saw any checkpoints in the Perfmon log. Could I go the whole production day without a checkpoint and just issue the checkpoint command before and after production time?

  • This sounds like a disk bottleneck to me. You've got your OS, paging file, and DB files all on one physical disk. Regardless of how you logically partition the disk, they are on one disk and sharing the same spindle.

    Regarding the checkpoint interval, SQL Server will issue checkpoints as often as it needs to in order to keep the (estimated) recovery time within the recovery interval setting. If the recovery interval is left at the default of 1, SQL Server will issue checkpoints more often so as to keep the recovery time down to 1 minute. If you change the recovery interval to 10, SQL Server will not have to checkpoint as often, as you are now saying it is OK for SQL Server to take up to 10 minutes to go through recovery. Keep in mind that the recovery process happens when you start the SQL Server service, so what you are really saying is that you are OK with it taking 10 minutes to start SQL Server.

    Increasing the time between checkpoints will only increase the amount of work a checkpoint has to do. I would recommend leaving the recovery interval at the default of zero (meaning 1 minute or less) so that your checkpoints have less work to do. If your checkpoint activity is so heavy that your disk queue lengths are hideous, it may be time to consider faster disks.

    If your users are complaining about slowness, I would run a perfmon and SQL trace and try to capture what is happening at the time of slowness.

    John Rowan

    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • John Rowan (7/10/2008)


    This sounds like a disk bottleneck to me. You've got your OS, paging file, and DB files all on one physical disk. Regardless of how you logically partition the disk, they are on one disk and sharing the same spindle.

    Agreed.

    Check the following perfmon counters please

    Physical disk:Avg sec/read

    Physical disk:Avg sec/write

    Physical disk:% idle time

    What's the chance of getting more drives for this server?

    Could I go the whole production day without a checkpoint and just issue the checkpoint command before and after production time?

    Theoretically yes, but if you get a service restart for any reason, you could end up with a couple of hours of downtime, as SQL does recovery of several hours' worth of transactions before bringing the DB online.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I know the hardware limitations are an issue, but until I can get a request through approvals, appropriations, and budgeting... I have to work with what I've got. Sorry.

    Again questions...

    1. Does the SQL checkpoint cover all transactions or only write transactions? Is it triggered by how many transactions there are, or by some size criteria?

    2.

    Example:

    In our system, every time a user saved an account the system looked to see if a handful of fields had changed. For each value that changed, it did a lookup to SQL to get the old value. Then it appended the change information to an audit log (basically an unlimited field).

    When we disabled our field change logging temporarily, things got a lot better in terms of performance.

    So was it the lower lookups that made the checkpoints less of an issue? Or was it the update to the memo field?

    3. Currently we have the recovery interval set to 15 minutes so that we get fewer checkpoints during the day. Does this hurt performance? Why and how?

    4. What about if I set up a job to run a CHECKPOINT statement every few minutes? Could that help? How and why?

    Thanks for helping a desperately uninformed Application Developer.... Thanks again.

  • 1. As we've discussed, at a checkpoint SQL Server writes all dirty pages to disk. Dirty pages are pages that have been modified in memory but not yet written to the data file. SELECT statements do not modify data and therefore do not create dirty pages. All INSERT, UPDATE, and DELETE statements create dirty pages that could/would impact the checkpoint.

    2. I'm guessing that these audits are being done in triggers and that they are using the inserted/deleted pseudo-tables? When SQL Server needs to reference the inserted/deleted tables, it must go to the transaction log to build them. The transaction log allows sequential access only, meaning SQL Server must start at the end of the log and read back through it until it has the tables built. This means extra I/O and overhead during the trigger execution. Disabling the audits would shorten the length of all transactions where an audit may take place, give you better performance, and alleviate some of the I/O burden. If your disk is the bottleneck, I can see how this would help a bunch.
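
    To illustrate, a trigger-based audit of the kind described might look roughly like this (the table, column, and log names are all hypothetical):

    ```sql
    -- Hypothetical audit trigger: the deleted/inserted pseudo-tables it reads
    -- are built by SQL Server from the transaction log, adding I/O on top of
    -- the data modification itself.
    CREATE TRIGGER trg_Account_Audit ON dbo.Account
    AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        INSERT INTO dbo.AuditLog (AccountID, OldValue, NewValue, ChangedAt)
        SELECT d.AccountID, d.SomeField, i.SomeField, GETDATE()
        FROM deleted AS d
        JOIN inserted AS i ON i.AccountID = d.AccountID
        WHERE ISNULL(d.SomeField, '') <> ISNULL(i.SomeField, '');
    END;
    ```

    Each audited column adds another comparison and another write, which is consistent with performance improving when the field-change logging was disabled.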

    So was it the lower lookups that made the checkpoints less of an issue? Or was it the update to the memo field?

    From #1 above, lookups (SELECTs) do not create dirty pages and have no impact on the checkpoint, so it was probably the additional overhead from the modifications done by the triggers, in combination with the extra I/O put on the transaction log during trigger execution. At that point, SQL Server is attempting to access the transaction log and the data file at the same time, which means contention for you, since they are on the same disk.

    3. Setting the recovery interval to 15 minutes means that SQL Server will perform checkpoints less frequently than it would using the default setting. This may or may not have an impact on performance; the answer is 'it depends'. Less frequent checkpoints could mean that there is more work to do at each checkpoint and you'll experience more problems. But checkpoints aren't the only time SQL Server writes dirty pages to disk: there is also a lazy writer process that writes dirty pages to disk whenever the cache space is needed for another process or query. So depending on how often the lazy writer is working between your elongated checkpoints, they may not be as severe (but then again, lazy writes/sec consistently above 20 suggests a possible memory bottleneck).

    So again, it depends. You might consider making the change and monitoring it very closely with Perfmon, watching lazy writes/sec, checkpoint pages/sec, page life expectancy, and disk queue length.

    4. Same as #3 above. It may, it may not. There are more factors here than just the checkpoint. I hope I'm not confusing you more.

    My personal opinion here is that focusing on the checkpoint frequency and impact is not the correct fix; it's just a very short-term band-aid. Your problem here is your disk configuration/size/speed. You've got a disk bottleneck, and tweaking the behavior of SQL Server to make up for it simply is not a viable long-term fix.

    John Rowan

    ======================================================
    Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden

  • bumoftheday (7/18/2008)


    I know the hardware limitations are an issue but until I can get a request through approvals, appropriations, and budgeting... I have to work with what I got.... Sorry..

    Do you have a request in? What disk setup are you planning to get?

    Be careful with the recovery time. The longer you set it, the longer SQL will take to come online if the service stops for any reason. Setting it to 15 minutes means you could have at least 15 minutes of downtime should the service fail for any reason.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Checkpoints aren't the only time SQL Server writes dirty pages to disk. There is also a lazy writer process that helps with writing dirty pages to disk whenever the cache space is needed for another process or query. So depending on how often the lazy writer is working between your elongated checkpoints, they may not be as severe (but then again, lazy writes/sec consistently above 20 means a possible memory bottleneck).

    So again, it depends. You may consider making the change and monitoring it very closely with Perfmon to watch lazy writes/sec, checkpoint pages/sec, page life expectancy, and disk queue length.

    Through other posts, we have Perfmon watching the server for avg disk queue length (reads and writes), checkpoints/sec, and a few other basic ones...

    Which sections in Perfmon do I look in to find the counters that monitor lazy writes/sec and page life expectancy?

    Any other useful monitors/counters that could help?

  • Disk queue length isn't that great a counter to watch. SQL can and will drive it high with read-ahead reads, and if you're on a SAN it's hard to interpret.

    Personally, I like the Disk avg sec/read and avg sec/write (both should be 10-50 ms) and the % idle time (the higher the better). You can find lazy writes/sec, checkpoint pages/sec, and page life expectancy under SQLServer:Buffer Manager.
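
    If querying is more convenient than Perfmon, the same Buffer Manager counters are also exposed through a DMV in SQL Server 2005 and later; a sketch (verify the exact counter-name spellings on your build):

    ```sql
    -- Read the Buffer Manager counters straight from the server.
    -- Page life expectancy is a point-in-time value; the '/sec' counters
    -- are cumulative, so sample them twice and take the difference.
    SELECT counter_name, cntr_value
    FROM sys.dm_os_performance_counters
    WHERE object_name LIKE '%Buffer Manager%'
      AND counter_name IN ('Lazy writes/sec',
                           'Checkpoint pages/sec',
                           'Page life expectancy');
    ```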

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Could I go the whole production day without a checkpoint and just issue the checkpoint command before and after production time?

    How much RAM do you have on this system? How much is actually allocated to SQL Server? How much is actually being used day-to-day? Before the dirty pages are written to disk, they sit in the buffer cache. If you leave them all there for a day, you might run out of RAM. I would only recommend issuing a manual checkpoint from time to time, when necessary. But again, if your bottleneck is disk I/O, it will not help in the long run. Your eventual goal is to have the system run smoothly on autopilot.
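
    For what it's worth, a manual checkpoint is just the CHECKPOINT statement run in the context of the database you want flushed; a sketch (the database name is hypothetical):

    ```sql
    USE YourDatabase;   -- hypothetical database name
    CHECKPOINT;         -- flush this database's dirty pages now
    -- CHECKPOINT 30;   -- optionally request it complete within roughly
                        -- 30 seconds (SQL Server 2005 and later)
    ```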

  • 2-year-old thread.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • This is a very informative thread! I am dealing with a similar issue, and according to your specs, my avg disk writes and reads are off the charts: they are pegging 100 for extended periods. We are running separate spindles for data files, system files, and log files on a SAN, and we still see performance hits on a production DB. I too have been monitoring the checkpoint in sysprocesses, and ours hangs around 274125 in disk I/O and 418625 in CPU. It stays at this level throughout the day. Does this indicate a poor disk setup?

  • The values in sysprocesses are cumulative since that session started. Since checkpoint runs on a system SPID, it never disconnects.

    Hence checkpoint has done 274125 in disk I/O and 418625 in CPU since SQL Server last started.
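
    The figures being discussed come from something like the following (sysprocesses is the older compatibility view; on 2005 and later the newer DMVs work too):

    ```sql
    -- Cumulative physical I/O and CPU for the system checkpoint session
    -- since the service last started; the session never disconnects.
    SELECT spid, cmd, physical_io, cpu, login_time
    FROM master.dbo.sysprocesses
    WHERE cmd = 'CHECKPOINT';
    ```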

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks. Could you tell me how to interpret the high disk secs/read and secs/write numbers?
