High PLE (20K+) and Low Buffer Cache hit (22%)

  • Hi

    We have a virtual SharePoint DB server with 16GB RAM and 1 socket with 4 virtual CPUs running at 2GHz.

    Yesterday around 6:30am the CPU on this box started going haywire (90%). Looking at the performance stats for SQL, I see a very high PLE, in the region of 20,000, and a very low buffer cache hit ratio of 22%.

    Yet on the UAT box with the same build, it was running at a PLE of 100-200 on average and a cache hit ratio of 90%+, which is what I expect to see.

    I've suggested doubling the memory to 32GB, but I think this is papering over the cracks, and I'd like to know if there is anything else that can be done.

    Regards

    Jason.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • What version of SharePoint? Is this virtual box standalone, or do you have other DBs residing alongside it? Have you followed all of the best practices, or run the Best Practices Analyzer? SharePoint is pretty good about giving error information in its logs in more recent versions; have you discussed this with your SharePoint administrator?

  • Jason-299789 (4/17/2015)

    Quick question: what are the SQL Server memory configurations (min and max server memory)? It looks to me like the OS is somewhat short of memory; 16GB of RAM suggests around 8GB max memory for SQL Server.
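
    If you want to confirm what the instance is actually set to, a quick check against sys.configurations (just a sketch) will show both values:

        -- Current min/max server memory settings (values are in MB)
        SELECT name, value_in_use
        FROM sys.configurations
        WHERE name IN ('min server memory (MB)', 'max server memory (MB)');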

    😎

  • It's a SharePoint 2013 install.

    I didn't set up the SharePoint boxes. As far as I can tell, the SQL Server box is a single box within the farm and holds all of SharePoint's databases.

    I have no idea if there are other virtual boxes on the physical host, but I would presume there are, or that the physical server is dedicated to holding all the virtual servers that make up the farm. The outsourced infrastructure guys are not very helpful.

    I've sat with the SharePoint admin and he's checked the logs but can't see anything specific. Even with the crawl paused there was a lot of activity from the crawl service hitting the SQL service; a trace on the server showed a request being fired every few seconds. This didn't happen on the UAT environment when a crawl was paused.

    Eirikur,

    The SQL Server had a max memory setting of 12GB (~75%), with the min left at the default. Perfmon showed heavy but fast disk IO (6-7ms); I also noticed that MaxDOP was set to 1.

    We have requested an extra 16GB to be assigned to the virtual server, but as an interim measure we restarted the SQL Server service, and since then the PLE and CHR are more in line with what I would expect: 150-200ms and 95%+ respectively.

    The DMS crawl also went from 0.6 docs per second up to around 7 per second, with the odd peak of 25/sec.

    I'll catch up with the SharePoint admin this morning to see how things ran overnight and whether the crawl is still performing well.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • MaxDOP on SharePoint should be set to 1; to my knowledge this has not changed with 2013. I'm not so much interested in the physical setup of the virtual environment, though that could prove helpful. At this point, what I am after from the database side is whether you have a pure SharePoint workload or are mixing in other applications. SharePoint does best on SQL Server when it is isolated, due in part to the specific configurations it requires.
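
    If you need to verify or correct that setting, a sketch like this will do it (requires rights to change server-level options):

        -- Check and, if necessary, set MaxDOP to 1 for a SharePoint instance
        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;
        EXEC sp_configure 'max degree of parallelism', 1;
        RECONFIGURE;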

    Let's expand our view of things a bit: what do your disk metrics look like in UAT vs PROD? Take a look at this article https://www.simple-talk.com/sql/database-administration/great-sql-server-debates-buffer-cache-hit-ratio for some background on the counter you are using. To me this sounds like you might have a disk throughput issue, and memory would only mask it. To test this theory, grab your virtual file stats per database and store them in a table on a regular basis; you are the best judge, given your workload, of whether that should be every 15 minutes, or more or less frequently, for troubleshooting. From there, take the differential from timestamp 1 to timestamp 2, and that is your I/O per file. Use https://msdn.microsoft.com/en-us/library/ms190326.aspx?f=255&MSPPError=-2147217396 as a reference for the DMV for the virtual file stats; a sketch of the capture is at the end of this post. If you need help gathering these metrics, let me know. Once you have them, look to see whether you are pushing a large volume of data and also seeing large io_stall_* values on a specific file or across a disk in general, and take that back to your infrastructure folks.

    I hope this helps you out; please let me know if you need more information.
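
    Here is that sketch of the capture; the table and column names are my own for illustration, and you would schedule the INSERT with an Agent job at whatever interval suits your workload:

        -- One-time setup: a table to hold snapshots of the virtual file stats
        CREATE TABLE dbo.FileStatsSnapshot (
            capture_time         DATETIME2 NOT NULL DEFAULT SYSDATETIME(),
            database_id          INT       NOT NULL,
            file_id              INT       NOT NULL,
            num_of_reads         BIGINT    NOT NULL,
            num_of_bytes_read    BIGINT    NOT NULL,
            io_stall_read_ms     BIGINT    NOT NULL,
            num_of_writes        BIGINT    NOT NULL,
            num_of_bytes_written BIGINT    NOT NULL,
            io_stall_write_ms    BIGINT    NOT NULL
        );

        -- Run on a schedule (e.g. every 15 minutes) to take a snapshot
        INSERT INTO dbo.FileStatsSnapshot
            (database_id, file_id, num_of_reads, num_of_bytes_read, io_stall_read_ms,
             num_of_writes, num_of_bytes_written, io_stall_write_ms)
        SELECT database_id, file_id, num_of_reads, num_of_bytes_read, io_stall_read_ms,
               num_of_writes, num_of_bytes_written, io_stall_write_ms
        FROM sys.dm_io_virtual_file_stats(NULL, NULL);

        -- Differential between consecutive snapshots = I/O per file per interval
        WITH s AS (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY database_id, file_id
                                         ORDER BY capture_time) AS rn
            FROM dbo.FileStatsSnapshot
        )
        SELECT cur.capture_time,
               DB_NAME(cur.database_id) AS database_name,
               cur.file_id,
               cur.num_of_bytes_read    - prev.num_of_bytes_read    AS bytes_read,
               cur.io_stall_read_ms     - prev.io_stall_read_ms     AS read_stall_ms,
               cur.num_of_bytes_written - prev.num_of_bytes_written AS bytes_written,
               cur.io_stall_write_ms    - prev.io_stall_write_ms    AS write_stall_ms
        FROM s AS cur
        JOIN s AS prev
          ON prev.database_id = cur.database_id
         AND prev.file_id     = cur.file_id
         AND prev.rn          = cur.rn - 1
        ORDER BY cur.capture_time, database_name, cur.file_id;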

  • Jason-299789 (4/20/2015)

    In my experience, 4GB is not enough for the OS, especially if there is any other activity such as RDP/SharePoint etc. A PLE measured in ms means it's far too low, and although the CHR is close to acceptable, that still depends on the activity; 4-5% can really make a big difference. My guess is that the additional 16GB will more or less resolve the biggest part of the issue. I suggest you start 50/50 and closely monitor OS memory paging on one side and the SQL memory metrics on the other.

    The reason for the crawl being slow (0.6 docs/sec is slow unless the docs are very large) and for it having slowed down more than tenfold is simply memory pressure.
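
    If it helps with the monitoring, here is a minimal sketch pulling PLE and the memory targets from the performance counter DMV:

        -- Page Life Expectancy plus total vs. target server memory
        SELECT RTRIM(object_name)  AS object_name,
               RTRIM(counter_name) AS counter_name,
               cntr_value
        FROM sys.dm_os_performance_counters
        WHERE (object_name LIKE '%Buffer Manager%'
               AND counter_name = 'Page life expectancy')
           OR (object_name LIKE '%Memory Manager%'
               AND counter_name IN ('Total Server Memory (KB)',
                                    'Target Server Memory (KB)'));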

    😎

  • Thanks for all the help; it's given us a place to start looking deeper into the issues.

    To clarify a few points:

    1) The SQL install on the DB server is dedicated to SharePoint DBs and it's the only thing running. That said, knowing the infrastructure people who did the install, they've probably done a tick-box, next-next install.

    2) The PLE timing should have been in seconds, not ms, so it should have read 150-200 seconds (my bad!!!), still lower than I would expect but more acceptable than 20,000 seconds.

    3) Disk throughput read/write times are consistently 7-10ms, with peaks of 0.5 seconds under heavy load.

    4) Data throughput was pretty much balanced between reads and writes, at 250-300KB/s each. I'd like to get this higher, but I think we've been allocated a slice of a disk rather than dedicated disks on the SAN.

    I agree the memory is key, especially the max level, but I've generally worked with an initial 75/25 split for the DB/OS. That was mainly on Server 2003/2008; these boxes use Server 2012 as the OS, so I'm not so familiar with the setup.

    I was thinking of starting at a 60/40 split, so approx. 20GB for the DB and 12GB for the OS, and then running a monitor to check how things are going and adjusting accordingly.

    I checked yesterday with the admin and it's now idling along at 7 docs per second, but there are 4 docs causing errors, which the SP admin thinks could be memory pressure, as they are 3-4MB in size, or a corruption.

    _________________________________________________________________________
    SSC Guide to Posting and Best Practices

  • Personally, I like Jonathan Kehayias' calculation for determining a starting point for memory, and it is what I use for my production servers. Considering that you indicate SQL Server is running alone on the box, going to half of the OS memory should be unnecessary, but you can check the ring buffer to double-check that you are not under OS memory pressure. The link for Jonathan's article is https://www.sqlskills.com/blogs/jonathan/wow-an-online-calculator-to-misconfigure-your-sql-server-memory/. At a high level, the formula is: reserve 1GB for the OS, 1GB for each 4GB of RAM installed from 4-16GB, then 1GB for every 8GB of RAM installed above 16GB; the remainder becomes max server memory. I've found that this gives plenty of room for the OS in our environment, but as always your mileage may vary due to environmental differences; test the config in a non-prod environment, disclaimer, disclaimer and all that stuff.
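
    Worked through for the 32GB you have requested (my own arithmetic from the formula above, so treat it as a starting point; the ring buffer query is the OS-pressure check mentioned):

        -- 32GB server: 1GB (OS) + 3GB (4-16GB range at 1GB per 4GB)
        --            + 2GB (16-32GB range at 1GB per 8GB) = 6GB reserved,
        -- leaving roughly 26GB (26624MB) for max server memory
        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;
        EXEC sp_configure 'max server memory (MB)', 26624;
        RECONFIGURE;

        -- Resource monitor ring buffer: look for low-memory notifications
        SELECT record
        FROM sys.dm_os_ring_buffers
        WHERE ring_buffer_type = 'RING_BUFFER_RESOURCE_MONITOR';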

    From what you describe, I think you have two potential fronts to look at. First, is adding RAM going to be a bandage fix whose mileage may only last a short while before you are right back where you are now? I can't answer that for certain because I don't know your environment, but in heavily virtualized environments my experience points toward the second front, which is likely the root cause. The second thing to look at is what the underlying storage is doing from an I/O perspective compared to its maximum potential IOPS. In many cases you may find that the storage is saturated, and while adding RAM may alleviate some of the problem, it can't eliminate all of it. If I were in your position, I would try to set up a meeting with the infrastructure folks to see what is going on with the storage. In my opinion, 7-10ms consistent wait time is at the upper end of acceptable, and combined with the cache issue you are seeing and the spikes in latency, this points to a storage subsystem issue.
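
    As a conversation starter for that meeting, the cumulative averages per file since the last restart come straight from the same virtual file stats DMV (again, just a sketch):

        -- Average read/write latency per database file since instance startup
        SELECT DB_NAME(vfs.database_id) AS database_name,
               mf.physical_name,
               vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
               vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
        FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
        JOIN sys.master_files AS mf
          ON mf.database_id = vfs.database_id
         AND mf.file_id     = vfs.file_id
        ORDER BY avg_read_ms DESC;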

    I hope this helps you out.
