There have been several articles covering what the resource governor is and how to set it up, so I will not repeat that theory. In this article I will focus only on how I used the resource governor to help with redundancy for our report server. The setup is currently being load tested in our QA environment, and so far the results have been quite satisfactory.
We have our main publisher database that serves 250,000 online clients along with our internal managers who set prices for different products. We see 3,500 batches per second during non-peak hours, climbing to 6,500 batches per second during peak hours. Our CPU hovers around 15% to 25%, and we do millions of dollars of volume every day. We replicate 90% of this data to a report server where all Customer Service and Operations reports are run. Our issue had been that if the report server or distributor were to fail, we could not handle the load of both Customer Service and our online clients running off our main database. My CTO handed me the project of finding a good solution that would work in this scenario.
Steps taken to mitigate this issue
When searching for viable solutions, I discovered I had a prejudice; I do not trust the SAN. This meant that clustering was out of the question for me since you cannot cluster without shared disks.
There are a few reasons for this prejudice. My main reason was that the SAN is like a black box. Historically when there has been an issue, I have not been able to get much support from the SAN Engineer. Secondly I lost trust in our SAN due to having lost some drives and the SAN Engineers being unable to figure out how or why the disks were lost.
Knowing I could not use high availability technology to help tackle the issue, I decided to study why we could not handle the load of both the online clients and Customer Service running their reports. I did quite a bit of monitoring on our report server. These were my findings.
- Parallelism - Many queries were running in parallel. Parallelism is not necessarily bad, but it can have an impact on performance.
- IO - There was a queue on the disk where the data file resided.
- Memory utilization - Memory usage was quite high.
- CPU - CPU usage would spike from time to time, usually when there was parallel processing.
- Wait stats - The biggest culprit was the CXPACKET wait. That was how I was able to figure out that there was a parallelism problem.
While I was monitoring the report server, I read a white paper on the resource governor. That gave me an idea. I knew that just resource governor was not going to solve this issue, and I also knew that our IO issue was not going to be solved by the resource governor. Despite knowing these facts, I decided to test my idea.
I set up a test environment with half the horse power of our live environment. The drives were set up properly for the best IO throughput, and replication was set up to mimic production.
The first step I took was to set up row versioning for the main database. Everything seemed OK except that we started seeing heavy usage of the tempdb database. I altered tempdb and added multiple data files (one file for each CPU core on the server), which helped quite a bit.
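As a rough sketch of those two changes (the database name and file path here are placeholders, not the ones from our environment):

```sql
-- Enable read-committed snapshot (row versioning) on the main database.
-- This requires no other active connections to the database while it runs.
ALTER DATABASE MainDB SET READ_COMMITTED_SNAPSHOT ON;

-- Add an extra tempdb data file; repeat once per CPU core,
-- adjusting the logical name and physical path each time.
ALTER DATABASE tempdb
ADD FILE (NAME = tempdev2,
          FILENAME = 'T:\TempDB\tempdev2.ndf',
          SIZE = 4096MB,
          FILEGROWTH = 512MB);
```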
After that I asked the business what would be the performance degradation that they can handle in the case of having to run the reports on our main database. The business came back to me saying that they would not mind if the reports run by CS are slow, but they cannot let the clients notice a drastic slowdown in their regular usage. This gave me a good feeling that I probably would be able to make our system work with resource governor.
One of the basic things that has to be done when setting up resource governor is to ensure that the DAC is set up. I made sure to set it up as I did not want to get locked out of the main database. Since we had two different Windows authenticated users, it made my job easier to set resource governor up. When there are only two conditions that the classifier function has to check, it keeps it simple and efficient.
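Enabling the DAC is a one-time configuration change. A sketch of what that looks like:

```sql
-- Allow remote dedicated administrator connections (DAC),
-- so an administrator can always connect even if the classifier misbehaves.
EXEC sp_configure 'remote admin connections', 1;
RECONFIGURE;
```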
My priority was to make sure that there was no drastic decrease in performance for our online clients. Therefore I knew that I had to give at least 75% of all resources on the server to the clients that connect through our web site. As you all might know, the resource governor has two built-in groups named internal and default. The internal group cannot be altered since it is used by SQL Server itself. I planned to use the default group for the reports run by Customer Service and a new group called WebGroup for the clients using the website. I altered the default pool to use a maximum of 40% of total memory and 25% of CPU, using the command below.
ALTER RESOURCE POOL [default] WITH
(
    MAX_MEMORY_PERCENT = 40,
    MAX_CPU_PERCENT = 25
);
I also wanted to make sure that there were no parallelism issues, so I set MAX_DOP to 1. This means that no query run in this group will use parallelism. Another issue you might see on a report server is runaway queries. To mitigate that, we can cap the CPU time a request may consume in the group (note that, by default, exceeding REQUEST_MAX_CPU_TIME_SEC generates an event rather than cancelling the query). For this I used the command below.
ALTER WORKLOAD GROUP [default] WITH
(
    MAX_DOP = 1,
    REQUEST_MAX_CPU_TIME_SEC = 300
);
As you know, you can create up to eighteen user-defined pools. In my case I did not need eighteen; I needed only one additional pool, to take care of web requests, and so higher resources had to be assigned to it.
CREATE RESOURCE POOL WebPool WITH
(
    MIN_CPU_PERCENT = 75, MAX_CPU_PERCENT = 100,
    MIN_MEMORY_PERCENT = 60, MAX_MEMORY_PERCENT = 100
);
As you can see, the rest of the available resources are given over to the new pool created just for the web requests from the clients. Giving MIN_CPU_PERCENT a value of 75 means that whenever there is contention for CPU, at least 75% of CPU time is guaranteed to this pool. This ensures that requests coming from the web will not be starved of CPU time.
Our publisher database that caters to the requests from the web has 128 GB of RAM. We had to make sure that query requests from the web do not cause excessive paging due to insufficient memory, so it was decided to give a minimum of roughly 75 GB of RAM (about 60% of 128 GB) to this pool.
The newly created pool has to be assigned to the group WebGroup.
CREATE WORKLOAD GROUP WebGroup
USING WebPool ;
At this point, the initial setup is complete. Next is the most important part. When using the resource governor, the main potential bottleneck is the classifier function, which is called every time a login is attempted. While writing it, try to follow the KISS (Keep It Simple, Stupid) principle. Below is the code I used to create the classifier function.
CREATE FUNCTION dbo.Class_funct()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @workload_group sysname;
    IF (UPPER(SUSER_NAME()) = UPPER('Domain\WebUser'))
        SET @workload_group = 'WebGroup';
    RETURN @workload_group;
END;
As you can see from the above code, it is a very simple function containing just an IF statement. If the function cannot classify an incoming request, it returns NULL and that request is assigned to the default group.
We have created this function in the Master database, but it is still not a classifier function. For this we have to run an Alter statement.
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.Class_funct);
ALTER RESOURCE GOVERNOR RECONFIGURE;
Now the setup is complete: the resource governor is enabled with the classifier function. As you can see, I kept the classifier function very simple. This matters because the function fires on every login attempt to the SQL Server, so a complex classifier slows down every login. You can use lookup tables inside the classifier, but this could become a bottleneck in your application.
Just because we have implemented this setup does not mean that it is working as expected; constant monitoring is needed. The resource governor does not give exact figures for the resources used by each pool, but we can get a good estimate of how resources were allocated. The query below helps obtain that estimate.
SELECT
    rpool.name AS PoolName,
    COALESCE(SUM(rgroup.total_request_count), 0) AS TotalRequest,
    COALESCE(SUM(rgroup.total_cpu_usage_ms), 0) AS TotalCPUinMS,
    CASE
        WHEN SUM(rgroup.total_request_count) > 0
            THEN SUM(rgroup.total_cpu_usage_ms) / SUM(rgroup.total_request_count)
        ELSE 0
    END AS AvgCPUinMS
FROM sys.dm_resource_governor_resource_pools AS rpool
LEFT OUTER JOIN
    sys.dm_resource_governor_workload_groups AS rgroup
    ON rpool.pool_id = rgroup.pool_id
GROUP BY rpool.name;
Here is a sample output from that query.
The WebPool in the above example catered to 518,014 requests at an average of 1 ms of CPU time each. In the case of the default pool, even though the number of requests was low, the average CPU time used is very high. This is because the queries running in this pool come from the report users, and report queries are usually long running since they retrieve historical data. Interestingly, the total request count of the internal pool is 0. The reason is that internal requests skip the classifier function and go directly to the pool; only their resource usage can be measured.
There is one problem with relying on DMVs: when the server is restarted, you lose the data. If you want to keep historical trends of the usage, make sure you regularly dump the data from the DMVs into a table. That way you can go back in time and check the details. Currently I am designing and developing a good way to store this historical data. Once you have history that you can group by date, you can use it to see trends in resource usage and compare the values from before the report server was switched to the publisher with the current values. This should give you a rough idea of how resources are being allocated to the different groups.
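A minimal sketch of such a snapshot table and the insert you would schedule (for example from a SQL Agent job); the table name and columns here are my own illustration, not a standard:

```sql
-- Hypothetical history table for pool usage snapshots.
CREATE TABLE dbo.PoolUsageHistory
(
    CaptureTime   DATETIME NOT NULL DEFAULT GETDATE(),
    PoolName      SYSNAME  NOT NULL,
    TotalRequests BIGINT   NOT NULL,
    TotalCPUinMS  BIGINT   NOT NULL
);

-- Snapshot the cumulative DMV counters; run on a regular schedule.
INSERT INTO dbo.PoolUsageHistory (PoolName, TotalRequests, TotalCPUinMS)
SELECT rpool.name,
       COALESCE(SUM(rgroup.total_request_count), 0),
       COALESCE(SUM(rgroup.total_cpu_usage_ms), 0)
FROM sys.dm_resource_governor_resource_pools AS rpool
LEFT OUTER JOIN sys.dm_resource_governor_workload_groups AS rgroup
    ON rpool.pool_id = rgroup.pool_id
GROUP BY rpool.name;
```

Since the DMV counters are cumulative since the last restart, trend reports would compare consecutive snapshots rather than raw values.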
In this particular scenario, the resource governor seems to work. By using it, it was possible to have the internal users and web clients share the data on the same server, but this does not mean it will work for all scenarios. Before you decide which technology to use, study the options and then choose the right method.
I would like to thank Jason Brimhall for his valuable tips and his review. A special thanks to Steve Jones for being so patient with me and doing the final edit and publishing this article.