Doubled the size of my SQL Server and now getting bad performance

  • We recently doubled the size of our SQL Server from two 4-way processors and 32GB of RAM to two 8-way processors and 128GB of RAM. We are running Windows Server SP2 Enterprise 64-Bit and SQL 2005 SP3 Enterprise 64-Bit. SQL has a max memory setting of 118GB, and SQL is the only thing running on the box.

    Over a 2-3 day period, our wait times slowly grow from around 3,000 seconds per hour to around 4,500. On the third day, our wait times skyrocket to 30,000 over a two-hour period, and SQL is very slow to respond. Our CPU remains steady at around 30%, there is no memory paging, and the disk I/O is fine. Ignite is saying all of our wait times are Memory/CPU related. We are trying to find the problem, but none of our monitoring tools point to an issue, except that our Memory/CPU wait times are through the roof.

    Dropping the Max Server Memory setting by just 1GB remedies the problem, and then the cycle starts again (see the sp_configure sketch at the end of this post). This led us to believe there is a memory problem, but giving the OS 10GB of memory should be more than sufficient.

    Any ideas on what to monitor or thoughts on what the problem might be would be greatly appreciated.
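
    For reference, a minimal sketch of checking and cycling the setting via sp_configure; the 120832 below is just 118GB expressed in MB and is purely illustrative:

    -- make the advanced options visible
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    -- check the current value
    EXEC sp_configure 'max server memory (MB)';

    -- set max server memory to 118GB (118 * 1024 = 120832 MB); adjust as needed
    EXEC sp_configure 'max server memory (MB)', 120832;
    RECONFIGURE;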

  • What does the Available Memory perfmon counter say? How much memory is SQL actually using?

    A max memory setting of 118GB is the buffer pool only; other SQL caches will add to that. Try decreasing max memory by a reasonable amount (say, down to 110GB).
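
    One way to see where memory beyond the buffer pool is going is to total the memory clerks; a rough sketch using the SQL 2005-era DMV columns:

    -- top memory consumers by clerk type, in MB
    SELECT TOP 10
        [type],
        SUM(single_pages_kb + multi_pages_kb
            + virtual_memory_committed_kb + awe_allocated_kb) / 1024 AS total_mb
    FROM sys.dm_os_memory_clerks
    GROUP BY [type]
    ORDER BY total_mb DESC;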


  • Are you seeing any CXPacket waits? I'm wondering whether the upgrade has caused more parallel processing and whether that is causing the issues.
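
    A quick sketch (not a definitive check) for spotting requests that are currently running with more than one worker task, i.e. have gone parallel:

    -- requests currently using multiple worker tasks (a sign of a parallel plan)
    SELECT r.session_id,
           r.wait_type,
           COUNT(*) AS task_count
    FROM sys.dm_exec_requests AS r
    INNER JOIN sys.dm_os_tasks AS t
        ON  t.session_id = r.session_id
        AND t.request_id = r.request_id
    WHERE r.session_id > 50            -- skip system sessions
    GROUP BY r.session_id, r.wait_type
    HAVING COUNT(*) > 1
    ORDER BY task_count DESC;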

  • I am not seeing an abnormally high number of CXPacket waits.

  • I am showing the available memory as 7.5GB right now. We reset the Max Memory about 36 hours ago.

  • My knowledge of wait types is limited. Are CXPacket waits the only way to see problems with parallelism? Is Ignite possibly bundling up different wait types under "Memory/CPU" that may be concealing the issue?

  • What's the exact version?

    What wait types are you seeing?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • post the results of this query:

    SELECT TOP 10
        [Wait type] = wait_type,
        [Wait time (s)] = wait_time_ms / 1000,
        [% waiting] = CONVERT(DECIMAL(12,2), wait_time_ms * 100.0
                      / SUM(wait_time_ms) OVER())
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT LIKE '%SLEEP%'
    ORDER BY wait_time_ms DESC;


  • The exact version is 9.00.4340.00.

    During the peak of the problem, we are seeing that 95% of the waits are of type "Memory/CPU", followed by Writelog, PageLatch_SH, CXPacket, Latch_EX, ...

    This is what I am getting from Ignite. Again, I am not sure if Ignite is bundling up several wait types into this category of Memory/CPU.

    Since the "solution" is to drop the Max Memory setting, it seems to be a memory issue, but why aren't we paging?
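
    One more data point that may help rule buffer-pool pressure in or out: Page Life Expectancy, queried from inside SQL rather than perfmon (a minimal sketch; the object_name prefix differs for named instances):

    -- a steadily falling PLE suggests buffer pool memory pressure
    SELECT [object_name], cntr_value AS page_life_expectancy_sec
    FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Page life expectancy'
      AND [object_name] LIKE '%Buffer Manager%';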

  • Wait type                 Wait time (s)  % waiting
    CXPACKET                  491252         22.81
    SQLTRACE_BUFFER_FLUSH     306315         14.22
    BROKER_TASK_STOP          194705          9.04
    SOS_SCHEDULER_YIELD       178382          8.28
    LATCH_EX                  166651          7.74
    PAGEIOLATCH_SH            164422          7.63
    WRITELOG                  141201          6.56
    BROKER_RECEIVE_WAITFOR    118139          5.49
    BACKUPIO                   99874          4.64
    BACKUPBUFFER               98525          4.57

    SQLTrace and BrokerTask are system waits; they should be ignored.

    That high latch wait worries me. Can you query sys.dm_os_latch_stats and see what the top latch wait types are? (A query sketch follows at the end of this post.) Other than that, SOS_Scheduler_yield implies your queries are inefficient or your server does not have enough CPU. Then PageIOLatch, WriteLog, BackupIO and BackupBuffer are mostly disk-related; the first two are an indication that the IO subsystem may be underpowered.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
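
    A minimal sketch of the latch query being asked for, mirroring the wait-stats query above:

    SELECT TOP 10
        latch_class,
        waiting_requests_count,
        wait_time_ms,
        max_wait_time_ms
    FROM sys.dm_os_latch_stats
    ORDER BY wait_time_ms DESC;
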
  • I'll defer to Gail on this, but CXPacket is the top wait type, so I'd definitely look at what is getting parallelized and perhaps modify the 'cost threshold for parallelism' setting. You can query sys.dm_exec_cached_plans and sys.dm_exec_query_plan() and search the XML plans to find the queries that are being parallelized.
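
    A rough sketch of that cached-plan search; casting the showplan XML to nvarchar and looking for Parallel="1" is crude, and scanning the whole plan cache can be slow, but it works on 2005:

    -- find cached plans that contain a parallel operator
    SELECT TOP 20
        cp.usecounts,
        cp.objtype,
        st.[text] AS query_text,
        qp.query_plan
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
    WHERE CAST(qp.query_plan AS nvarchar(max)) LIKE '%Parallel="1"%'
    ORDER BY cp.usecounts DESC;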

  • We just doubled the number of CPUs, going from two Quad-Cores to two 8-Cores, and the queries haven't changed.

    If parallelism is the problem, is it best to set the MAXDOP on the server or tweak individual queries? Right now it is set to 0 at the instance level. (A configuration sketch for both options follows at the end of this post.)

    Everything I have read about low CPU % but high CPU wait time says parallelism is the problem; however, the number of CXPacket waits is extremely low compared to the Memory/CPU waits. Maybe I am just relying on Ignite too much.

    Here are the results of sys.dm_os_latch_stats ordered by wait_time_ms.

    latch_class  waiting_requests_count  wait_time_ms  max_wait_time_ms
    BUFFER  1982344202184392033406
    ACCESS_METHODS_SCAN_RANGE_GENERATOR  245483861620638432688
    NESTING_TRANSACTION_READONLY  44454274179456218
    ACCESS_METHODS_DATASET_PARENT  46034711492844360
    ACCESS_METHODS_HOBT_VIRTUAL_ROOT  11995498567297
    NESTING_TRANSACTION_FULL  845905183824125
    ACCESS_METHODS_ACCESSOR_CACHE  9712997324203
    LOG_MANAGER  1607230469
    ALLOC_IAM_PAGE_RANGE_CACHE  31219782
    ALLOC_FREESPACE_CACHE  10458107931
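
    On the instance-versus-query question above, a minimal sketch of both options; the numbers are illustrative only, and dbo.SomeLargeTable is a hypothetical name:

    -- instance-wide: cap parallelism and/or raise the cost bar for parallel plans
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 8;
    EXEC sp_configure 'cost threshold for parallelism', 25;
    RECONFIGURE;

    -- per query: hint a single known offender instead
    SELECT COUNT(*)
    FROM dbo.SomeLargeTable
    OPTION (MAXDOP 1);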

    Rather than knee-jerk reducing MAXDOP, I'd look at the queries running in parallel, see what else they're waiting for, and see if they can be optimised. (A live-waits query sketch follows at the end of this post.)

    I don't know what Ignite does, but Memory/CPU is not a wait type. It's probably a collection, and without knowing what's in that collection it's hard to say anything. Most people classify CXPacket as a CPU wait (it's not memory and it's not IO).

    The last time I saw performance degrade after a CPU upgrade, it was because the queries were written terribly, got awful execution plans, and were parallelised both inappropriately and with heavy skew.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
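
    A sketch of one way to watch, live, what the parallel branches are actually waiting on while a problem query runs:

    -- current waits per task for user sessions
    SELECT r.session_id,
           wt.exec_context_id,
           wt.wait_type,
           wt.wait_duration_ms,
           wt.resource_description
    FROM sys.dm_os_waiting_tasks AS wt
    INNER JOIN sys.dm_exec_requests AS r
        ON r.session_id = wt.session_id
    WHERE r.session_id > 50            -- skip system sessions
    ORDER BY r.session_id, wt.wait_duration_ms DESC;
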
  • I just got word from the ultimate authority on all things storage engine: the combination of CXPacket and SOS_Scheduler_Yield, along with the Access Methods latch types, indicates you're doing a ton of concurrent parallel table scans. Check indexing and check that the queries are SARGable (can use index seeks); a brief example follows at the end of this post.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
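
    To illustrate the SARGability point with a made-up example (table and column names are hypothetical): wrapping the column in a function hides it from an index seek, while comparing the bare column to a range lets an index on OrderDate be used.

    -- non-SARGable: the function on the column forces a scan
    SELECT OrderID
    FROM dbo.Orders
    WHERE YEAR(OrderDate) = 2011;

    -- SARGable rewrite: same logic, can seek on an OrderDate index
    SELECT OrderID
    FROM dbo.Orders
    WHERE OrderDate >= '20110101'
      AND OrderDate <  '20120101';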
