Rob,
To summarize your system here is what I have gathered so far:
RamRaid QuantityBreak down
Lab32GBRaid 1-04OS,Trans,DWH,Logs
Prod24GBRaid 1-04OS,Trans,DWH,Logs
Things we don’t know
•Is it on a SAN
•Processor Types and Cores Identical?
•Maximum Memory
•Minimum Memory
•Max DOP setting on each system
What I understand of your problem is this:
In Lab you can run four simultaneous processes and the event translation takes around 50-60ms. In production there are 8 simultaneous processes which take nearly 10x the time (550ms to 650ms). There also appears to be a generally high CPU utilization of 50% in production; backups are also causing contention with I/O when they kick off.
Based on that:
SAN:
I’m asking about your SAN because if indeed you are on a SAN your I/O stats could be lying to you. I believe SQLIO is based off WMI events and Perfmon counters. The metrics you have may or may not be accurate.
Question:
Are these disks on a SAN or no?
RAM:
The ram is different from production to LAB; in this case your lab has more. If your production server comes under pressure from too little RAM it can start paging out and causing I/O issues, CPU issues etc. So for instance with 24 GB in place maybe you have 20GB as your max and 8GB as your minimum setting. This would generally be ok provided that you don’t have other external programs running on this server which operate out of SQL Servers reserved memory pool. This would include items from within SQL such as CLR’s and Linked server calls.
When SQL comes under pressure it will try flipping out what’s in memory but only from its reserved pool; any other memory being freed relies on the OS to release the memory. Again this will include linked server calls and CLR’s etc. In earlier editions of SQL; such as 2005 I sometimes have found it necessary to lock MIN and MAX memory in place with the same values as it couldn’t release the RAM in a timely fashion; but that is neither here nor there.
Question:
What are you minimum and maximum memory settings on your production server? Are there any other external programs running on the server? Are you using CLR’s extensively?
Settings can be gathered via the GUI or via sp_configure.
--Allow all options to be visable
exec sp_configure 'show advanced options',1;
GO
--Execute this now as opposed to the next restart of the server
Reconfigure with override;
go
--View all the settings
exec sp_configure;
go
MaxDOP:
Depending on how many cores you have and if this process is going parallel this setting matters a great deal. If you have a MAXDOP of 0 and this is going parallel then even a single processing event could be using all the cores on the system.
Question:
How many cores are on your system and what is the Maximum Degree of parallelism that is currently set?
See sp_configure above.
--Taken from http://msdn.microsoft.com/en-us/magazine/cc135978.aspx
SELECT TOP 10
[Wait type] = wait_type,
[Wait time (s)] = wait_time_ms / 1000,
[% waiting] = CONVERT(DECIMAL(12,2), wait_time_ms * 100.0
/ SUM(wait_time_ms) OVER())
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE '%SLEEP%'
ORDER BY wait_time_ms DESC;
*Apologies I forget where I got this from; if someone knows the author or URL I’ll add it
Clustered Indexes:
If you’re missing clustered indexes / Primary Keys it can cause your CPU to go nuts as SQL tries to satisfy a query across a heap. It’s worth taking a looks at the db’s in question as well as any others on the system to see if this is the case.
--Missing Primary Keys:
SELECT
t.TABLE_CATALOG
,t.TABLE_SCHEMA
,t.TABLE_NAME
,t.TABLE_TYPE
FROM
INFORMATION_SCHEMA.TABLES t
LEFT JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS pk
ON t.TABLE_CATALOG = pk.TABLE_CATALOG
AND t.TABLE_NAME = pk.TABLE_NAME
AND t.TABLE_SCHEMA = pk.TABLE_SCHEMA
AND pk.CONSTRAINT_TYPE = 'PRIMARY KEY'
WHERE
pk.TABLE_NAME IS NULL
AND t.TABLE_TYPE='BASE TABLE'
ORDER BY
t.TABLE_CATALOG
,t.TABLE_SCHEMA
,t.TABLE_NAME
--Missing Clustered Indexes
SELECT name
FROM sys.objects
WHERE type = 'U'
AND object_id NOT IN
(SELECT object_id FROM sys.indexes WHERE index_id = 1)
*Apologies I forget where I got this from; if someone knows the author or URL I’ll add it
Missing indexes:
This can also be a point of contention as your using statics to satisfy queries. It might be worth looking into it to see if you can optimize those tables and queries causing high CPU.
--Count of missing indexes
SELECT
DatabaseName = DB_NAME(database_id)
, count(*) as [Number Indexes Missing]
FROM sys.dm_db_missing_index_details
GROUP BY DB_NAME(database_id)
ORDER BY 2 DESC;
--Top ten missing indexes based on usage since last restart
SELECT TOP 10
[Total Cost] = ROUND(avg_total_user_cost * avg_user_impact * (user_seeks + user_scans),0)
, avg_user_impact
, TableName = statement
, [EqualityUsage] = equality_columns
, [InequalityUsage] = inequality_columns
, [Include Cloumns] = included_columns
FROM sys.dm_db_missing_index_groups g
INNER JOIN sys.dm_db_missing_index_group_stats s
ON s.group_handle = g.index_group_handle
INNER JOIN sys.dm_db_missing_index_details d
ON d.index_handle = g.index_handle
ORDER BY [Total Cost] DESC;
*Apologies I forget where I got this from; if someone knows the author or URL I’ll add it
If all this doesn’t point you in the right direction or give you a better idea of what could be going on you have another couple of options that are more time consuming; which is why I left them at the end.
Install the trial version of SQLSentry or SQLMonitor ; monitor your lab environment and production and see what is going on. Each of these products will provide metrics and a handy GUI for you to help better analyze your environment.
Finally and least appealing to me (because it’s the most time consuming) would be to install SQL 2012 somewhere and hook up the Distributed Reply feature; this would allow you to capture and then reply what is going on in your production environment directly in your lab. I hope this helps.
Teal Canady
Sr. DBA
Lieberman Research Worldwide
*Please excuse any spelling or grammatical mistakes; I'm just a DBA: )