SQL IO Performance
Posted Wednesday, November 14, 2012 8:25 PM
Grasshopper

OK... so here is some new information:

I moved TempDB to a separate eSATA drive on the server. My initial thought was that things seemed a little better, but five hours later, if anything, the system is running slower than it was. Jobs that were running in 2 1/2 hours are now taking 2 3/4 hours.

I have some more specific statistics to share.

For SQLIO results, the production server is running 881.66 IOPS at 6.88 MB/s with 8 KB blocks over a 5-minute run. The lab server runs 678.28 IOPS at 5.29 MB/s.
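(For reference, a representative SQLIO command line for this kind of run; the parameters shown are illustrative, not necessarily my exact test: -kR for reads, -s300 for a 5-minute duration, -frandom for random I/O, -o8 for outstanding I/Os, -b8 for 8 KB blocks, -LS to capture latencies.)

sqlio -kR -s300 -frandom -o8 -b8 -LS -Fparam.txt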

In Performance Monitor, I'm seeing the following (Avg. Disk sec/Read, sec/Write, sec/Transfer):

Volume      Read    Write   Transfer
Primary     0.017   0.028   0.202
EventData   0.017   0.028   0.028
Log         0.000   0.003   0.003
TempDB      0.000   0.004   0.004


I just observed an event that has given me some pause; maybe this will change the nature of the conversation. What I've been focusing on is a procedure that fires off and moves data from a raw storage table to a normalized data warehouse table. This requires taking data from remote clients, looking up key values, and doing data conversion and other actions such as detecting new data and creating keys. On the lab server I can run four instances of this job, offset by 5 minutes each, and together they will clear 180,000 records an hour. The production server, not so much...

However, I happened to be watching the server at the top of the hour, when an external agent fires off a sync command to the remote clients and they begin dumping their records into the raw storage table. In this sync process, eight remote clients at a time dump their records in batches of 100 records each. What caught my attention is that during this phase, the database IO rate shot up to over 100 MB/sec, and the processor, which had been pegged at 50% utilization, was suddenly peaking at 100%. After the sync process completed, the IO went back down to 0.3 and the processor dropped back to 50%.

Could there be some odd result because this process is fired from a job? Of course, I'm just thinking out loud now... working my way through the system a piece at a time.

Oh, and to address an earlier question about the structure: the production server was just deployed with fresh scripts from the latest working lab deployment we have, so I'm confident the procedure logic is identical to what is working extremely well on the lab server.
Post #1384956
Posted Thursday, November 15, 2012 12:05 AM
SSC Eights!

This is probably way off base... but is there something external to the database getting in the way? For example, an over-enthusiastic virus checker.
Post #1384988
Posted Thursday, November 15, 2012 8:04 AM


SSCarpal Tunnel

You could hunt-and-peck on this forum for days or weeks for a problem such as this (and I have seen that happen any number of times) and never find a true cause/resolution. Or you could hire a qualified professional to come on board for a few hours or a day or two and really get things nailed down (while simultaneously mentoring you on their methodology). Seems like a no-brainer to me...

Best,

Kevin G. Boles
SQL Server Consultant
SQL MVP 2007-2012
TheSQLGuru at GMail
Post #1385176
Posted Thursday, November 15, 2012 8:26 AM
SSC-Enthusiastic

Are you using SSIS packages in your jobs? If so, make sure the server does not have resource contention between the SSIS and SQL Server services.

I believe someone has mentioned this already, but have you checked the Max Memory and MAXDOP settings? Make sure Max Memory is set to something below the total server RAM so the OS has headroom.
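For example, a minimal sketch of capping max server memory (the 20 GB figure is purely illustrative; pick a value that leaves room for your OS and other services):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
--20480 MB = 20 GB; an illustrative cap for a 24 GB server
EXEC sp_configure 'max server memory (MB)', 20480;
RECONFIGURE;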
Post #1385197
Posted Thursday, November 15, 2012 12:25 PM
Forum Newbie

Rob,

To summarize your system, here is what I have gathered so far:

        RAM     RAID       Quantity   Breakdown
Lab     32 GB   RAID 1+0   4          OS, Trans, DWH, Logs
Prod    24 GB   RAID 1+0   4          OS, Trans, DWH, Logs

Things we don't know:
• Is it on a SAN?
• Are the processor types and core counts identical?
• Maximum memory setting
• Minimum memory setting
• MAXDOP setting on each system

What I understand of your problem is this:
In the lab you can run four simultaneous processes, and the event translation takes around 50-60 ms. In production there are eight simultaneous processes which take nearly 10x the time (550 ms to 650 ms). There also appears to be generally high CPU utilization of 50% in production, and backups are causing I/O contention when they kick off.

Based on that:
SAN:
I'm asking about your SAN because if you are indeed on a SAN, your I/O stats could be lying to you. I believe SQLIO is based off WMI events and Perfmon counters, so the metrics you have may or may not be accurate.
Question:
Are these disks on a SAN or no?
RAM:
The RAM differs between production and lab; in this case your lab has more. If your production server comes under pressure from too little RAM, it can start paging out, causing I/O issues, CPU issues, etc. So, for instance, with 24 GB in place maybe you have 20 GB as your max and 8 GB as your minimum setting. This would generally be OK provided you don't have other external programs running on the server that operate outside SQL Server's reserved memory pool. This also includes items from within SQL Server such as CLR assemblies and linked server calls.

When SQL Server comes under pressure it will try flushing out what's in memory, but only from its reserved pool; freeing any other memory relies on the OS to release it. Again, this includes linked server calls, CLR assemblies, etc. In earlier versions, such as SQL Server 2005, I have sometimes found it necessary to lock MIN and MAX memory in place with the same values because the server couldn't release the RAM in a timely fashion; but that is neither here nor there.

Question:
What are your minimum and maximum memory settings on your production server? Are there any other external programs running on the server? Are you using CLR assemblies extensively?
Settings can be gathered via the GUI or via sp_configure.
--Allow all options to be visible
EXEC sp_configure 'show advanced options', 1;
GO
--Apply the change now rather than at the next restart of the server
RECONFIGURE WITH OVERRIDE;
GO
--View all the settings
EXEC sp_configure;
GO


MaxDOP:
Depending on how many cores you have, and whether this process goes parallel, this setting matters a great deal. With a MAXDOP of 0, even a single processing event that goes parallel could be using all the cores on the system.
Question:
How many cores are on your system, and what is the maximum degree of parallelism currently set? See sp_configure above.
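As a quick sketch, the core count and current MAXDOP value can also be pulled directly (the sp_configure call assumes 'show advanced options' is already enabled, as above):

--Logical CPUs visible to SQL Server
SELECT cpu_count FROM sys.dm_os_sys_info;
--Current MAXDOP value
EXEC sp_configure 'max degree of parallelism';

The wait stats below will also hint at whether parallelism or I/O is the bottleneck (look for CXPACKET versus PAGEIOLATCH waits).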
--Taken from http://msdn.microsoft.com/en-us/magazine/cc135978.aspx
SELECT TOP 10
    [Wait type] = wait_type,
    [Wait time (s)] = wait_time_ms / 1000,
    [% waiting] = CONVERT(DECIMAL(12,2),
        wait_time_ms * 100.0 / SUM(wait_time_ms) OVER())
FROM sys.dm_os_wait_stats
WHERE wait_type NOT LIKE '%SLEEP%'
ORDER BY wait_time_ms DESC;



Clustered Indexes:
If you're missing clustered indexes / primary keys, it can cause your CPU to go nuts as SQL Server tries to satisfy a query across a heap. It's worth taking a look at the DBs in question, as well as any others on the system, to see if this is the case.

--Missing Primary Keys:
SELECT
t.TABLE_CATALOG
,t.TABLE_SCHEMA
,t.TABLE_NAME
,t.TABLE_TYPE
FROM
INFORMATION_SCHEMA.TABLES t
LEFT JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS pk
ON t.TABLE_CATALOG = pk.TABLE_CATALOG
AND t.TABLE_NAME = pk.TABLE_NAME
AND t.TABLE_SCHEMA = pk.TABLE_SCHEMA
AND pk.CONSTRAINT_TYPE = 'PRIMARY KEY'
WHERE
pk.TABLE_NAME IS NULL
AND t.TABLE_TYPE='BASE TABLE'
ORDER BY
t.TABLE_CATALOG
,t.TABLE_SCHEMA
,t.TABLE_NAME

--Missing Clustered Indexes
SELECT name
FROM sys.objects
WHERE type = 'U'
AND object_id NOT IN
(SELECT object_id FROM sys.indexes WHERE index_id = 1)

*Apologies I forget where I got this from; if someone knows the author or URL I’ll add it

Missing indexes:
This can also be a point of contention, as you're scanning tables to satisfy queries. It might be worth looking into whether you can optimize the tables and queries causing high CPU.

--Count of missing indexes
SELECT
DatabaseName = DB_NAME(database_id)
, count(*) as [Number Indexes Missing]
FROM sys.dm_db_missing_index_details
GROUP BY DB_NAME(database_id)
ORDER BY 2 DESC;

--Top ten missing indexes based on usage since last restart
SELECT TOP 10
[Total Cost] = ROUND(avg_total_user_cost * avg_user_impact * (user_seeks + user_scans),0)
, avg_user_impact
, TableName = statement
, [EqualityUsage] = equality_columns
, [InequalityUsage] = inequality_columns
, [Include Columns] = included_columns
FROM sys.dm_db_missing_index_groups g
INNER JOIN sys.dm_db_missing_index_group_stats s
ON s.group_handle = g.index_group_handle
INNER JOIN sys.dm_db_missing_index_details d
ON d.index_handle = g.index_handle
ORDER BY [Total Cost] DESC;

*Apologies I forget where I got this from; if someone knows the author or URL I’ll add it

If all this doesn't point you in the right direction or give you a better idea of what could be going on, you have another couple of options that are more time consuming, which is why I left them for the end.
Install the trial version of SQL Sentry or SQL Monitor; monitor your lab environment and production and see what is going on. Each of these products provides metrics and a handy GUI to help you better analyze your environment.

Finally, and least appealing to me (because it's the most time consuming), would be to install SQL Server 2012 somewhere and hook up the Distributed Replay feature; this would allow you to capture and then replay what is going on in your production environment directly in your lab. I hope this helps.

Teal Canady
Sr. DBA
Lieberman Research Worldwide
*Please excuse any spelling or grammatical mistakes; I'm just a DBA: )
Post #1385304
Posted Friday, November 16, 2012 12:23 AM
Grasshopper

Teal,

Thank you for the long response. I've reviewed a number of the points you brought up again and have branched out into a few other areas as well. I'll summarize what I've learned and where my study of this problem is heading.

1. Some details missing:

None of the systems are on SANs. These are all local SCSI-based RAID controllers with 4 or 5 volumes, and all volumes are mirrors. Because of customer need, we've had to revert the production server to the previous version of the software. I brought the production database back in-house to run on a second lab server set up with hardware identical to the first lab server. We had originally configured the second server to run the original version of the software in parallel with the new version.

I secretly hoped there was just a difference between the production and lab environments, but the restored database performs at or below the production rates on the second lab server.

The lab servers both have 32 GB of RAM, and SQL Server has a min/max memory setting of 0/30270 MB. MAXDOP is 0 across 8 cores. The original, well-performing database is now over 171 GB in size, and the other database is currently around 8 GB.

These are dedicated SQL Servers with no other network or service responsibilities. We run the desktop application occasionally for development purposes. There is a management agent service as part of the application, but its activity has not seemed to have an impact on the first lab server, and it is currently turned off on the second lab server.

2. I did some more internet trawling and decided to check out sys.dm_io_virtual_file_stats. The article I read gave a formula for write latency: io_stall_write_ms divided by num_of_writes. The latency on the well-performing server is 2 ms (Log), 192 ms (Primary), and 18 ms (EventData). The equivalent latencies on the other server are 2, 43, and 39. So again, the results seem to be reversed: the machine with the problem overall seems better off than the one I'd like to match performance-wise.
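For anyone following along, here is a sketch of that calculation against the DMV (per database file; NULLIF just guards against dividing by zero):

SELECT DB_NAME(vfs.database_id) AS database_name,
    mf.name AS file_name,
    vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id
    AND vfs.file_id = mf.file_id
ORDER BY avg_write_latency_ms DESC;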

3. My research has turned in another direction. I ran a sample of the normalization procedure I'm currently concerned about on the production server and the lab server with the Show Execution Plan option turned on, and discovered that the optimizer creates very different plans on the two machines. When I deployed to the second server, I ran another sample and got the same query plan as production. I've begun looking at the schema again, even though the production database was generated from scripts made from the lab server. The execution plan does not indicate a missing index that could improve performance, but the server with the problem shows an extra loop join at critical points in the procedure. I'm going through the database structure again, trying to understand why two different query plans would be generated.
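As an aside, the cached plan and per-execution timings can be pulled on each server with something like this (the LIKE filter is a placeholder, not my real procedure name):

SELECT TOP 5
    qs.total_elapsed_time / qs.execution_count AS avg_elapsed_microseconds,
    qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE st.text LIKE '%NormalizeProcedure%' --placeholder name
ORDER BY avg_elapsed_microseconds DESC;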

I did take your advice and run the wait stats query you provided. The first lab server has some much larger wait time numbers because it has been online longer, but what I noted most of all is that the top 10 on the well-performing server were all system background waits like Log Manager, Deadlock Search, Broker, Checkpoint, Dispatcher, etc. On the second lab server, WRITELOG, PAGEIOLATCH_SH, and PAGEIOLATCH_EX all made it into the top 10. Each had a total wait time under 1,000, but they still ranked in the top 10 while none appeared in the top 10 on the other server.

Well, thank you again for the thoughtful response. I'm still diving deep.
Post #1385501
Posted Sunday, November 18, 2012 5:10 PM
Old Hand

If you are seeing different plans, then I'd concentrate on that.

Any chance you can post both the good and bad versions of the plans?
Post #1386112
Posted Monday, November 19, 2012 12:04 PM
Forum Newbie

Rob,

The difference in query plans could be twofold. After the system parses a query for syntax, it binds the objects and creates a logical query plan based on statistics; after that it costs candidate plans based on those statistics to find the right one. If the statistics are out of date, the optimizer will change the logical plan and, as a result, the physical plan. Another factor is the difference in physical database size between your environments.

If you have not already done so, I would run sp_updatestats on the databases in question and possibly run ALTER INDEX ALL ON <your table> REORGANIZE. If you have a second lab now where you can reproduce this, I'd start there with the queries above. If that does not resolve your issue, move on to ALTER INDEX ALL ON <your table> REBUILD.
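A minimal sketch of those commands (dbo.YourTable stands in for the real table):

--Refresh statistics across the current database
EXEC sp_updatestats;
--Reorganize all indexes on one table (lighter weight)
ALTER INDEX ALL ON dbo.YourTable REORGANIZE;
--If that doesn't resolve it, rebuild (more thorough, more resource intensive)
ALTER INDEX ALL ON dbo.YourTable REBUILD;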

Teal Canady
Sr. DBA
LRW
Post #1386535
Posted Saturday, November 24, 2012 10:05 PM
Grasshopper

Well, just tonight I've wrapped up my work on this issue and thought I'd post my findings for those who might find it useful.

When the problem between the lab and the production server first turned up, I focused on the obvious metric that was not up to performance, i.e. the database IO rate, which was originally 1.5 MB/s+ in the lab but only 0.3 MB/s on the production server. However, when I brought the production database onto another lab server with the same configuration as the original lab test environment and it still produced only 0.3 MB/s of throughput, I started digging deeper.

As I mentioned in another post, I discovered a significant difference in the query plans created by the two servers. Since I have built a little monitoring tool for my clustered indexes, I knew that the fragmentation rate on the two servers was roughly identical, and I routinely rebuilt the indexes to ensure current statistics, etc. After studying the details, I traced the differences to the size of the data tables. While we had created a lab environment that stressed the number of transactions per hour, it did not approximate the number of detail records in the production environment. The process I was working with takes text-based data and normalizes it onto the appropriate primary keys, including creating new keys/records where we don't have an appropriate record.
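A sketch of the kind of fragmentation check involved (not my tool's exact query; the DMV call here scans in LIMITED mode for speed):

SELECT OBJECT_NAME(ips.object_id) AS table_name,
    i.name AS index_name,
    ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
    AND i.index_id = ips.index_id
WHERE ips.index_id > 0 --skip heaps
ORDER BY ips.avg_fragmentation_in_percent DESC;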

I created a clean database and re-processed my production data, and noted that while the initial inserts happened as quickly as on the lab server, it did not take much time at all for processing performance to degrade.

After studying the data for a while, I concluded that direct table access via the indexes was causing the core problem. My processing procedures are in a master/slave arrangement where the master procedure picks the work to be done and then calls the slave procedure to handle the detail work. I simply added a new set of steps to the master procedure to create five temp tables representing the known records/details for the five object classes in the project. Now, when the detail procedure executes, it merely performs a join between the transaction record and the five temp tables to determine which objects are unknown, and only when necessary does it call another procedure to create the missing object. If all items are known, the appropriate keys are supplied and the procedure inserts the new record into permanent storage.

In my original design, the production database was running execution times of 500 ms+, but with the new structure most executions run in 1-2 ms and the long runs, where there are new records to create, take 30-50 ms. A VAST improvement, don't you think?
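In outline, the new pattern looks something like this (table and column names are hypothetical stand-ins for one of the five object classes):

--Master procedure: snapshot the known keys once per batch
CREATE TABLE #KnownDevices
    (DeviceName varchar(100) PRIMARY KEY, DeviceKey int NOT NULL);
INSERT INTO #KnownDevices (DeviceName, DeviceKey)
SELECT DeviceName, DeviceKey FROM dbo.DimDevice;

--Detail procedure: one join replaces a per-row index probe;
--rows with a NULL DeviceKey are unknown objects needing new keys
SELECT r.RawId, k.DeviceKey
FROM dbo.RawStorage AS r
LEFT JOIN #KnownDevices AS k
    ON k.DeviceName = r.DeviceName
WHERE r.Processed = 0;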

On the lab server I had been able to boost my throughput to 180,000 transactions per hour, but with the new procedure I've been able to boost that to 360,000 transactions per hour. This should keep me well above the actual transaction rates of the production environment, and that is still not 100% utilization. I'm using two jobs which each handle 30,000 items in less than 3 minutes, staggered so that every 5 minutes one of them is processing records. This gives me the throughput with little to no contention for resources. Oh, and on the database IO performance in Activity Monitor: using my new procedures, the production database is now showing 2-3 MB/s of IO throughput, so easily a 5x to 10x improvement on the IO as well.

So, I would like to offer a final big thank you to all who took the time to respond to my post. I appreciated the willingness to share your professional insights and ideas as I worked my way through this challenge.
Post #1388365
Posted Sunday, November 25, 2012 8:10 AM


SSC-Dedicated

And to think there are DBAs out there that forbid the use of Temp Tables. Divide'n'Conquer comes through again!

--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1388402