Teal,

    Thank you for the long response. I've reviewed again a number of the points you brought up and have branched out on a few other areas as well. I'll summarize what I've learned and where my study of this problem is branching.

    1. Some details missing:

    None of the systems are on SANs. These are all local SCSI-based RAID controllers with 4 or 5 volumes; all volumes are mirrors. Because of customer need, we had to revert the production server to the previous version of the software. I brought the production database back in-house to run on a second lab server set up with hardware identical to the first lab server. We had originally configured the second server to run the original version of the software in parallel with the new version.

    I secretly hoped the difference was simply the production versus lab environment, but the restored database performs at or below the production rates on the second lab server.

    The lab servers both have 32 GB of RAM, and SQL Server has min/max server memory set to 0/30270 MB. MaxDOP is 0 across 8 cores. The original, well-performing database is over 171 GB in size now; the other database is currently around 8 GB.
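For reference, the memory and parallelism settings above can be confirmed on both servers with a query like this (a sketch against the standard sys.configurations view):

```sql
-- Confirm the instance-level settings mentioned above.
-- Memory option values are in MB.
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN ('min server memory (MB)',
               'max server memory (MB)',
               'max degree of parallelism');
```

Running this on both lab servers is a quick way to rule out a configuration drift between the two boxes.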

    These are dedicated SQL Servers and have no other network or service responsibilities. We will run the desktop application occasionally for development purposes. There is a management agent service as part of the application, but its activity on the server has not seemed to have an impact on the first lab server and currently it is turned off on the second lab server.

    2. I did some more internet trawling and decided to check out sys.dm_io_virtual_file_stats. The article I read gave a write-latency formula: divide io_stall_write_ms by num_of_writes. The latency on the well-performing server is 2 ms (log), 192 ms (Primary), and 18 ms (EventData). The equivalent latency on the other server is 2, 43, and 39 ms. So again, the results seem reversed: the machine with the problem looks better off overall than the one I'd like to match, performance-wise.
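For anyone following along, the calculation is just total write stall time over write count. A sketch of the query I used (joined to sys.master_files to get readable file names):

```sql
-- Per-file average write latency: io_stall_write_ms / num_of_writes.
-- NULLIF guards against division by zero on files with no writes yet.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.name                  AS file_name,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_write_latency_ms DESC;
```

Note these are cumulative averages since instance startup, so a server that has been online longer will smooth out recent spikes.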

    3. My research has turned in another direction. I ran a sample of the normalization procedure I'm currently concerned about on the production server and the lab server with Show Execution Plan turned on, and discovered that the optimizer creates very different plans on the two machines. When I deployed to the second server, I ran another sample and got the same query plan as production. I've begun looking at the schema again, even though the production database was generated from scripts made from the lab server. The execution plan does not indicate a missing index that could improve performance, but the server with the problem shows an extra nested loop join at critical points in the procedure. I'm going through the database structure again, trying to understand why two different query plans would be generated.
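To compare the plans side by side without re-running the procedure, I've also been pulling them straight from the plan cache on each server. A sketch of that query (dbo.NormalizeData is a placeholder name for the procedure, not its real name):

```sql
-- Retrieve the cached plan XML for the procedure on each server
-- so the two plans can be diffed offline.
SELECT qp.query_plan,
       cp.usecounts
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
WHERE st.objectid = OBJECT_ID('dbo.NormalizeData');
```

Since plan choice is driven largely by statistics and row estimates, differing plans from identical schemas usually point at stale or different statistics rather than the schema itself.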

    I did take your advice and ran the wait-stats query you provided. The first lab server has much larger wait-time numbers because it has been online longer, but what I noted most about the results is that the top 10 waits on the well-performing server were all system waits: Log Manager, Deadlock Search, Broker, Checkpoint, Dispatcher, and so on. On the second lab server, WRITELOG, PAGEIOLATCH_SH, and PAGEIOLATCH_EX all made the top 10. Each had a total wait time under 1,000 ms, but they still ranked in the top 10, while none appeared in the top 10 on the other server.
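One thing that might make the comparison between the two servers cleaner is filtering out the benign system waits (the Log Manager, Deadlock Search, Checkpoint class of waits) so only workload waits rank. A sketch of that shape, which is not your exact query but the general idea:

```sql
-- Top workload waits, excluding common always-on idle/system waits
-- so WRITELOG and PAGEIOLATCH_* can be compared directly.
SELECT TOP (10)
       wait_type,
       wait_time_ms,
       waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('LAZYWRITER_SLEEP', 'CHECKPOINT_QUEUE',
                        'REQUEST_FOR_DEADLOCK_SEARCH', 'LOGMGR_QUEUE',
                        'SLEEP_TASK', 'DIRTY_PAGE_POLL',
                        'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;
```

With the idle waits excluded, the question becomes whether the second server's WRITELOG and PAGEIOLATCH waits grow meaningfully under the normalization workload.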

    Well, thank you again for the thoughtful response. I'm still diving deep.