Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 12»»

SQL IO Performance Expand / Collapse
Author
Message
Posted Tuesday, November 13, 2012 4:41 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Thursday, April 24, 2014 11:36 PM
Points: 21, Visits: 92
I'm running out of brain cells trying to figure out what I've overlooked. Hoping someone has a quick checklist of things I can look for...

Scenario: I have 3 SQL 2008 servers, 2 lab/testing, 1 production with almost identical hardware (lab servers run 32Gb ram while the production server is at 24Gb). Disk configuration involves 4 mirrors (OS, Transactional Data, Warehouse Data, Log). I deployed process which is executed by 4 jobs. On my lab server, I was able to optimize the performance to process one event record every 55-62ms. When I pushed that process to the production server the same syntax processes one record every 550-650ms. The production server shows no waits or recognizable performance problems in Performance Monitor.

The only thing that points me in any direction is that on the lab server the Database IO on the Activity Monitor runs at 1.5Mb/sec with routine spikes above that while in the production environment the IO stubbornly runs at 0.3 MB/sec with occasional and dramatic spikes upward to the 4-5 MB/sec range. We tested raw disk performance and it runs at 100Mb/sec easily. I pulled out SQLIO and it shows >6.0Mb/sec in production and only 5.xMb/sec on the lab server. SQLIOSim, however, shows a significant difference. Running the standard IO configuration file in SQLIOSim, the lab server completes the entire run in less than 1/3 the time of the Production server.

It just feels like there is some sort of governor that is capping the IO performance, but I can't think of any particular setting that would do that... especially where it would have been turned off in one place and not the others. Just looking for tips on what else I can look at to uncover why one machine seems to scream while the others are just put putting along.

Thanks,

Rob
Post #1384342
Posted Tuesday, November 13, 2012 9:23 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Thursday, January 10, 2013 10:08 PM
Points: 5, Visits: 94
Rob,

I know the end of the rope feeling so I thought I would at least try to help out. I'm assuming at this point you have double or triple checked that the same code is in place; have you also checked the db config and file config of the db in question? Have you checked the query plan on both environments to ensure they are identical? In addition I would also check the server config for RAM; MAXDOP etc. Confirm there is a difference after running DBCC FREEPROCCACHE as well; could be an old query plan in there.
Are either of the hard drives you’re running on different, SSD vs.15k vs. SATA? Are they on the same SAN, different SAN or local? Even if the hard drives are the same in the same mirrored configurations it may be that they are different SANS different caching.

Any other information would be helpful if you can provide it.
Post #1384377
Posted Tuesday, November 13, 2012 10:20 PM
Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Tuesday, April 1, 2014 3:26 PM
Points: 316, Visits: 1,497
compare prod vs test perfmon counters for logical disk reads per sec, write per sec, seconds per write, seconds per read, and current disk queue length . Any discrepancies?



Post #1384385
Posted Wednesday, November 14, 2012 1:55 AM


SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Yesterday @ 11:49 AM
Points: 129, Visits: 861
Someone already asked about drive differences so maybe I can ask something else:

Is the data in both environments the same or have you rebuilt indexes, refreshed everything, restarted sql server to get a "clean" test for both. -- or does production have more data, and not as "clean" as the Lab Servers? (Though assuming production box you don't have that opportunity maybe?)

Though I think you already checked "select" from sys.configurations - if you haven't you could compare there too, but guessing that's not the place.

Shot in the dark - could your recovery model be different in Prod and TestLab?




--------------------------------------------------
...0.05 points per day since registration... slowly crawl up to 1 pt per day hopefully
Post #1384463
Posted Wednesday, November 14, 2012 3:00 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Yesterday @ 1:56 PM
Points: 6,365, Visits: 13,695
robr-793239 (11/13/2012)
I'm running out of brain cells trying to figure out what I've overlooked. Hoping someone has a quick checklist of things I can look for...

Scenario: I have 3 SQL 2008 servers, 2 lab/testing, 1 production with almost identical hardware (lab servers run 32Gb ram while the production server is at 24Gb). Disk configuration involves 4 mirrors (OS, Transactional Data, Warehouse Data, Log). I deployed process which is executed by 4 jobs. On my lab server, I was able to optimize the performance to process one event record every 55-62ms. When I pushed that process to the production server the same syntax processes one record every 550-650ms. The production server shows no waits or recognizable performance problems in Performance Monitor.

The only thing that points me in any direction is that on the lab server the Database IO on the Activity Monitor runs at 1.5Mb/sec with routine spikes above that while in the production environment the IO stubbornly runs at 0.3 MB/sec with occasional and dramatic spikes upward to the 4-5 MB/sec range. We tested raw disk performance and it runs at 100Mb/sec easily. I pulled out SQLIO and it shows >6.0Mb/sec in production and only 5.xMb/sec on the lab server. SQLIOSim, however, shows a significant difference. Running the standard IO configuration file in SQLIOSim, the lab server completes the entire run in less than 1/3 the time of the Production server.

It just feels like there is some sort of governor that is capping the IO performance, but I can't think of any particular setting that would do that... especially where it would have been turned off in one place and not the others. Just looking for tips on what else I can look at to uncover why one machine seems to scream while the others are just put putting along.

Thanks,

Rob

Could you supply a little more detail on how the disks are setup and presented to each system?


-----------------------------------------------------------------------------------------------------------

"Ya can't make an omelette without breaking just a few eggs"
Post #1384495
Posted Wednesday, November 14, 2012 10:25 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Thursday, April 24, 2014 11:36 PM
Points: 21, Visits: 92
I appreciate all the responses giving me additional things to consider.

A little more color to fill in on the details. The servers both run 4 mirrored SCSI volumes for OS, PRIMARY Filegroup, EVENTDATA Filegroup, and Log files. Fragmentation rates across all the tables are about equal. All clustered indexes on both servers were rebuilt within the last week. Production server was started up on Friday and has accumulated about 5 million Event Data records while the lab server has been in operation for over 3 months and has accumulated 91 million event records.

An interesting event occurred last night in production. I logged on to the production server to do some more poking and I found that the Database IO rate was up to 5-6 Mb/sec. I was floored so I looked around and realized that the database backup routine had started and it was performing a backup. This morning when I checked in again it was back to its 0.3Mb/sec throughput.

So, the only glimmer of insight that I've realized now about the differences is that we relatively recently added a 4th volume to hold the TempDb. When we did it, I was attempting to address the same performance issue in the lab server, but it seemed that the change did not improve the overall throughput of the event processing. However, in our production system, the TempDb is on the same volume as the PRIMARY File Group. My current tack will be to try to separate those two and see what happens.

Again, I appreciate the feedback and opportunity to see the problem through other eyes.

Rob
Post #1384760
Posted Wednesday, November 14, 2012 10:50 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Tuesday, September 2, 2014 2:48 PM
Points: 194, Visits: 326
What version of Windows are these running on? Or, more importantly- are you sure the disk partitions are aligned [in both environments]? Nothing leads me to suspect this is your issue, but is a good thing to be sure of.

http://support.microsoft.com/default.aspx?scid=kb;en-us;929491&sd=rss&spid=3198



-Ken
Post #1384770
Posted Wednesday, November 14, 2012 11:18 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Thursday, January 10, 2013 10:08 PM
Points: 5, Visits: 94
Rob,


Do you have SQL Sentry or SQL Monitor in place? Your post regarding backups jogged my memory of a simlar situation with one of my production environments where processing a particular cube was causing havoc with I/O and CPU to my other databases. Depending on what kind of event processing is taking place TempDB residing with Primary could definitly be the culprit as could file growth on your production systems if it is set to autogrow etc.

Teal Canady
Post #1384777
Posted Wednesday, November 14, 2012 3:03 PM
Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Thursday, September 11, 2014 7:39 PM
Points: 65, Visits: 400
robr-793239 (11/14/2012)
However, in our production system, the TempDb is on the same volume as the PRIMARY File Group. My current tack will be to try to separate those two and see what happens.


Hmmmm, that might be it. Post back after you move it and what the results are.
Post #1384894
Posted Wednesday, November 14, 2012 6:16 PM
Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Tuesday, April 1, 2014 3:26 PM
Points: 316, Visits: 1,497
I'm still curious about latency (avg sec per read and avg sec per write for each logical drive )

how many iops does sqlio show for 8k random reads and writes?

5mb/sec is really slow for backups, especially if the drive can handle 100mb/s.

Can you post some wait stats?

What does cpu look like?
Post #1384944
« Prev Topic | Next Topic »

Add to briefcase 12»»

Permissions Expand / Collapse