SQL IO Performance

  • I'm running out of brain cells trying to figure out what I've overlooked. Hoping someone has a quick checklist of things I can look for...

    Scenario: I have 3 SQL 2008 servers, 2 lab/testing and 1 production, with almost identical hardware (the lab servers run 32GB RAM while the production server is at 24GB). The disk configuration involves 4 mirrors (OS, Transactional Data, Warehouse Data, Log). I deployed a process which is executed by 4 jobs. On my lab server I was able to optimize the performance to process one event record every 55-62ms. When I pushed that process to the production server, the same syntax processes one record every 550-650ms. The production server shows no waits or recognizable performance problems in Performance Monitor.

    The only thing that points me in any direction is that on the lab server the Database IO in Activity Monitor runs at 1.5 MB/sec with routine spikes above that, while in the production environment the IO stubbornly runs at 0.3 MB/sec with occasional and dramatic spikes upward into the 4-5 MB/sec range. We tested raw disk performance and it runs at 100 MB/sec easily. I pulled out SQLIO and it shows >6.0 MB/sec in production and only 5.x MB/sec on the lab server. SQLIOSim, however, shows a significant difference: running the standard IO configuration file, the lab server completes the entire run in less than 1/3 the time of the production server.

    It just feels like there is some sort of governor that is capping the IO performance, but I can't think of any particular setting that would do that... especially one that would have been turned off in one place and not the others. Just looking for tips on what else I can look at to uncover why one machine seems to scream while the others are just putt-putting along.

    Thanks,

    Rob

  • Rob,

    I know the end-of-the-rope feeling, so I thought I would at least try to help out. I'm assuming at this point you have double- or triple-checked that the same code is in place; have you also checked the database config and file config of the db in question? Have you checked the query plan in both environments to ensure they are identical? In addition, I would also check the server config for RAM, MAXDOP, etc. Confirm the difference is still there after running DBCC FREEPROCCACHE as well; it could be an old query plan in there (a rough sketch of what I'd compare is at the end of this post).

    Are the hard drives you're running on different: SSD vs. 15K vs. SATA? Are they on the same SAN, a different SAN, or local? Even if the hard drives are the same, in the same mirrored configuration, they may be on different SANs with different caching.

    Any other information would be helpful if you can provide it.
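
    To make that concrete, here is roughly what I'd run on both servers and diff (just a sketch; the procedure name is a placeholder for whatever your event procedure is called):

    -- Clear the plan cache so both servers compile fresh plans (do this off-hours)
    DBCC FREEPROCCACHE;

    -- Pull the cached plan for the procedure on each server and compare the XML
    -- 'YourEventProc' is a placeholder for the actual procedure name
    SELECT st.text, qp.query_plan
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
    CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
    WHERE st.text LIKE '%YourEventProc%';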

  • Compare prod vs. test PerfMon counters for Logical Disk: reads/sec, writes/sec, avg sec/read, avg sec/write, and current disk queue length. Any discrepancies?
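
    If it's easier to grab from inside SQL, something along these lines against the file-stats DMV shows roughly the same picture per database file (just a sketch; the averages are cumulative since the last restart):

    SELECT DB_NAME(vfs.database_id) AS db,
           mf.physical_name,
           vfs.num_of_reads,
           vfs.num_of_writes,
           CASE WHEN vfs.num_of_reads = 0 THEN 0
                ELSE vfs.io_stall_read_ms / vfs.num_of_reads END AS avg_read_ms,
           CASE WHEN vfs.num_of_writes = 0 THEN 0
                ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
      ON mf.database_id = vfs.database_id
     AND mf.file_id = vfs.file_id
    ORDER BY (vfs.io_stall_read_ms + vfs.io_stall_write_ms) DESC;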

  • Someone already asked about drive differences so maybe I can ask something else:

    Is the data in both environments the same, or have you rebuilt indexes, refreshed everything, and restarted SQL Server to get a "clean" test for both? Or does production have more data and is not as "clean" as the lab servers? (Though on the production box you may not have that opportunity.)

    Though I think you already checked SELECT * FROM sys.configurations; if you haven't, you could compare there too, but I'm guessing that's not the place.

    Shot in the dark: could your recovery model be different in Prod and Test/Lab? (A rough comparison sketch is below.)
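
    Something like this, run on both servers, would let you diff the recovery models and the rest of the server-level settings (just a sketch):

    -- Recovery model per database
    SELECT name, recovery_model_desc
    FROM sys.databases;

    -- Server-level settings worth diffing between Lab and Prod
    SELECT name, value, value_in_use
    FROM sys.configurations
    ORDER BY name;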

    --------------------------------------------------
    ...0.05 points per day since registration... slowly crawl up to 1 pt per day hopefully 😀

  • robr-793239 (11/13/2012)


    Scenario: I have 3 SQL 2008 servers ... Disk configuration involves 4 mirrors (OS, Transactional Data, Warehouse Data, Log) ... It just feels like there is some sort of governor that is capping the IO performance ...

    Could you supply a little more detail on how the disks are set up and presented to each system?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • I appreciate all the responses giving me additional things to consider.

    A little more color to fill in the details. The servers both run 4 mirrored SCSI volumes for the OS, the PRIMARY filegroup, the EVENTDATA filegroup, and the log files. Fragmentation rates across all the tables are about equal. All clustered indexes on both servers were rebuilt within the last week. The production server was started up on Friday and has accumulated about 5 million event data records, while the lab server has been in operation for over 3 months and has accumulated 91 million event records.
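
    (For reference, a quick way to compare fragmentation on both servers is something roughly like this; LIMITED mode keeps the scan fast.)

    -- Rough fragmentation comparison per index in the current database
    SELECT OBJECT_NAME(ips.object_id) AS table_name,
           i.name AS index_name,
           ips.avg_fragmentation_in_percent,
           ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
      ON i.object_id = ips.object_id
     AND i.index_id = ips.index_id
    ORDER BY ips.avg_fragmentation_in_percent DESC;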

    An interesting event occurred last night in production. I logged on to the production server to do some more poking and found that the Database IO rate was up to 5-6 MB/sec. I was floored, so I looked around and realized that the database backup routine had started and was performing a backup. This morning when I checked in again it was back to its 0.3 MB/sec throughput.

    So, the only glimmer of insight I've had about the differences is that we relatively recently added a 4th volume to hold TempDB. When we did, I was attempting to address this same performance issue on the lab server, but the change did not improve the overall throughput of the event processing. However, in our production system TempDB is on the same volume as the PRIMARY filegroup. My current tack will be to try to separate those two and see what happens.
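
    Roughly what I have in mind for the check and the move (the T:\ path is just a placeholder, the default tempdev/templog logical names are assumed, and the relocation only takes effect after a restart of the SQL service):

    -- Where are the tempdb files right now?
    SELECT name, physical_name
    FROM sys.master_files
    WHERE database_id = DB_ID('tempdb');

    -- Point them at the new volume (placeholder path); requires a service restart
    ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'T:\TempDB\tempdb.mdf');
    ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'T:\TempDB\templog.ldf');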

    Again, I appreciate the feedback and opportunity to see the problem through other eyes.

    Rob

    What version of Windows are these running on? Or, more importantly, are you sure the disk partitions are aligned in both environments? Nothing leads me to suspect this is your issue, but it is a good thing to be sure of.

    http://support.microsoft.com/default.aspx?scid=kb;en-us;929491&sd=rss&spid=3198


    -Ken

  • Rob,

    Do you have SQL Sentry or SQL Monitor in place? Your post regarding backups jogged my memory of a similar situation with one of my production environments where processing a particular cube was causing havoc with I/O and CPU for my other databases. Depending on what kind of event processing is taking place, TempDB residing with PRIMARY could definitely be the culprit, as could file growth on your production system if it is set to autogrow, etc.
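
    A quick way to check the growth settings in the database in question (just a sketch):

    SELECT name,
           type_desc,
           size * 8 / 1024 AS size_mb,
           CASE WHEN is_percent_growth = 1
                THEN CAST(growth AS varchar(10)) + ' %'
                ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
           END AS growth_setting
    FROM sys.database_files;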

    Teal Canady

  • robr-793239 (11/14/2012)


    However, in our production system, the TempDb is on the same volume as the PRIMARY File Group. My current tack will be to try to separate those two and see what happens.

    Hmmmm, that might be it. Post back with the results after you move it.

  • I'm still curious about latency (avg sec per read and avg sec per write for each logical drive).

    How many IOPS does SQLIO show for 8K random reads and writes?

    5 MB/sec is really slow for backups, especially if the drive can handle 100 MB/s.

    Can you post some wait stats?

    What does CPU look like?
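
    Even a quick snapshot of what the active requests are waiting on while the jobs run would tell a lot. Something along these lines (just a sketch) would do it:

    -- Active requests, their current waits, CPU, and the statement text
    SELECT r.session_id, r.status, r.command,
           r.wait_type, r.wait_time, r.last_wait_type,
           r.cpu_time, r.reads, r.writes,
           t.text
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.session_id > 50;   -- skip system sessions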

  • OK... so here is some new information:

    I moved TempDB to a separate eSATA drive on the server. My initial thought was that things seemed a little better, but 5 hours later, if anything, the system is running slower than it was. Jobs that were running in 2 1/2 hours are now taking 2 3/4 hours.

    I have some more specific statistics to share.

    For SQLIO results, the production server runs 881.66 IOPS at 6.88 MB/s for 8K over a 5-minute run. The lab server runs 678.28 IOPS at 5.29 MB/s.

    In Performance Monitor, I'm seeing the following (per volume):

                  Avg Disk sec/Read   Avg Disk sec/Write   Avg Disk sec/Transfer
    Primary       0.017               0.028                0.202
    EventData     0.017               0.028                0.028
    Log           0.000               0.003                0.003
    TempDB        0.000               0.004                0.004

    I just observed an event that has given me some pause; maybe this will change the nature of the conversation. What I've been focusing on is a procedure that fires off and moves data from a raw storage table to a normalized data warehouse table. This requires taking data from remote clients, looking up key values, and doing data conversion and other actions such as detecting new data and creating keys, etc. On the lab server I can run 4 instances of this job, offset by 5 minutes each, and together they will clear 180,000 records an hour. The production server, not so much...

    However, I happened to be watching the server at the top of the hour when an external agent fires off a sync command to the remote clients and they begin dumping their records into the raw storage table. In this sync process, eight remote clients at a time dump their records in batches of 100 records each. What caught my attention is that during this phase the Database IO rate shot up to over 100 MB/sec, and the processor, which had been pegged at 50% utilization, was suddenly peaking at 100%. After the sync process completed, the IO went back down to 0.3 MB/sec and the processor dropped back to 50%.

    Could there be some odd result because this process is fired from a Job? Of course, I'm just talking out loud now... working my way through the system a piece at a time.

    Oh, and to address an earlier question about the structure: the production server was just deployed with fresh scripts from the latest working lab deployment we have, so I'm confident the procedure logic is identical to what is working extremely well on the lab server.

  • This is probably way off base... but is there something external to the database getting in the way? For example, an over-enthusiastic virus checker.

  • You could hunt-and-peck on this forum for days or weeks for a problem such as this (and I have seen that happen any number of times) and never find a true cause/resolution. Or you could hire a qualified professional to come on board for a few hours or a day or two and really get things nailed down (while simultaneously mentoring you on their methodology). Seems like a no-brainer to me...

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Are you using SSIS packages in your jobs? If so, make sure the server does not have any resource contention between the SSIS and SQL services.

    I believe someone has mentioned this already, but have you checked the Max Memory and Max DOP settings? Make sure you have Max Memory set to something below the total server RAM.
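
    A quick way to check both, plus what the OS has left over for SSIS and everything else (rough sketch):

    -- Current values for the two settings mentioned above
    SELECT name, value_in_use
    FROM sys.configurations
    WHERE name IN ('max server memory (MB)', 'max degree of parallelism');

    -- What the OS currently has available outside SQL
    SELECT total_physical_memory_kb / 1024 AS total_mb,
           available_physical_memory_kb / 1024 AS available_mb,
           system_memory_state_desc
    FROM sys.dm_os_sys_memory;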

  • Rob,

    To summarize your system here is what I have gathered so far:

            RAM     RAID     Quantity   Breakdown

    Lab     32GB    RAID 1   4          OS, Trans, DWH, Logs

    Prod    24GB    RAID 1   4          OS, Trans, DWH, Logs

    Things we don't know:

    • Is it on a SAN?

    • Processor types and cores identical?

    • Maximum memory

    • Minimum memory

    • Max DOP setting on each system

    What I understand of your problem is this:

    In the lab you can run four simultaneous processes and the event translation takes around 50-60ms. In production there are 8 simultaneous processes which take nearly 10x as long (550ms to 650ms). There also appears to be generally high CPU utilization of 50% in production, and backups are causing contention with I/O when they kick off.

    Based on that:

    SAN:

    I'm asking about your SAN because, if you are indeed on a SAN, your I/O stats could be lying to you. I believe SQLIO is based off WMI events and PerfMon counters, so the metrics you have may or may not be accurate.

    Question:

    Are these disks on a SAN or no?

    RAM:

    The RAM is different between production and lab; in this case your lab has more. If your production server comes under pressure from too little RAM, it can start paging out, causing I/O issues, CPU issues, etc. So, for instance, with 24 GB in place maybe you have 20 GB as your max and 8 GB as your minimum setting. This would generally be OK provided you don't have other external programs running on this server that operate outside SQL Server's reserved memory pool. This also includes items from within SQL such as CLRs and linked server calls.

    When SQL comes under pressure it will try paging out what's in memory, but only from its reserved pool; freeing any other memory relies on the OS to release it. Again, this includes linked server calls and CLRs, etc. In earlier editions of SQL, such as 2005, I have sometimes found it necessary to lock MIN and MAX memory in place with the same values because it couldn't release the RAM in a timely fashion; but that is neither here nor there.

    Question:

    What are your minimum and maximum memory settings on your production server? Are there any other external programs running on the server? Are you using CLRs extensively?

    Settings can be gathered via the GUI or via sp_configure.

    --Allow all advanced options to be visible
    exec sp_configure 'show advanced options', 1;
    GO
    --Apply the change now rather than waiting for the next restart of the server
    RECONFIGURE WITH OVERRIDE;
    GO
    --View all the settings
    exec sp_configure;
    GO

    MaxDOP:

    Depending on how many cores you have and whether this process goes parallel, this setting matters a great deal. If you have a MAXDOP of 0 and the process goes parallel, then even a single processing event could be using all the cores on the system.

    Question:

    How many cores are on your system, and what is the maximum degree of parallelism currently set to?

    See sp_configure above.
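
    If it does turn out to need capping, something like this would do it (the value 4 is only an example and depends on your core count); the second query shows which sessions are currently running in parallel:

    -- Example only: cap parallelism server-wide (requires 'show advanced options' as above)
    EXEC sp_configure 'max degree of parallelism', 4;
    RECONFIGURE WITH OVERRIDE;

    -- Sessions currently using more than one worker (i.e. running in parallel)
    SELECT session_id, COUNT(*) AS worker_count
    FROM sys.dm_os_tasks
    WHERE session_id > 50
    GROUP BY session_id
    HAVING COUNT(*) > 1;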

    Wait stats:

    --Taken from http://msdn.microsoft.com/en-us/magazine/cc135978.aspx
    SELECT TOP 10
        [Wait type] = wait_type,
        [Wait time (s)] = wait_time_ms / 1000,
        [% waiting] = CONVERT(DECIMAL(12,2), wait_time_ms * 100.0
                      / SUM(wait_time_ms) OVER())
    FROM sys.dm_os_wait_stats
    WHERE wait_type NOT LIKE '%SLEEP%'
    ORDER BY wait_time_ms DESC;

    Clustered Indexes:

    If you're missing clustered indexes / primary keys, it can cause your CPU to go nuts as SQL tries to satisfy a query across a heap. It's worth taking a look at the DBs in question, as well as any others on the system, to see if this is the case.

    --Missing primary keys
    SELECT
        t.TABLE_CATALOG,
        t.TABLE_SCHEMA,
        t.TABLE_NAME,
        t.TABLE_TYPE
    FROM INFORMATION_SCHEMA.TABLES t
    LEFT JOIN INFORMATION_SCHEMA.TABLE_CONSTRAINTS pk
        ON t.TABLE_CATALOG = pk.TABLE_CATALOG
       AND t.TABLE_NAME = pk.TABLE_NAME
       AND t.TABLE_SCHEMA = pk.TABLE_SCHEMA
       AND pk.CONSTRAINT_TYPE = 'PRIMARY KEY'
    WHERE pk.TABLE_NAME IS NULL
      AND t.TABLE_TYPE = 'BASE TABLE'
    ORDER BY
        t.TABLE_CATALOG,
        t.TABLE_SCHEMA,
        t.TABLE_NAME;

    --Missing clustered indexes
    SELECT name
    FROM sys.objects
    WHERE type = 'U'
      AND object_id NOT IN
          (SELECT object_id FROM sys.indexes WHERE index_id = 1);

    *Apologies I forget where I got this from; if someone knows the author or URL I’ll add it

    Missing indexes:

    This can also be a point of contention, as you're relying on statistics and scans to satisfy queries. It might be worth looking into to see if you can optimize the tables and queries causing high CPU.

    --Count of missing indexes
    SELECT
        DatabaseName = DB_NAME(database_id),
        COUNT(*) AS [Number Indexes Missing]
    FROM sys.dm_db_missing_index_details
    GROUP BY DB_NAME(database_id)
    ORDER BY 2 DESC;

    --Top ten missing indexes based on usage since last restart
    SELECT TOP 10
        [Total Cost] = ROUND(avg_total_user_cost * avg_user_impact * (user_seeks + user_scans), 0),
        avg_user_impact,
        TableName = statement,
        [EqualityUsage] = equality_columns,
        [InequalityUsage] = inequality_columns,
        [Include Columns] = included_columns
    FROM sys.dm_db_missing_index_groups g
    INNER JOIN sys.dm_db_missing_index_group_stats s
        ON s.group_handle = g.index_group_handle
    INNER JOIN sys.dm_db_missing_index_details d
        ON d.index_handle = g.index_handle
    ORDER BY [Total Cost] DESC;

    *Apologies I forget where I got this from; if someone knows the author or URL I’ll add it

    If all this doesn't point you in the right direction or give you a better idea of what could be going on, you have a couple of other options that are more time consuming, which is why I left them for the end.

    Install the trial version of SQL Sentry or SQL Monitor; monitor your lab environment and production and see what is going on. Each of these products provides metrics and a handy GUI to help you better analyze your environment.

    Finally, and least appealing to me (because it's the most time consuming), would be to install SQL 2012 somewhere and hook up the Distributed Replay feature; this would allow you to capture and then replay what is going on in your production environment directly in your lab. I hope this helps.

    Teal Canady

    Sr. DBA

    Lieberman Research Worldwide

    *Please excuse any spelling or grammatical mistakes; I'm just a DBA: )
