SQL IO Performance

  • Teal,

    Thank you for the long response. I've gone back over a number of the points you brought up and have branched out into a few other areas as well. I'll summarize what I've learned and where my investigation is heading.

    1. Some details I left out earlier:

    None of the systems are on SANs. These are all local SCSI-based RAID controllers with 4 or 5 volumes, all of which are mirrors. Because of customer need, we had to revert the production server back to the previous version of the software. I brought the production database back in-house to run on a second lab server that is set up with hardware identical to the first lab server. We had originally configured the second server to run the original version of the software in parallel with the new version.

    I secretly hoped that there was simply a difference between the production and lab environments, but the restored database performs at or below the production rates on the second lab server.

    The lab servers both have 32 GB of RAM, and SQL Server has a min/max memory setting of 0/30270 MB. MaxDOP is 0 across 8 cores. (A minimal sketch of how those settings are applied is included at the end of this point.) The original, well-performing database is now over 171 GB in size, and the other database is currently around 8 GB.

    These are dedicated SQL Servers with no other network or service responsibilities. We run the desktop application occasionally for development purposes. There is a management agent service that is part of the application, but its activity does not appear to have had an impact on the first lab server, and it is currently turned off on the second lab server.
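    For completeness, here is a minimal sketch of how those instance settings (min/max memory and MaxDOP) are applied; the values are the ones described above.

        -- Advanced options must be visible before these settings can be changed
        EXEC sp_configure 'show advanced options', 1;
        RECONFIGURE;

        -- Min/max server memory in MB (the 0/30270 setting described above)
        EXEC sp_configure 'min server memory (MB)', 0;
        EXEC sp_configure 'max server memory (MB)', 30270;

        -- MaxDOP 0 lets SQL Server use all 8 cores
        EXEC sp_configure 'max degree of parallelism', 0;
        RECONFIGURE;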

    2. I did some more internet trawling and decided to check out sys.dm_io_virtual_file_stats. The article I read gave a formula for write latency: divide io_stall_write_ms by num_of_writes. The latency on the well-performing server is 2 ms (log), 192 ms (Primary), and 18 ms (EventData). The equivalent latencies on the other server are 2, 43, and 39 ms. So again, the results seem to be reversed: the machine with the problem looks better off overall than the one I'd like to match performance-wise.
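    For anyone who wants to repeat the check, a minimal sketch of that calculation against the DMV (it is just the io_stall_write_ms / num_of_writes division from the article, per database file):

        SELECT DB_NAME(vfs.database_id)               AS database_name,
               mf.name                                AS logical_file_name,
               vfs.num_of_writes,
               vfs.io_stall_write_ms,
               CASE WHEN vfs.num_of_writes = 0 THEN 0
                    ELSE vfs.io_stall_write_ms / vfs.num_of_writes
               END                                    AS avg_write_latency_ms
        FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
        JOIN sys.master_files AS mf
          ON mf.database_id = vfs.database_id
         AND mf.file_id     = vfs.file_id
        ORDER BY avg_write_latency_ms DESC;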

    3. My research has turned in another direction. I ran a sample of the normalization procedure I'm currently concerned about on the production server and the lab server with Show Execution Plan turned on, and discovered that the optimizer creates very different plans on the two machines. When I deployed to the second lab server, I ran another sample and got the same query plan as production. I've begun looking at the schema again, even though the production database was generated from scripts made from the lab server. The execution plan does not indicate a missing index that could improve performance, but the server with the problem shows an extra loop join at critical points in the procedure. I'm going through the database structure again trying to understand why two different query plans would be generated.
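    In case it helps anyone compare, the plan on each server can also be pulled straight from the plan cache with something like the sketch below; dbo.NormalizeDetail is a made-up name standing in for the actual normalization procedure.

        -- Pull the cached plan XML for the procedure so the two servers' plans
        -- can be compared side by side (dbo.NormalizeDetail is a placeholder name).
        SELECT qp.query_plan
        FROM sys.dm_exec_procedure_stats AS ps
        CROSS APPLY sys.dm_exec_query_plan(ps.plan_handle) AS qp
        WHERE ps.database_id = DB_ID()
          AND ps.object_id   = OBJECT_ID('dbo.NormalizeDetail');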

    I did take your advice and ran the wait stats query you provided. The first lab server has much larger wait-time numbers because it has been online longer, but what I noted most about the results is that the top 10 on the well-performing server were all background system waits such as Log Manager, Deadlock Search, Broker, Checkpoint, Dispatcher, etc. On the second lab server, WRITELOG, PAGEIOLATCH_SH, and PAGEIOLATCH_EX all made it into the top 10. Each had a total wait time of less than 1,000 ms, but they still ranked in the top 10, while none appeared in the top 10 of the other server.
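    For reference, the kind of ranking I looked at is roughly the sketch below (a generic top-10 against sys.dm_os_wait_stats, not the exact script you posted):

        -- Top 10 wait types by cumulative wait time since the last restart/clear.
        SELECT TOP (10)
               wait_type,
               waiting_tasks_count,
               wait_time_ms,
               signal_wait_time_ms
        FROM sys.dm_os_wait_stats
        ORDER BY wait_time_ms DESC;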

    Well, thank you again for the thoughtful response. I'm still diving deep.

  • If you are seeing different plans, then I'd concentrate on that.

    Any chance you can post both the good and bad versions of the plans?

  • Rob,

    The difference in query plans could be twofold. After a query is parsed for syntax, the optimizer binds the objects and creates a logical query plan based on statistics; it then does costing based on those statistics to find the right physical plan. If the statistics are out of date, the SQL optimizer will change the logical plan and, as a result, the physical plan. Another factor is the difference in physical database size between your environments.

    If you have not already done so, I would run sp_updatestats on the databases in question and possibly run ALTER INDEX ALL ON <your table> REORGANIZE. Since you now have a second lab where you can reproduce this, I'd start there with the queries above. If that does not resolve your issue, move on to ALTER INDEX ALL ON <your table> REBUILD.
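    A sketch of those steps, with a placeholder table name:

        -- Refresh statistics across the current database
        EXEC sp_updatestats;

        -- Lighter-weight first step: reorganize all indexes on the table
        ALTER INDEX ALL ON dbo.YourTable REORGANIZE;

        -- If the plans still differ, rebuild (this also refreshes the index
        -- statistics with a full scan)
        ALTER INDEX ALL ON dbo.YourTable REBUILD;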

    Teal Canady

    Sr. DBA

    LRW

  • Well, just tonight I wrapped up my work on this issue and thought I'd post my findings for those who might find them useful.

    When the problem between the lab and the production server first turned up, I focused on the obvious metric that was not up to par, i.e. the Database IO rate, which was originally 1.5 MB/s+ in the lab but only 0.3 MB/s on the production server. However, when I brought the production database onto another lab server with the same configuration as the original lab test environment and it still produced only 0.3 MB/s of throughput, I started digging deeper.

    As I mentioned in another post, I discovered that there was a significant difference in the query plans created by the two servers. Since I have built a little monitoring tool for my clustered indexes, I knew that the fragmentation rate on the two servers was roughly identical, and I routinely rebuild the indexes to keep statistics current. After studying the details, I traced the differences to the size of the data tables. While we had created a lab environment that stressed the number of transactions per hour, it did not approximate the number of detail records in the production environment. The process I was working with takes text-based data and normalizes it against the appropriate primary keys, creating new keys/records where we don't already have a matching record.
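    For the curious, the fragmentation check behind that monitoring tool is essentially the standard DMV query, roughly along these lines:

        -- Clustered-index fragmentation for the current database (index_id = 1).
        SELECT OBJECT_NAME(ips.object_id)        AS table_name,
               i.name                            AS index_name,
               ips.avg_fragmentation_in_percent,
               ips.page_count
        FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
        JOIN sys.indexes AS i
          ON i.object_id = ips.object_id
         AND i.index_id  = ips.index_id
        WHERE ips.index_id = 1
        ORDER BY ips.avg_fragmentation_in_percent DESC;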

    I created a clean database and re-processed my production data, and noted that while the initial inserts happened as quickly as on the lab server, it did not take long at all for processing performance to degrade.

    After studying the data for a while, I decided that direct table access via the indexes was the core problem. My processing procedures are in a master/slave arrangement where the master procedure picks the work to be done and then calls the slave procedure to actually handle the detail work. I simply added a new set of steps to the master procedure to create 5 temp tables representing the known records/details for the five object classes in the project. Now when the detail procedure executes, it merely performs a join between the transaction record and the 5 temp tables to determine which objects are unknown, and then, when necessary, it calls another procedure to create the missing object. If all items are known, the appropriate keys are supplied and the procedure inserts the new record into permanent storage. In my original design, the production database was running execution times of 500 ms+, but with the new structure most executions run in 1-2 ms, and the long runs, where there are new records to create, take 30-50 ms. A VAST improvement, don't you think? 😀
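    To make the idea concrete, here is a heavily simplified sketch of the new structure. All of the table, column, and object names are invented for illustration (the real schema has five object classes and a lot more detail); only the pattern matters.

        -- Master procedure: snapshot the known objects into temp tables once per batch.
        -- Only one of the five object classes (a made-up "Device") is shown here.
        CREATE TABLE #KnownDevices
        (
            DeviceKey  INT          NOT NULL PRIMARY KEY,
            DeviceName VARCHAR(100) NOT NULL
        );

        INSERT INTO #KnownDevices (DeviceKey, DeviceName)
        SELECT DeviceKey, DeviceName
        FROM dbo.Devices;                     -- permanent lookup table (illustrative)

        -- Detail procedure: resolve the incoming text values against the temp table
        -- instead of probing the permanent tables' indexes for every transaction.
        SELECT w.TransactionID,
               k.DeviceKey                    -- NULL means the object is unknown and
                                              -- gets handed to the create-object procedure
        FROM #WorkBatch AS w                  -- the batch picked by the master procedure
        LEFT JOIN #KnownDevices AS k
          ON k.DeviceName = w.DeviceName;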

    On the lab server I had been able to boost my throughput to 180,000 transactions per hour, but with the new procedure I've been able to boost that to 360,000 transactions per hour. This should keep me well above the actual transaction rates of the production environment, and that is still not 100% utilization. I'm using two jobs that each handle 30,000 items in less than 3 minutes, which I stagger so that every 5 minutes one of them is processing records. This gives me the throughput with little to no contention for resources. Oh... and on the Database IO performance in Activity Monitor: using my new procedures, the production database is now showing 2-3 MB/s of IO throughput, so easily a 5x to 10x improvement on the IO side as well.

    So, I would like to offer a final big thank you to all who took the time to respond to my post. I appreciated the willingness to share your professional insights and ideas as I worked my way through this challenge.

  • And to think there are DBAs out there that forbid the use of Temp Tables. Divide'n'Conquer comes through again! 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
