Performace issue Physical vs Virtual

  • I've got the following issue with a simple select * from large_table running on SQLServer 2012. The physical machine (Production) returns the result set in approx 3.5 mins, the virtual (UAT) returns in 1.8 mins. The table size is equivalent in both environments. The physical machine that hosts the virtual machine is exactly the same spec as the physical machine so use the same CPU's.

    I've made sure there's no physical I/O going on, all the reads are logical.

    A breakdown of the CPU/elapsed times are:

    Production UAT

    CPU Times (ms) 18220 11279

    Elps Time (ms) 232879 110659

    So it looks like both environments are waiting on something as there's a big difference between CPU and Elapsed time. I've then traced the waits for the sessions involved and got the following results

    Production UAT

    NETWORK_IO 19807 96869

    PREEMPTIVE_OS_WAIT_FORSINGLEOBJECT 19727 95873

    So the waits in UAT are much larger than Production where its slow????

    I'm fairly new to SQL Server, coming from an Oracle background so maybe my wait collection technique is wrong. I've used event sessions to capture the waits i.e.

    create event session session_waits on server

    add event sqlos.wait_info

    (WHERE sqlserver.session_id=102 and duration>0)

    , add event sqlos.wait_info_external

    (WHERE sqlserver.session_id=102 and duration>0)

    add target package0.asynchronous_file_target

    (SET filename=N'c:\temp\wait_stats.xel', metadatafile=N'c:\temp\wait_stats.xem');

    Any help/ideas on what could be going would be much appreciated.

    Thanks,

    Fraze

  • Umpteen things could be at play here. Most likely culprits include:

    1) are you ABSOLUTELY sure all the IO is logical? If not, physical IO issues would be my first place.

    2) are you CERTAIN that there is no blocking at all that is keeping one query from running for some period of time?

    3) get a copy of sp_whoisactive (latest version is v11.11 I think, from sqlblog.com). run that iteratively while each select is running and see what percolates up. You can also do differential analysis with it, which could be key to finding out real issue(s)

    4) Network config differences. TCPChimney is a common issue here. Check network stack and transfer/latency rates.

    5) Create variables and to the select of each field into a variable. That will let you test pure read speed without network/client coming into play.

    Those are just the tip of the iceberg.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Great, thanks for the reply.

    1) Yep, I'm sure its all logical - I've re-run the query 20+ times and get the same performance difference consistently - I've used

    SET STATISTICS IO ON

    SET STATISTICS TIME ON

    to get the I/O stats

    2) I'm reasonably certain there are no locks - at least non of the tools I'm using show any locking.

    I'll investigate the other suggestions.

    Thanks,

    Fraze

  • So I've gone with the suggestion of selecting into variables (nice idea - thanks) and performance is still twice as slow on the physical machine vs the virtual one

    Physical

    CPU time - 24,039 ms

    Elapsed - 29,864 ms

    Virtual

    CPU time - 12,090 ms

    Elapsed - 12,769 ms

    I'm going to copy the table to another physical machine to test there

    Thanks,

    Fraze

  • OK, so I've tested on an instance on another physical machine that is exactly the same hardware spec as the other physical and the physical running the VM. In all three tests all were logical I/O and 99% of time was on the CPU.

    The new physical was slightly quicker than the VM which I'd expect, then the "problem" physical was twice as slow as in previous tests.

    Any pointers as to what could be the issue? O/S or CPU issue?

    Thanks,

    Fraze

  • fraze2001 (8/12/2013)


    OK, so I've tested on an instance on another physical machine that is exactly the same hardware spec as the other physical and the physical running the VM. In all three tests all were logical I/O and 99% of time was on the CPU.

    The new physical was slightly quicker than the VM which I'd expect, then the "problem" physical was twice as slow as in previous tests.

    Any pointers as to what could be the issue? O/S or CPU issue?

    Thanks,

    Fraze

    Can't believe I forgot this one: check the BIOS and Windows Power Settings to ensure everything is "High Power" and not set to throttle CPUs in any way. That is a HUGE issue I see over and over on client servers when I give them a performance review.

    Have you made sure that no one did something silly like mess with affinity settings either in windows or SQL Server?

    Cant recall: is maxdop 0 on all machines? Wondering if NUMA issues could be coming into play here. Also, there was a very nasty bug related to NUMA and CPU/Memory utilization that wasn't fixed until SQL 2012 SP1 CU4. Could be hitting that? I was pretty stunned to see what that one could do to a server, although I don't think it applies here in any case now that I think about it.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thanks very much Kevin,

    Well the power settings came up on some further digging I did and sure enough the physical machine that was performing "well" High and the "poor" performing one had it set to Balanced, so I made the change and performance has improved significantly for some queries. I'm seeing 100% improvement in some of the long running jobs!

    I haven't checked the BIOS yet as it is a Production machine so can't reboot them yet.

    The "problem" machine is base SQL Server 2012, no SP's and does have hardware NUMA - 2 nodes, where as the virtual machine has no NUMA. The other physical I tested on was a SQL Server 2008 SP1 machine with 2 NUMA nodes.

    With my test query I've seen the CPU come down by 50% after applying the power settings, but the elapsed time is still around the same which is odd. I've tried to trap any wait events through Extended Events, but its coming back with nothing so I can't explain the difference between CPU time and elapsed time. The elapsed time is wildly inconsistent too, ranging from 19s to 33s while the CPU time is always around the same, approx. 16s. There server is quiet so I cant see this being time spent on the CPU queue. I've added an attachment showing the client stats for a few runs of the same query.

    Thanks,

    Fraze

  • Glad we found some improvements. There are still quite a few things to check, but I think though that anything more will be beyond the capabilities of a forum to really dig into.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply