Strange performance of simple query: sometimes fast, sometimes slow

  • I've run into a weird situation with a query that sometimes runs blazing fast (<1ms) but at other times can take around 500ms. For a less frequently executed query this wouldn't be such a problem but, unfortunately, this one's used a lot.

    SELECT [ta0_astd].[ApplicationStatusTrnGuid],
           [ta0_astd].[ApplicationGuid],
           [ta0_astd].[StatusKey],
           [ta0_astd].[IsCurrent],
           [ta0_astd].[UpdatedBy],
           [ta0_astd].[UpdateDate],
           [ta0_astd].[CreatedBy],
           [ta0_astd].[CreatedDate]
    FROM [app].[STATUS_TRN] AS ta0_astd
    WHERE ta0_astd.[ApplicationGuid] = @ApplicationGuid
    ;

    The primary key column is ApplicationStatusTrnGuid, and is non-clustered. The table has a clustered index on ApplicationGuid, since this is what we most commonly query against. Both of these columns are UNIQUEIDENTIFIERS. The use of UNIQUEIDENTIFIERS is obviously controversial but, in this case, it is what it is - there are simply too many dependencies in the app layer to switch to INTs.

    There are around 1.6 million rows in the table, and the above query can return anywhere between 0 and 25-ish rows, so it's quite selective. Execution time does not correspond to the number of rows returned: sometimes the query is slow when no (or only a few) rows are returned; conversely, it's often fast when a couple of dozen rows come back.

    The table has a number of other indexes which have been added to cover various query scenarios. Here's the full definition:

    CREATE TABLE [app].[STATUS_TRN](
        [ApplicationStatusTrnGUID] [uniqueidentifier] ROWGUIDCOL NOT NULL DEFAULT (newsequentialid()),
        [ApplicationGUID] [uniqueidentifier] NOT NULL CONSTRAINT [DF__STATUS_TR__Appli__0D44F85C] DEFAULT ('00000000-0000-0000-0000-000000000000'),
        [StatusKey] [int] NOT NULL CONSTRAINT [DF__STATUS_TR__Statu__4737FD87] DEFAULT ((0)),
        [SourceKey] [int] NULL,
        [UpdateDate] [datetimeoffset](4) NULL DEFAULT (sysdatetimeoffset()),
        [AssignedUserKey] [int] NULL,
        [ReasonKey] [int] NULL CONSTRAINT [DF__STATUS_TR__Reaso__10216507] DEFAULT ((0)),
        [ReferRoleKey] [int] NULL,
        [UpdatedBy] [int] NULL,
        [CreatedDate] [datetimeoffset](4) NULL DEFAULT (sysdatetimeoffset()),
        [CreatedBy] [int] NULL,
        [IsCurrent] [bit] NOT NULL CONSTRAINT [DF_STATUS_TR_IsCurrent] DEFAULT ((0)),
        CONSTRAINT [PK_STATUS_TRN] PRIMARY KEY NONCLUSTERED
        (
            [ApplicationStatusTrnGUID] ASC
        ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]
    GO

    CREATE CLUSTERED INDEX [IX_STATUS_TRN_ApplicationGUID] ON [app].[STATUS_TRN]
    (
        [ApplicationGUID] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    CREATE NONCLUSTERED INDEX [IX_STATUS_TRN_ApplicationGUID_ApplicationStatusTrnGUID_UpdateDate_StatusKey] ON [app].[STATUS_TRN]
    (
        [ApplicationGUID] ASC,
        [ApplicationStatusTrnGUID] ASC,
        [UpdateDate] ASC,
        [StatusKey] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    CREATE NONCLUSTERED INDEX [IX_STATUS_TRN_StatusKey] ON [app].[STATUS_TRN]
    (
        [StatusKey] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    CREATE NONCLUSTERED INDEX [IX_STATUS_TRN_StatusKey_ApplicationGUID_ApplicationStatusTrnGUID_UpdateDate] ON [app].[STATUS_TRN]
    (
        [StatusKey] ASC,
        [ApplicationGUID] ASC,
        [ApplicationStatusTrnGUID] ASC,
        [UpdateDate] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    ALTER TABLE [app].[STATUS_TRN] ADD CONSTRAINT [FK_STATUS_TRN_APPLICATION_MST] FOREIGN KEY([ApplicationGUID])
    REFERENCES [app].[APPLICATION_MST] ([ApplicationGUID])
    GO

    ALTER TABLE [app].[STATUS_TRN] CHECK CONSTRAINT [FK_STATUS_TRN_APPLICATION_MST]
    GO

    CREATE NONCLUSTERED INDEX [IX_STATUS_TRN_ApplicationGUID_StatusKey_UpdateDate] ON [app].[STATUS_TRN]
    (
        [ApplicationGUID] ASC,
        [StatusKey] ASC,
        [UpdateDate] ASC
    ) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
    GO

    CREATE NONCLUSTERED INDEX [IX_STATUS_TRN_ApplicationGUID_with_UpdateDate] ON [app].[STATUS_TRN]
    (
        [ApplicationGUID] ASC
    )
    INCLUDE ([UpdateDate]) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
    GO

    CREATE NONCLUSTERED INDEX [IX_STATUS_TRN_IsCurrent_StatusKey_ApplicationGUID] ON [app].[STATUS_TRN]
    (
        [IsCurrent] ASC,
        [StatusKey] ASC,
        [ApplicationGUID] ASC
    ) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
    GO

    CREATE NONCLUSTERED INDEX [IX_STATUS_TRN_ApplicationGUID_IsCurrent_StatusKey] ON [app].[STATUS_TRN]
    (
        [ApplicationGUID] ASC,
        [IsCurrent] ASC,
        [StatusKey] ASC
    )
    INCLUDE ([ApplicationStatusTrnGUID],
             [UpdateDate]) WITH (SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
    GO

    There are 216 rows in the table that have the default ApplicationGUID value of '00000000-0000-0000-0000-000000000000'. Values of ApplicationGUID are assigned on the app.APPLICATION_MST table using newsequentialid(). Applications in this table can change status through a workflow until they reach one of several final states (Live, Cancelled, etc.). Older applications don't tend to change status much because they're likely to already be in a final state.

    The database has RCSI enabled, and is currently relatively quiet since it's only being used for testing. Operations on the table are mostly reads, along with regular INSERTs. Rows are never UPDATEd.

    None of the indexes are fragmented. Here are the stats for them:

    Schema  Table       Index                                                                         Avg. Frag %         Page Count
    ===============================================================================================================================
    app     STATUS_TRN  IX_STATUS_TRN_ApplicationGUID                                                 0.0267451190157796  18695
    app     STATUS_TRN  PK_STATUS_TRN                                                                 0.0342504852152072  8759
    app     STATUS_TRN  IX_STATUS_TRN_ApplicationGUID_ApplicationStatusTrnGUID_UpdateDate_StatusKey  0.0443616360571378  11271
    app     STATUS_TRN  IX_STATUS_TRN_ApplicationGUID_IsCurrent_StatusKey                             0.0436262106273449  11461
    app     STATUS_TRN  IX_STATUS_TRN_ApplicationGUID_StatusKey_UpdateDate                            0.0611246943765281  8180
    app     STATUS_TRN  IX_STATUS_TRN_ApplicationGUID_with_UpdateDate                                 0.0540248514316586  7404
    app     STATUS_TRN  IX_STATUS_TRN_IsCurrent_StatusKey_ApplicationGUID                             0.541842263696568   6644
    app     STATUS_TRN  IX_STATUS_TRN_StatusKey                                                       0.419189566837448   6441
    app     STATUS_TRN  IX_STATUS_TRN_StatusKey_ApplicationGUID_ApplicationStatusTrnGUID_UpdateDate   0.221906621693591   11266

    Query performance, as viewed from the client Web API application, typically looks as shown in IntermittentSlowAppStatusQuery1.PNG and IntermittentSlowAppStatusQuery2.PNG (attached). Note that this data was collected with ANTS Performance Profiler, so queries are shown in order of decreasing execution time, not order of execution.

    You can see that roughly half the queries are very fast, whilst half are slow. The query is exactly the same every time, with only the value of the parameter varying. The query is generated via the Dapper.SimpleLoad library, and executed using the Dapper microORM.

    The execution plan for the worst performing instance, which doesn't appear to show anything amiss, is shown in IntermittentSlowAppStatusQueryExecutionPlan.PNG (also attached).

    When this query is run over all values of ApplicationGUID using SQL Server Management Studio directly on the SQL Server box (i.e., RDPing into the box and running SSMS locally), the observed behaviour differs. The query is still intermittently slow around 20-25% of the time, but the effect is much less pronounced: generally it runs in <1ms, and when it's slow it takes 30-170ms.

    The fact that it runs slow less often, and is less slow when it does, when run on the box suggests that the observed problem in the web app may in fact have nothing to do with SQL Server, or the web app itself, but could be an infrastructure issue. I'm particularly suspicious of a networking issue.

    Here's an overview of the infrastructure...

    SQL Server 2014 is running on bare metal, dual socket hex core Xeons E5-2643 v3s clocked at 3.4GHz, with 128GB of RAM, of which 100GB is allocated to SQL Server. Storage wise we have 15k SAS internal storage divided into 4 volumes:

    OS - RAID 1, 275GB usable, 208GB free

    DATA1 - RAID 10, 1.08TB usable, 917GB free

    LOGS - RAID 10, 1.08TB usable, 1.04TB free

    TEMPDB - RAID 10, 1.08TB usable, 1.02TB free

    Data, logs, and tempdb are stored as the volume names indicate.

    Separately we also have a SATA array in a single volume:

    SATA - RAID 6, 7.27TB usable, 3.20TB free - used for FILESTREAM files (customer documents), and backups

    Storage connectivity is 10 gigabit iSCSI.

    The client is a .NET Web API application running in IIS 8.5 on Windows Server 2012 R2 Datacenter Edition, backed by SQL Server 2014 Standard Edition.

    The web servers comprise two VMs with 4 virtual cores running on VMware ESXi 6 with 12GB RAM allocated, also running Windows Server 2012 R2 Datacenter. The physical processors are 10-core Xeons (E5-2660 v2) running at 2.60GHz. They are fronted by a load balancer (I'm not sure what the spec is).

    Both the web server hosts and the SQL Server box are installed in the same datacentre, in the same physical rack, with gigabit connectivity between. They are, however, in separate firewalled VLANs.

    I also plan to run SQL Profiler on the box to see if anything obvious comes up there when this query runs. I'm not expecting to see anything particularly interesting, but it pays to make sure.
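    For completeness, server-side timings could also be captured with an Extended Events session rather than Profiler; something along these lines (session and file names are placeholders, and the STATUS_TRN filter is just one way of narrowing the capture):

    ```sql
    -- Sketch: capture completed RPC calls (the parameterised Dapper query arrives
    -- as an RPC) that touch STATUS_TRN, with server-side duration per call.
    CREATE EVENT SESSION [AppStatusQueryTiming] ON SERVER
    ADD EVENT sqlserver.rpc_completed
    (
        ACTION (sqlserver.client_hostname, sqlserver.sql_text)
        WHERE ([sqlserver].[like_i_sql_unicode_string]([statement], N'%STATUS_TRN%'))
    )
    ADD TARGET package0.event_file (SET filename = N'AppStatusQueryTiming')
    WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);
    GO
    ALTER EVENT SESSION [AppStatusQueryTiming] ON SERVER STATE = START;
    ```

    Comparing the duration column here against the duration seen by the client would show whether the extra ~500ms is spent inside SQL Server or somewhere on the wire.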

    Can anyone see anything obvious that might be causing this issue? Is it likely to be something to do with the VLAN or firewall configuration, or am I just clutching at straws with that?

    Any help gratefully received!

    Thanks,

    Bart

  • 1) That's a LOT of stuff for us "unpaid consultants" to wade through to offer help on a forum!!!

    2) I would run sp_whoisactive regularly (including using differential capabilities) to check for intermittent short-term blocking issue.

    3) I am leaning towards a compilation issue, at least if the problem is in SQL Server. ORMs throw text-based queries at SQL Server, often unparameterized at that. You can get crushed with a number of compilation and plan-cache/caching issues as a result. Can Dapper microORM be forced to parameterize queries if it isn't? Can you try Optimize for Ad Hoc Workloads? Can you try Forced Parameterization (note this is a sledge hammer setting that can definitely have negative consequences, but I have seen it be a magic bullet in a handful of circumstances over the years).

    4) You didn't give the index definitions in your script. You have to set up SSMS to actually script indexes as one of the settings in the tool. As a consultant I also always have triggers scripted too, btw, because clients tend to do very bad things with those. 🙂 It is possible that the optimizer goes wonky with plan choice if you have some overlapping indexes. Also note that more indexes can greatly increase compilation time in some cases.

    5) You didn't mention exactly how you did your iterative testing in SSMS over all values and how that compares to actual code fired by microORM. That could be meaningful.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Well, it seems as though we're always asking for information and you've already gathered quite a bit, so well done there. Kevin's already asked for good stuff. I'd like to know if there's an index that covers the query. Another is how current the statistics are on the index the query uses. Also, do you have a clustered index on your table or did I miss it?

  • TheSQLGuru (1/27/2016)


    1) That's a LOT of stuff for us "unpaid consultants" to wade through to offer help on a forum!!!

    Haha - and I bet you would have moaned at me if I hadn't posted enough info as well. Seriously, I have a bee in my bonnet when people ask for help but don't give you the full story.

    TheSQLGuru (1/27/2016)


    2) I would run sp_whoisactive regularly (including using differential capabilities) to check for intermittent short-term blocking issue.

    Thanks, I'll give this a try.
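    For anyone following along, a typical invocation (assuming Adam Machanic's sp_WhoIsActive is installed in master) might look like:

    ```sql
    -- Sketch: check for short-term blocking, including a differential (delta) capture.
    EXEC master.dbo.sp_WhoIsActive
        @find_block_leaders = 1,   -- surface the head of any blocking chain
        @get_plans = 1,            -- include the query plan for each session
        @delta_interval = 5;       -- diff activity over a 5-second window
    ```

    Run on a schedule (or repeatedly by hand while the slowdowns occur), this would catch blocking too brief to show up in casual observation.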

    TheSQLGuru (1/27/2016)


    3) I am leaning towards a compilation issue, at least if the problem is in SQL Server. ORMs throw text-based queries at SQL Server, often unparameterized at that. You can get crushed with a number of compilation and plan-cache/caching issues as a result. Can Dapper microORM be forced to parameterize queries if it isn't? Can you try Optimize for Ad Hoc Workloads? Can you try Forced Parameterization (note this is a sledge hammer setting that can definitely have negative consequences, but I have seen it be a magic bullet in a handful of circumstances over the years).

    I did wonder about that but in this case the query is *always* parameterised. Nonetheless, it seems sensible for me to have a look at Optimize for Ad Hoc Workloads.
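    For reference, the settings Kevin mentions can be enabled as below (the database name is a placeholder; forced parameterization is commented out given his sledgehammer warning):

    ```sql
    -- Sketch: server-level Optimize for Ad Hoc Workloads.
    EXEC sys.sp_configure N'show advanced options', 1;
    RECONFIGURE;
    EXEC sys.sp_configure N'optimize for ad hoc workloads', 1;
    RECONFIGURE;

    -- Forced parameterization is database-scoped and far more aggressive:
    -- ALTER DATABASE [YourDatabase] SET PARAMETERIZATION FORCED;
    ```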

    TheSQLGuru (1/27/2016)


    4) You didn't give the index definitions in your script. You have to set up SSMS to actually script indexes as one of the settings in the tool. As a consultant I also always have triggers scripted too, btw, because clients tend to do very bad things with those. 🙂 It is possible that the optimizer goes wonky with plan choice if you have some overlapping indexes. Also note that more indexes can greatly increase compilation time in some cases.

    My bad. I've yanked the original table definition from source control. I've removed most of the FK and CHECK definitions because they're probably irrelevant. All the index definitions are now there though.

    TheSQLGuru (1/27/2016)


    5) You didn't mention exactly how you did your iterative testing in SSMS over all values and how that compares to actual code fired by microORM. That could be meaningful.

    It was done with a cursor, I think. I've asked the DBA for the exact code he used since he ran it whilst I was looking at a different issue. I'll post it up as soon as I have it.

    Thanks for your help!

  • As I suspected, you have MULTIPLE NC indexes that start with ApplicationGUID. Since that value is VERY specific (a few tens of rows max out of 1.6M) I bet you can consolidate down to one (or perhaps 2, with the second EXACTLY covering a VERY critical query(s)) index with all ancillary columns INCLUDED.

    BUT, honestly, that is a fairly tight table. Well, other than those RIDICULOUS GUIDs. And probably using INTs for things that have low-enough value counts to use TINYINTs or SMALLINTs for. But I digress... 😀 Since the rows are small, and clustered on ApplicationGUID, I really question the value of those NC indexes with that column leading at all. That is a HUGE amount of work to maintain, store, FRAGMENT, DEFRAGMENT, lock/latch, update stats, backup, checkdb, etc. on!! You should get row locks on essentially every query that hits that table via ApplicationGUID (even if no other columns are included to narrow it down) so concurrency should go way up without those other indexes (along with ALL of the aforementioned benefits). If you do drop the indexes be sure to test at scale and also consider creating some manual multi-column stats to replace the statistics you will be losing (not necessarily critical though given the specificity of ApplicationGUID by itself).

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Ed Wagner (1/27/2016)


    Well, it seems as though we're always asking for information and you've already gathered quite a bit, so well done there. Kevin's already asked for good stuff. I'd like to know if there's an index that covers the query. Another is how current the statistics are on the index the query uses. Also, do you have a clustered index on your table or did I miss it?

    Hi Ed, thanks for getting back to me.

    Yes, the query is covered by an index (apologies - hadn't spotted I'd forgotten to include it in my original post):

    CREATE CLUSTERED INDEX [IX_STATUS_TRN_ApplicationGUID] ON [app].[STATUS_TRN]
    (
        [ApplicationGUID] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    The clustered index is on ApplicationGuid, rather than the primary key, because the table is most frequently queried by ApplicationGuid. The primary key definition is:

    ...
    CONSTRAINT [PK_STATUS_TRN] PRIMARY KEY NONCLUSTERED
    (
        [ApplicationStatusTrnGUID] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]

    Statistics on the table were last updated yesterday morning, after the table was bulk loaded, and not much has changed since. I did an UPDATE STATISTICS and compared the before and after, just to be sure, but there were no changes. Interestingly, I did find a couple of things I didn't expect: 633 rows with the same ApplicationGuid in one case, plus higher EQ_ROWS than my original post would have indicated - up to 71 in one case. Average range rows generally fall in the 10-20 range though, with a few lower outliers and a small handful in the low 20s.
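    (For anyone wanting to reproduce this check, the histogram figures above can be pulled with something like the following; EQ_ROWS and AVG_RANGE_ROWS appear per histogram step:)

    ```sql
    -- Sketch: inspect the statistics header and histogram on the clustered index.
    DBCC SHOW_STATISTICS ('app.STATUS_TRN', 'IX_STATUS_TRN_ApplicationGUID')
        WITH STAT_HEADER, HISTOGRAM;
    ```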

  • TheSQLGuru (1/27/2016)


    As I suspected, you have MULTIPLE NC indexes that start with ApplicationGUID. Since that value is VERY specific (a few tens of rows max out of 1.6M) I bet you can consolidate down to one (or perhaps 2, with the second EXACTLY covering a VERY critical query(s)) index with all ancillary columns INCLUDED.

    Interesting you mention this because I'd already queried it yesterday. I think that's a good shout. I'll try one or two consolidated indexes and measure what happens. Testing at scale shouldn't present too much of a challenge since we already have a full load of (real) data in place, so it's then just a case of applying a realistic load to the app, which is covered as part of our UAT.
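    A consolidated index along the lines Kevin suggests might look something like this (the index name is a placeholder, and the INCLUDE list is just the columns the problem query selects; which overlapping indexes can safely be dropped depends on the other query workloads):

    ```sql
    -- Sketch: one NC index leading on ApplicationGUID, covering the query's columns.
    CREATE NONCLUSTERED INDEX [IX_STATUS_TRN_ApplicationGUID_covering]
    ON [app].[STATUS_TRN] ([ApplicationGUID] ASC)
    INCLUDE ([ApplicationStatusTrnGUID], [StatusKey], [IsCurrent],
             [UpdatedBy], [UpdateDate], [CreatedBy], [CreatedDate])
    WITH (SORT_IN_TEMPDB = ON, ONLINE = OFF) ON [PRIMARY];
    GO
    -- ...then drop the now-redundant overlapping indexes, e.g.:
    DROP INDEX [IX_STATUS_TRN_ApplicationGUID_with_UpdateDate] ON [app].[STATUS_TRN];
    ```

    That said, since the clustered index already leads on ApplicationGUID, even this consolidated NC index may turn out to be unnecessary.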

  • After reading about your covering index and statistics, I have to agree with Kevin about the index reduction. The next place I'd go is one you probably won't like - the GUIDs. I'm a fan of integers for the clustered index, but I don't know if there's anything you can realistically do about it now.

  • Ed Wagner (1/27/2016)


    After reading about your covering index and statistics, I have to agree with Kevin about the index reduction. The next place I'd go is one you probably won't like - the GUIDs. I'm a fan of integers for the clustered index, but I don't know if there's anything you can realistically do about it now.

    Yeah, I know. I agree about the GUIDs, but they're literally everywhere - historical artifact. It's not something the client has time to deal with before release. Pulling them out of the app layer (and, in some key cases, they're used all the way up to the web front end) and retesting would take weeks, which is time they don't have. They make me sad because they're an absolute chore to deal with.

  • I don't think this is likely to be needed any more but here's the SQL used to run the query over all ApplicationGUIDs. No cursor in sight, as it turns out:

    DECLARE @AppGuid AS UNIQUEIDENTIFIER
    DECLARE @NumberRecords INT
    DECLARE @RowCount INT

    CREATE TABLE #Apps (
        RowID INT IDENTITY(1, 1),
        ApplicationGUID UNIQUEIDENTIFIER
    )

    INSERT INTO #Apps (ApplicationGUID)
    SELECT ApplicationGUID FROM app.APPLICATION_MST

    SET @NumberRecords = @@ROWCOUNT
    SET @RowCount = 1

    WHILE @RowCount <= @NumberRecords
    BEGIN
        SELECT @AppGuid = ApplicationGUID
        FROM #Apps
        WHERE RowID = @RowCount

        SET STATISTICS TIME ON

        SELECT [ta0_astd].[ApplicationStatusTrnGuid],
               [ta0_astd].[ApplicationGuid],
               [ta0_astd].[StatusKey],
               [ta0_astd].[IsCurrent],
               [ta0_astd].[UpdatedBy],
               [ta0_astd].[UpdateDate],
               [ta0_astd].[CreatedBy],
               [ta0_astd].[CreatedDate]
        FROM [app].[STATUS_TRN] AS ta0_astd
        WHERE ta0_astd.[ApplicationGuid] = @AppGuid

        SET STATISTICS TIME OFF

        SET @RowCount = @RowCount + 1
    END

    DROP TABLE #Apps

    I think the index reduction is the key, although this will come in handy later to check.

  • TheSQLGuru (1/27/2016)


    As I suspected, you have MULTIPLE NC indexes that start with ApplicationGUID. Since that value is VERY specific (a few tens of rows max out of 1.6M) I bet you can consolidate down to one (or perhaps 2, with the second EXACTLY covering a VERY critical query(s)) index with all ancillary columns INCLUDED.

    BUT, honestly, that is a fairly tight table. Well, other than those RIDICULOUS GUIDs. And probably using INTs for things that have low-enough value counts to use TINYINTs or SMALLINTs for. But I digress... 😀 Since the rows are small, and clustered on ApplicationGUID, I really question the value of those NC indexes with that column leading at all. That is a HUGE amount of work to maintain, store, FRAGMENT, DEFRAGMENT, lock/latch, update stats, backup, checkdb, etc. on!! You should get row locks on essentially every query that hits that table via ApplicationGUID (even if no other columns are included to narrow it down) so concurrency should go way up without those other indexes (along with ALL of the aforementioned benefits). If you do drop the indexes be sure to test at scale and also consider creating some manual multi-column stats to replace the statistics you will be losing (not necessarily critical though given the specificity of ApplicationGUID by itself).

    Please do try with NO non-clustered indexes too. Analyze your run/post-run metrics to see if it isn't acceptable or if you perhaps need just ONE very specific NC/INCLUDE index to get the performance you need.
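    For the test itself, disabling rather than dropping the NC indexes keeps the definitions in place so the experiment is easily reversible (index and statistics names below are illustrative; repeat for each NC index under test):

    ```sql
    -- Sketch: disable an NC index for the test...
    ALTER INDEX [IX_STATUS_TRN_StatusKey] ON [app].[STATUS_TRN] DISABLE;

    -- ...optionally replace the lost statistics with manual multi-column stats...
    CREATE STATISTICS [st_STATUS_TRN_AppGuid_StatusKey]
    ON [app].[STATUS_TRN] ([ApplicationGUID], [StatusKey]);

    -- ...and re-enable afterwards by rebuilding:
    ALTER INDEX [IX_STATUS_TRN_StatusKey] ON [app].[STATUS_TRN] REBUILD;
    ```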

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • 1) WHILE looping is essentially the same as a cursor - Row By Agonizing Row processing either way.

    2) In this case it was appropriate to do that because you WANT to ITERATE over each value and do something specific for it. I also note that your calling mechanism for that code is VERY different from your parameterized hit from the microORM. ANY difference between executions can have no, small, medium or drastic effects. Never forget that!!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • redundant comment - deleted.

  • Other matters have intervened, but I thought it was worth posting the end of the story here.

    There ended up being bigger fish to fry at the time, so I had to move on to other things, but this issue has reoccurred, then disappeared for a bit, then reoccurred again subsequently.

    I was back with this client last week, and seeing exactly the same kind of behaviour on wide variety of different queries. In fact it was the main reason they asked me to come back in because they'd been struggling to isolate and fix the problem themselves for some weeks.

    It was, to put it mildly, weird. You immediately think of the obvious things (indexing, parameter sniffing, etc.), but none of this seemed to fit the bill because the behaviour just wasn't right. Profiling SQL Server directly suggested that nothing was amiss: uniform reads and writes, along with CPU usage, for all instances of a query, affected by the delay and otherwise.

    This made me suspect something network related - something introducing latency - but I had to prove it, obviously.

    I put together a few test cases, including one that just executed the same query 100 times with the same parameters, and saw the same behaviour: SQL Profiler indicates that all is well, but query duration from the client's perspective is either 0-1ms or ~500ms, and nowhere in between.

    It was at this point that I fired up Wireshark. I wanted to collect traces from both the web host VM and the SQL Server box (SQL Server running on bare metal). This turned out to be more of a challenge than I'd anticipated, since running Wireshark on both boxes was in and of itself enough to "fix" the issue, and it looks like it stays fixed until reboot - I was starting to run out of VMs to test with. At first I wasn't even sure this was the case, since the issue is intermittent anyway: sometimes few queries are affected, at other times almost all, or anywhere in between.

    I then tried just capturing a trace on SQL Server and saw clearly the delay between sending back the results from one query, and receiving the next query, of 500ms. SQL Server was indeed, as profiler suggested, turning round every single query in less than 1ms.

    I then re-ran the test and this time, whilst the test was still running, started Wireshark on the web host VM. Immediately, whilst the test was still running, queries began to complete quickly (<1ms again), and stayed that way even after I quit Wireshark.

    So I did a bit of digging around and Googling. Turns out that our hypervisor is VMWare vSphere, and they have a whitepaper about deploying latency sensitive applications at http://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/latency-sensitive-perf-vsphere55-white-paper.pdf.

    The recommendations in this whitepaper will be implemented at a convenient moment this week and, I suspect, that'll fix the issue for good.

    Anyhow, bottom line is it isn't and never was anything to do with SQL Server, and it's always worth considering your virtualisation and infrastructure when you encounter this kind of issue.

    Thanks again for all the input on my original post.

  • Thanks for the very interesting follow-up! I can't count the number of clients I have had over the years that have totally screwed up their virtualization efforts, often disastrously so!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service
