Large Buffer Latch wait

  • SQL Server 2012 Enterprise, 64-bit, with 500 GB of memory and a NetApp SAN for data. Should I be concerned about the high buffer latch waits below? We do have a lot of poor ORM-generated queries hitting a combined OLTP/reporting database. MAXDOP is 8 and cost threshold for parallelism is 40.

    select latch_class, REPLACE(CONVERT(money, waiting_requests_count, 1), '.00', '') as waitingRequestCount,

    max_wait_time_ms/1000 as maxwaitseconds,

    wait_time_ms/1000 as waittimeseconds from sys.dm_os_latch_stats

    order by waiting_requests_count desc

    latch_class                      waitingRequestCount    maxwaitseconds    waittimeseconds
    BUFFER                           3673072609             36                10093008
    ACCESS_METHODS_DATASET_PARENT    275357397              54                503141

    LatchClass    Wait_S         WaitCount     Percentage    AvgWait_S
    BUFFER        10087941.83    3658178345    95.23         0.0028

    LatchClass                       Wait_S       WaitCount    Percentage    AvgWait_S
    ACCESS_METHODS_DATASET_PARENT    503132.41    275353378    99.49         0.0018

    WITH [Latches] AS

    (SELECT

    [latch_class],

    [wait_time_ms] / 1000.0 AS [WaitS],

    [waiting_requests_count] AS [WaitCount],

    100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],

    ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]

    FROM sys.dm_os_latch_stats

    WHERE [latch_class] NOT IN (N'BUFFER') and

    [wait_time_ms] > 0

    )

    SELECT

    MAX ([W1].[latch_class]) AS [LatchClass],

    CAST (MAX ([W1].[WaitS]) AS DECIMAL(14, 2)) AS [Wait_S],

    MAX ([W1].[WaitCount]) AS [WaitCount],

    CAST (MAX ([W1].[Percentage]) AS DECIMAL(14, 2)) AS [Percentage],

    CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (14, 4)) AS [AvgWait_S]

    FROM [Latches] AS [W1]

    INNER JOIN [Latches] AS [W2]

    ON [W2].[RowNum] <= [W1].[RowNum]

    GROUP BY [W1].[RowNum]

    HAVING SUM ([W2].[Percentage]) - MAX ([W1].[Percentage]) < 95; -- percentage threshold

    GO

  • 1) You missed telling us a critical piece of information: how long has the server been up? You provided aggregate information, so if the server has been up for a very long time the waits aren't as meaningful as if it had only been up for a few weeks or months.

    2) I don't normally start out tuning by looking at latches. What do the waits look like?

    3) I despise NetApp. Regardless of that, what is the IO configuration? Are you more on the 10-spindles-in-RAID5 side of things, or 100-in-RAID10? And how big is the pipe from the SAN to the server?

    4) Possibly more important than waits are the IO stalls on the box (a quick per-file query is sketched after this list).

    5) Are things actually not running quickly enough? Your aggregate numbers cover 24/7, which often includes server maintenance, ETL, batch processes, etc., where it doesn't matter how slow things are as long as they complete in the given window.
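
    For point 4, a quick way to pull the cumulative stalls per database file is sys.dm_io_virtual_file_stats. A rough sketch only; these numbers are cumulative since startup, so the same snapshot-and-diff caveat applies:

    -- Cumulative IO stalls per database file since the last restart.
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.physical_name,
           vfs.num_of_reads,
           vfs.num_of_writes,
           vfs.io_stall_read_ms,
           vfs.io_stall_write_ms,
           1.0 * vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_stall_ms,
           1.0 * vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_stall_ms
    FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
           INNER JOIN sys.master_files AS mf
               ON  mf.database_id = vfs.database_id
               AND mf.file_id     = vfs.file_id
    ORDER BY vfs.io_stall DESC;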

    The proper thing to do is a differential analysis of waits/IO stalls (and latches if you want): capture a snapshot, wait some minutes while the stuff of concern is happening, capture again, then join, diff, and divide by the elapsed time to get things per unit of time. A rough sketch of that approach follows.
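
    A minimal sketch against sys.dm_os_wait_stats (the temp table names and five-minute delay are just placeholders; the same join-and-diff pattern works for sys.dm_os_latch_stats and sys.dm_io_virtual_file_stats):

    -- Snapshot, wait, snapshot again, then diff.
    IF OBJECT_ID('tempdb..#waits_start') IS NOT NULL DROP TABLE #waits_start;
    IF OBJECT_ID('tempdb..#waits_end')   IS NOT NULL DROP TABLE #waits_end;

    SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms,
           GETDATE() AS capture_time
    INTO   #waits_start
    FROM   sys.dm_os_wait_stats;

    WAITFOR DELAY '00:05:00';   -- let the workload of concern run

    SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms,
           GETDATE() AS capture_time
    INTO   #waits_end
    FROM   sys.dm_os_wait_stats;

    -- Diff the snapshots and normalize to ms of wait per second of elapsed time.
    SELECT e.wait_type,
           e.waiting_tasks_count - s.waiting_tasks_count AS waits,
           e.wait_time_ms - s.wait_time_ms               AS wait_ms,
           (e.wait_time_ms - s.wait_time_ms) * 1.0
               / NULLIF(DATEDIFF(SECOND, s.capture_time, e.capture_time), 0) AS wait_ms_per_sec
    FROM   #waits_end AS e
           INNER JOIN #waits_start AS s ON s.wait_type = e.wait_type
    WHERE  e.wait_time_ms - s.wait_time_ms > 0
    ORDER BY wait_ms DESC;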

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Some of that I can answer. The SQL Server has been up since December 8th, 2016.

    This morning, the query showing the most buffer latch waits is pasted below. It does a full-text index search across a large table without specifying a customer ID (development has the ORM inserting a "largeclient" parameter in a failed attempt to make SQL Server create a new plan).

    Here's a screenshot of locks/latches from our Quest Spotlight tool.

    SELECT TOP 100

    AccountOwnershipDocSummary02.ACCOUNT_ID AS AccountId,

    AccountOwnershipDocSummary02.OWNERSHIP_DOC_ID AS OwnershipDocId,

    AccountOwnershipDocSummary02.OWNER_FULL_NAMES AS Owners,

    AccountOwnershipDocSummary02.BORROWER_FULL_NAMES AS Borrowers,

    AccountOwnershipDocSummary02.OWNERSHIP_DOC_MODIFIED_MANUFACTURER_ID

    AS OwnershipDocVin,

    AccountOwnershipDocSummary02.ACCOUNT_MODIFIED_MANUFACTURER_ID

    AS ModifiedAccountVin,

    AccountOwnershipDocSummary02.ORIGINAL_ACCOUNT_MANUFACTURER_ID

    AS OriginalAccountVin,

    AccountOwnershipDocSummary02.OWNERSHIP_DOC_ISSUING_STATE_ABBR

    AS OwnershipDocState,

    AccountOwnershipDocSummary02.CLIENT_ID AS ClientId,

    AccountOwnershipDocSummary02.LIENHOLDER_IDENTIFIER

    AS LienholderIdentifier,

    AccountOwnershipDocSummary02.CUSTOM_ATTRIBUTE_1 AS AccountNumber,

    AccountOwnershipDocSummary02.CUSTOM_ATTRIBUTE_2 AS LoanNumber,

    AccountOwnershipDocSummary02.CUSTOM_ATTRIBUTE_3 AS LoanSuffix,

    AccountOwnershipDocSummary02.CUSTOM_ATTRIBUTE_4 AS Branch,

    AccountOwnershipDocSummary02.EXPECTED_TITLING_STATE_ABBR

    AS AccountState,

    AccountOwnershipDocSummary02.STATUS AS Status,

    AccountOwnershipDocSummary02.LICENSE_NUMBER AS LicenseNumber,

    AccountOwnershipDocSummary02.OWNERSHIP_DOC_DOCUMENT_NUMBER

    AS TitleNumber,

    Client14.SHORT_NAME AS ClientShortName,

    @LargeClient2 AS LargeClient

    FROM ACCOUNT_OWNERSHIP_DOC_SUMMARY AS AccountOwnershipDocSummary02

    INNER JOIN

    ORGANIZATION AS Client14

    ON AccountOwnershipDocSummary02.CLIENT_ID = Client14.ORGANIZATION_ID

    WHERE ( ( (CONTAINS (AccountOwnershipDocSummary02

    .BORROWER_FULL_NAMES, @DerivedTable01_BORROWER_FULL_NAMES20)

    OR ( CONTAINS(AccountOwnershipDocSummary02.OWNER_FULL_NAMES,

    @DerivedTable01_OWNER_FULL_NAMES41))))) ORDER BY 1

  • 1) Tune the hell out of your FTS subsystem.

    2) See if OPTION (RECOMPILE) can improve things here - or dynamic SQL (guarding against SQL Injection, obviously).

    3) Given that you are ordering by AccountId (which smells like a clustered PK identity), if that is what is carried into FTS, I would get the top 100 AccountIds based on just the FTS predicate into a temp table, and then hit the full set of tables for the final output. If there is an FK into the ORGANIZATION table, you should remove that join from the query that gets the AccountIds to process; that could make it dramatically faster (see the sketch after this list).

    4) You have a lot of latches per second. But again I think waits and IO stalls are more important until proven otherwise.

    5) What does sp_whoisactive show? Note that this can also do a differential analysis and provide a TON of useful information.
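
    A rough sketch of the two-step approach from point 3, reusing the table, column, and parameter names from the posted query (untested, and it assumes ACCOUNT_ID uniquely identifies a row in ACCOUNT_OWNERSHIP_DOC_SUMMARY):

    -- Step 1: let full-text alone pick the top 100 account ids, with no join.
    SELECT TOP 100 s.ACCOUNT_ID
    INTO   #TopAccounts
    FROM   ACCOUNT_OWNERSHIP_DOC_SUMMARY AS s
    WHERE  CONTAINS(s.BORROWER_FULL_NAMES, @DerivedTable01_BORROWER_FULL_NAMES20)
       OR  CONTAINS(s.OWNER_FULL_NAMES, @DerivedTable01_OWNER_FULL_NAMES41)
    ORDER BY s.ACCOUNT_ID;

    -- Step 2: go back for the wide column list, but only for those 100 rows.
    SELECT s.ACCOUNT_ID AS AccountId,
           s.OWNERSHIP_DOC_ID AS OwnershipDocId,
           -- ...the rest of the original column list...
           c.SHORT_NAME AS ClientShortName
    FROM   #TopAccounts AS t
           INNER JOIN ACCOUNT_OWNERSHIP_DOC_SUMMARY AS s ON s.ACCOUNT_ID = t.ACCOUNT_ID
           INNER JOIN ORGANIZATION AS c ON c.ORGANIZATION_ID = s.CLIENT_ID
    ORDER BY s.ACCOUNT_ID;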

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • I'm the guy waiting for development to figure out how to add option(recompile) to some of the ORM-generated queries 🙂 I have been creating some plan guides with option(recompile) along with targeted filtered stats on large tables/large clients since we definitely have parameter sniffing/data skew causing bad plans to be re-used by large clients.
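
    For reference, the plan guides look roughly like this; a sketch only, since the @stmt and @params strings have to match what the ORM actually sends (as captured from the plan cache) character for character, and the guide name and parameter types below are placeholders:

    EXEC sp_create_plan_guide
        @name            = N'PG_AccountSearch_Recompile',    -- placeholder name
        @stmt            = N'SELECT TOP 100 ... ORDER BY 1', -- exact ORM statement text goes here
        @type            = N'SQL',
        @module_or_batch = NULL,
        @params          = N'@LargeClient2 int, @DerivedTable01_BORROWER_FULL_NAMES20 nvarchar(4000), @DerivedTable01_OWNER_FULL_NAMES41 nvarchar(4000)', -- must match the ORM parameter list exactly
        @hints           = N'OPTION (RECOMPILE)';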

    I also just added a job to freeProcCache 3 times a week. Things do seem better, time will tell.

  • I do reorganize the full-text index every night.

    ALTER FULLTEXT CATALOG [SummaryTableSearchFullNames] REORGANIZE

    I run these frequently

    EXEC sp_whoisactive @find_block_leaders = 1, --@get_additional_info = 1,--@format_output = 2,

    @sort_order = '[blocked_session_count] desc',-- '[cpu] desc', --

    @output_column_list = '[dd hh:mm:ss.mss][session_id][database_name][sql_text][percent_complete][block%][host_name][login_name][start_time][login_time][sql_command][used_memory][temp%]

    [wait_info][tasks][tran_log%][cpu%][reads%][writes%][context%][physical%][query_plan][locks][collection_time]';

    go

    exec sp_blitzwho
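
    For the differential analysis Kevin mentioned, sp_whoisactive can also sample twice and report the deltas itself; something like this (a sketch using its documented @delta_interval parameter, with a 60-second interval just as an example):

    EXEC sp_whoisactive
        @delta_interval = 60,        -- collect twice, 60 seconds apart, and show per-interval deltas
        @sort_order = '[CPU] DESC';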

  • Indianrock (1/5/2017)


    I'm the guy waiting for development to figure out how to add option(recompile) to some of the ORM-generated queries 🙂 I have been creating some plan guides with option(recompile) along with targeted filtered stats on large tables/large clients since we definitely have parameter sniffing/data skew causing bad plans to be re-used by large clients.

    I also just added a job to freeProcCache 3 times a week. Things do seem better, time will tell.

    Back in the day I had to set up a FREEPROCCACHE job to run every 15 minutes!! Horrible classic ADO app on SQL 2005 pre-SP1 (where the procedure cache could take up 80% or so of your buffer pool).

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service
