Time for a consultant

  • I haven't discussed this with management yet (so I'm not ready to entertain offers), but I think it's time to bring in a consultant. We're one of those shops with a .NET/C# ORM generating SQL for a browser front end, with a SQL Server back end.

    The main thing killing us right now is some internal reports that are run in-house during the business day. These have time constraints, but the teams are going to see if the runs can be spread out more (yes, we need a data warehouse, but the argument has always been that they want real-time data).

    Development is going to see if they can improve the report code, including writing SQL outside of the ORM. Like many shops, we've largely dealt with growth by spending money on hardware, and putting our ASPState session DB and tempdb on flash storage is coming up next quarter.

    We brought in a consultant several years back, selected by a former CIO who wasn't knowledgeable. I don't think the consultant was all that bad -- he realized he wasn't going to convert this shop entirely to stored procedures (a non-starter) -- but as I recall, most of his suggestions were ignored (after paying him a tidy sum).

    Our main user DB is nearing 6 TB, running on a two-node (active/passive) Windows Server 2008 R2 cluster with SQL Server Enterprise (Core edition). SQL and Windows are both 64-bit, with 512 GB of memory and 64 logical cores on each server.

    I think I'm quite a ways beyond the "accidental DBA," but I can see enough areas needing expert analysis (even outside the SQL generation) that I need help. I'll be discussing that with my manager soon.

    Just one example: we turned on forced parameterization years ago, probably after observing many single-use plans (even though the application code almost always sends parameterized SQL). Was this a mistake?
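    For reference, the current setting can be checked per database, and reverting it is a single command. YourDB below is a stand-in name, and note that changing the setting invalidates that database's cached plans:

    -- Check which databases have forced parameterization on
    SELECT name, is_parameterization_forced
    FROM sys.databases;

    -- Reverting is one command (flushes that database's plans):
    -- ALTER DATABASE YourDB SET PARAMETERIZATION SIMPLE;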

    I now have a nightly job removing single-use plans that haven't executed during the past 24 hours (a sketch of the approach follows the counts below). Most of what I now see as compiled ad hoc plans comes from our production support team, along with some bits from Agent jobs and our Spotlight monitoring tool.

    SELECT cp.objtype, COUNT(*)
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
    WHERE cp.cacheobjtype = N'Compiled Plan'
      AND cp.objtype IN (N'Adhoc', N'Prepared')
      AND cp.usecounts = 1
    GROUP BY cp.objtype
    OPTION (RECOMPILE);

    objtype   count
    Adhoc     1105
    Prepared  303
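    A sketch of that nightly cleanup (illustrative only -- the real job also filters on last execution time via sys.dm_exec_query_stats, which this sketch omits):

    DECLARE @plan_handle varbinary(64);
    DECLARE plans CURSOR LOCAL FAST_FORWARD FOR
        SELECT plan_handle
        FROM sys.dm_exec_cached_plans
        WHERE cacheobjtype = N'Compiled Plan'
          AND objtype = N'Adhoc'
          AND usecounts = 1;
    OPEN plans;
    FETCH NEXT FROM plans INTO @plan_handle;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        DBCC FREEPROCCACHE (@plan_handle); -- evicts one plan from the cache
        FETCH NEXT FROM plans INTO @plan_handle;
    END
    CLOSE plans;
    DEALLOCATE plans;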

  • How much of the plan cache is being used for AdHoc queries? Can you post the results of this query?

    SELECT objtype AS [CacheType],
           COUNT_BIG(*) AS [Total Plans],
           SUM(CAST(size_in_bytes AS DECIMAL(18, 2))) / 1024 / 1024 AS [Total MBs],
           AVG(usecounts) AS [Avg Use Count],
           SUM(CAST((CASE WHEN usecounts = 1 THEN size_in_bytes
                          ELSE 0
                     END) AS DECIMAL(18, 2))) / 1024 / 1024 AS [Total MBs – USE Count 1],
           SUM(CASE WHEN usecounts = 1 THEN 1
                    ELSE 0
               END) AS [Total Plans – USE Count 1]
    FROM sys.dm_exec_cached_plans
    GROUP BY objtype
    ORDER BY [Total MBs – USE Count 1] DESC;
    GO

  • Not sure if this will be readable. Keep in mind the deletion job I mentioned in my first post.

    CacheType  Total Plans  Total MBs     Avg Use Count  Total MBs – USE Count 1  Total Plans – USE Count 1
    Prepared   16946        13440.328139  4647           114.804687               406
    Adhoc      12525        259.965087    3845           26.241165                11463
    Proc       156          204.992187    9454           77.9375                  16
    View       1052         92.453125     1234           1.210937                 18
    Check      91           3.210937      5475           0.445312                 13
    UsrTab     7            0.257812      39082          0.03125                  1
    Trigger    4            2.148437      15974          0                        0


  • Ok, there does not appear to be any cache bloat. How are your waits? Please run the query below and post the results:

    WITH [Waits] AS
    (
        SELECT [wait_type],
               [wait_time_ms] / 1000.0 AS [WaitS],
               ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS],
               [signal_wait_time_ms] / 1000.0 AS [SignalS],
               [waiting_tasks_count] AS [WaitCount],
               100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage],
               ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum]
        FROM sys.dm_os_wait_stats
        WHERE [wait_type] NOT IN (
            N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR', N'BROKER_TASK_STOP',
            N'BROKER_TO_FLUSH', N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE', N'CHKPT',
            N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'CLR_SEMAPHORE',
            N'DBMIRROR_DBM_EVENT', N'DBMIRROR_EVENTS_QUEUE', N'DBMIRROR_WORKER_QUEUE',
            N'DBMIRRORING_CMD', N'DIRTY_PAGE_POLL', N'DISPATCHER_QUEUE_SEMAPHORE',
            N'EXECSYNC', N'FSAGENT', N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'FT_IFTSHC_MUTEX',
            N'HADR_CLUSAPI_CALL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION',
            N'HADR_LOGCAPTURE_WAIT', N'HADR_NOTIFICATION_DEQUEUE', N'HADR_TIMER_TASK',
            N'HADR_WORK_QUEUE', N'KSOURCE_WAKEUP', N'LAZYWRITER_SLEEP', N'LOGMGR_QUEUE',
            N'ONDEMAND_TASK_QUEUE', N'PWAIT_ALL_COMPONENTS_INITIALIZED',
            N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP',
            N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP',
            N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE', N'SERVER_IDLE_CHECK',
            N'SLEEP_BPOOL_FLUSH', N'SLEEP_DBSTARTUP', N'SLEEP_DCOMSTARTUP',
            N'SLEEP_MASTERDBREADY', N'SLEEP_MASTERMDREADY', N'SLEEP_MASTERUPGRADED',
            N'SLEEP_MSDBSTARTUP', N'SLEEP_SYSTEMTASK', N'SLEEP_TASK',
            N'SLEEP_TEMPDBSTARTUP', N'SNI_HTTP_ACCEPT', N'SP_SERVER_DIAGNOSTICS_SLEEP',
            N'SQLTRACE_BUFFER_FLUSH', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP',
            N'SQLTRACE_WAIT_ENTRIES', N'WAIT_FOR_RESULTS', N'WAITFOR',
            N'WAITFOR_TASKSHUTDOWN', N'WAIT_XTP_HOST_WAIT',
            N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG', N'WAIT_XTP_CKPT_CLOSE',
            N'XE_DISPATCHER_JOIN', N'XE_DISPATCHER_WAIT', N'XE_TIMER_EVENT')
          AND [waiting_tasks_count] > 0
    )
    SELECT MAX ([W1].[wait_type]) AS [WaitType],
           CAST (MAX ([W1].[WaitS]) AS DECIMAL (16,2)) AS [Wait_S],
           CAST (MAX ([W1].[ResourceS]) AS DECIMAL (16,2)) AS [Resource_S],
           CAST (MAX ([W1].[SignalS]) AS DECIMAL (16,2)) AS [Signal_S],
           MAX ([W1].[WaitCount]) AS [WaitCount],
           CAST (MAX ([W1].[Percentage]) AS DECIMAL (5,2)) AS [Percentage],
           CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgWait_S],
           CAST ((MAX ([W1].[ResourceS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgRes_S],
           CAST ((MAX ([W1].[SignalS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgSig_S]
    FROM [Waits] AS [W1]
    INNER JOIN [Waits] AS [W2] ON [W2].[RowNum] <= [W1].[RowNum]
    GROUP BY [W1].[RowNum]
    HAVING SUM ([W2].[Percentage]) - MAX ([W1].[Percentage]) < 95; -- percentage threshold
    GO

  • I have unraveled such situations quite a few times; there are always some nice surprises. 😎

    Two questions: what isolation level is in use, and what is the ratio of optimization timeouts?
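    The timeout ratio can be read from sys.dm_exec_query_optimizer_info -- occurrences of 'timeout' versus total 'optimizations'. For example:

    SELECT counter, occurrence, value
    FROM sys.dm_exec_query_optimizer_info
    WHERE counter IN (N'optimizations', N'timeout');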

  • WaitType             Wait_S       Resource_S   Signal_S   WaitCount   Percentage  AvgWait_S  AvgRes_S  AvgSig_S
    CXPACKET             17341361.09  16919887.22  421473.86  1330259108  55.48       0.0130     0.0127    0.0003
    PAGEIOLATCH_SH       4728690.06   4716703.38   11986.68   280371613   15.13       0.0169     0.0168    0.0000
    WRITELOG             2040599.38   1975409.11   65190.27   236600149   6.53        0.0086     0.0083    0.0003
    LCK_M_IS             1837891.57   1837834.79   56.77      48177       5.88        38.1487    38.1476   0.0012
    LCK_M_U              1712299.88   1704071.81   8228.07    27617580    5.48        0.0620     0.0617    0.0003
    PAGEIOLATCH_EX       941725.85    940379.73    1346.11    60831560    3.01        0.0155     0.0155    0.0000
    LCK_M_IX             520778.58    520768.76    9.82       18512       1.67        28.1319    28.1314   0.0005
    IO_COMPLETION        392239.67    388572.05    3667.62    65147232    1.25        0.0060     0.0060    0.0001
    SOS_SCHEDULER_YIELD  377693.25    5320.11      372373.14  1025755444  1.21        0.0004     0.0000    0.0004

    Isolation settings, etc.:

    DBname             snapshot_isolation  read_committed_snapshot  recovery_model  page_verify  auto_create_stats  auto_update_stats  auto_update_stats_async  ansi_null_default  ansi_nulls  ansi_padding  log_reuse_wait  date_correlation
    COLLATERALMANAGER  OFF                 1                        FULL            CHECKSUM     1                  1                  1                        0                  0           0             NOTHING         0
    ASPState           OFF                 0                        SIMPLE          CHECKSUM     1                  1                  1                        0                  0           0             CHECKPOINT      0
    Messaging          OFF                 0                        FULL            CHECKSUM     1                  1                  1                        0                  0           0             LOG_BACKUP      0
    Reporting          OFF                 1                        FULL            CHECKSUM     1                  1                  1                        0                  0           0             NOTHING         0
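    (These columns come straight from sys.databases; the query used was presumably along these lines:)

    SELECT name AS DBname,
           snapshot_isolation_state_desc,
           is_read_committed_snapshot_on,
           recovery_model_desc,
           page_verify_option_desc,
           is_auto_create_stats_on, is_auto_update_stats_on, is_auto_update_stats_async_on,
           is_ansi_null_default_on, is_ansi_nulls_on, is_ansi_padding_on,
           log_reuse_wait_desc,
           is_date_correlation_on
    FROM sys.databases;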

  • Ok, so CXPACKET waits are at the top of your waits and account for over 50% of the total. What are MAXDOP (max degree of parallelism) and cost threshold for parallelism currently set to on your instance?

  • Just one example: we turned on forced parameterization years ago -- probably after observing many single-use plans. ( even though the application code almost always sends parameterized sql ). Was this a mistake?

    Yes, that was a mistake; that is absolutely not what that setting is for. Optimize for Ad Hoc Workloads is the setting you want for lots of single-use plans. But with a crushingly bad ORM, the plan cache hash lookups can actually become their own bottleneck.
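    For anyone following along, Optimize for Ad Hoc Workloads is an instance-level setting under the advanced options (and, as the configuration output posted later in this thread shows, it is already enabled here):

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'optimize for ad hoc workloads', 1;
    RECONFIGURE;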

    Drop me a private message if you are seriously interested in getting a consultant on board. I have 45,000 hours of my life wrapped up in the SQL Server relational engine, the vast majority of it as a consultant, and all of it concerned with performance. I promise I can find low-hanging fruit to help ease the pain, plus a plethora of medium- and long-term objectives to go fix. I have plenty of references I can provide if desired. I also promise that you will only ignore my recommendations if logic, reason, and math just don't make sense to you, because I will show you why your stuff is bad in irrefutable terms. 😎

    BTW, you had better test going to bigger hardware -- AT SCALE -- BEFORE you go there!! I have had a client make their system unusable doing that (without telling me, of course, despite my having worked with them for 6 years), and others have had less painful yet still poor experiences compared to their expectations.

    Bet I know why your CXPACKET waits are happening too ... even more than the obvious missing indexes potential.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • name                            value_in_use
    cost threshold for parallelism  40
    max degree of parallelism       8
    max server memory (MB)          458,752
    min server memory (MB)          356,352
    optimize for ad hoc workloads   1
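    (Presumably pulled with something like:)

    SELECT name, value_in_use
    FROM sys.configurations
    WHERE name IN (N'cost threshold for parallelism',
                   N'max degree of parallelism',
                   N'max server memory (MB)',
                   N'min server memory (MB)',
                   N'optimize for ad hoc workloads');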

  • I've emailed my manager about the possibility of bringing in a consultant.

    Can't seem to paste Excel in here so attached the info requested on Waits, sql configurations, etc.

  • Indianrock (11/25/2016)


    I've emailed my manager about the possibility of bringing in a consultant.

    Can't seem to paste Excel in here so attached the info requested on Waits, sql configurations, etc.

    If you choose to go with me, I will say that I doubt I am like any other consultant you have had. And I have cleaned up after quite a few over the years, including ones from large entities that rhyme with names such as Accenture and Microsoft. :hehe:

    Your metrics aren't too surprising, at least assuming the server has been up for a very long time. 200 DAYS of CXPACKET waits, almost 100 DAYS of IO-related stalls. And the locks likely have IO to thank in part for being so high too. Those long IS/IX lock waits are often an easy fix once the culprit(s) are identified.

    You have over 11K single-use plans, but they only take up 26 MB of RAM, so who cares; that seems reasonable given optimize for ad hoc workloads is on and you flush plans each night. The prepared plan count and size deserve some investigation, though. I still think forced parameterization will need to go. I have only seen one true win with it since it came out, and that was an abysmally bad classic ADO nested-looping recordset mess of an app.
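    If you want to catch the lock-wait culprits in the act, live waits and their blockers can be pulled from sys.dm_os_waiting_tasks -- a quick sketch:

    SELECT wt.session_id,
           wt.wait_type,
           wt.wait_duration_ms,
           wt.blocking_session_id,
           wt.resource_description
    FROM sys.dm_os_waiting_tasks AS wt
    WHERE wt.wait_type LIKE N'LCK[_]M[_]%'
    ORDER BY wt.wait_duration_ms DESC;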

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • ORMs work, up until they don't. Too many tables joined, sometimes with outer joins; too many values in IN lists (e.g., 50+); parameters declared as varchar matching the string length rather than a set varchar(8000) or the column length. One large table (100 million records) has grown through the code-release process over the years to 130 columns and 30 indexes -- this one is involved in the deadlocks we get daily during the three runs of the report I mentioned in the original post.
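    To illustrate the parameter-length issue (dbo.Widget and @name are made-up stand-ins): two textually identical statements that differ only in the declared varchar length compile and cache as separate prepared plans.

    EXEC sp_executesql
        N'SELECT * FROM dbo.Widget WHERE name = @name',
        N'@name varchar(6)', @name = 'abcdef';

    -- Same statement text, different declared length: a second cached plan
    EXEC sp_executesql
        N'SELECT * FROM dbo.Widget WHERE name = @name',
        N'@name varchar(12)', @name = 'abcdef';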

    The good news is that large paying customers are being impacted by these internal reports and are starting to complain loudly -- so the heat is on both development and the operations group that runs these reports.

    -- 31,350 Total Plans
    SELECT COUNT(*) AS totalPlanCount
    FROM sys.dm_exec_cached_plans;

    -- 2,315 queries where compilation timed out
    SELECT COUNT(*)
    FROM sys.dm_exec_query_stats
    CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS qt
    CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS qp
    WHERE CHARINDEX('StatementOptmEarlyAbortReason="TimeOut"',
                    CONVERT(nvarchar(max), qp.query_plan)) > 0;
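    (The same check can be run against the plan XML directly, avoiding the nvarchar conversion -- a sketch:)

    SELECT COUNT(*)
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
    WHERE qp.query_plan.exist('
        declare namespace p="http://schemas.microsoft.com/sqlserver/2004/07/showplan";
        //p:StmtSimple[@StatementOptmEarlyAbortReason="TimeOut"]') = 1;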

  • Indianrock (11/25/2016)


    ORMs work, up until they don't. Too many tables joined, sometimes with outer joins; too many values in IN lists (e.g., 50+); parameters declared as varchar matching the string length rather than a set varchar(8000) or the column length. One large table (100 million records) has grown through the code-release process over the years to 130 columns and 30 indexes -- this one is involved in the deadlocks we get daily during the three runs of the report I mentioned in the original post.

    The good news is that large paying customers are being impacted by these internal reports and are starting to complain loudly -- so the heat is on both development and the operations group that runs these reports.

    -- 31,350 Total Plans
    SELECT COUNT(*) AS totalPlanCount
    FROM sys.dm_exec_cached_plans;

    -- 2,315 queries where compilation timed out
    SELECT COUNT(*)
    FROM sys.dm_exec_query_stats
    CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS qt
    CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS qp
    WHERE CHARINDEX('StatementOptmEarlyAbortReason="TimeOut"',
                    CONVERT(nvarchar(max), qp.query_plan)) > 0;

    The varying declared parameter lengths obviously contribute to the total plan count. Be thankful, at least, that they are varchar and not nvarchar.
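    One way to quantify that duplication is to group on query_hash, which should collapse the length variants of the same statement (a sketch, available on 2008 R2):

    -- Statements that differ only in parameter declarations share a query_hash
    SELECT qs.query_hash,
           COUNT(DISTINCT qs.plan_handle) AS cached_plans
    FROM sys.dm_exec_query_stats AS qs
    GROUP BY qs.query_hash
    HAVING COUNT(DISTINCT qs.plan_handle) > 5
    ORDER BY cached_plans DESC;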

    I have had a number of clients that hobble by with better hardware, duct tape, and rubber bands until it all comes crashing down. Sadly, I have had two clients bring me on board too late to save them when they tried to go up-market: I couldn't fix things fast enough to keep the big fish from walking, which led to the failure of both companies. Such a shame.

    Remember my statement about faster hardware possibly being bad? You are a poster child for that with the deadlocks caused by poor structures and suboptimal indexing.

    There are some options available to help mitigate the query compilation timeouts.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Indianrock (11/25/2016)


    I've emailed my manager about the possibility of bringing in a consultant.

    If you don't mind working with someone on another continent (I'm in South Africa), I don't have a huge number of pending projects.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • TheSQLGuru (11/25/2016)


    Indianrock (11/25/2016)


    ORMs work, up until they don't. Too many tables joined, sometimes with outer joins; too many values in IN lists (e.g., 50+); parameters declared as varchar matching the string length rather than a set varchar(8000) or the column length. One large table (100 million records) has grown through the code-release process over the years to 130 columns and 30 indexes -- this one is involved in the deadlocks we get daily during the three runs of the report I mentioned in the original post.

    The good news is that large paying customers are being impacted by these internal reports and are starting to complain loudly -- so the heat is on both development and the operations group that runs these reports.

    -- 31,350 Total Plans
    SELECT COUNT(*) AS totalPlanCount
    FROM sys.dm_exec_cached_plans;

    -- 2,315 queries where compilation timed out
    SELECT COUNT(*)
    FROM sys.dm_exec_query_stats
    CROSS APPLY sys.dm_exec_sql_text(sql_handle) AS qt
    CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS qp
    WHERE CHARINDEX('StatementOptmEarlyAbortReason="TimeOut"',
                    CONVERT(nvarchar(max), qp.query_plan)) > 0;

    The varying declared parameter lengths obviously contribute to the total plan count. Be thankful, at least, that they are varchar and not nvarchar.

    I have had a number of clients that hobble by with better hardware, duct tape, and rubber bands until it all comes crashing down. Sadly, I have had two clients bring me on board too late to save them when they tried to go up-market: I couldn't fix things fast enough to keep the big fish from walking, which led to the failure of both companies. Such a shame.

    Remember my statement about faster hardware possibly being bad? You are a poster child for that with the deadlocks caused by poor structures and suboptimal indexing.

    There are some options available to help mitigate the query compilation timeouts.

    From another post of mine:

    From my personal experience.

    On one reasonably big database (of a GPS provider), I managed to cut query response time (on average) by 7 times simply by redefining all NVARCHAR(MAX) columns as NVARCHAR(500) (no actual string in there was longer than 250 characters).

    7 times.

    With no index tuning, without even seeing a single line of code.

    I presume that nvarchars are not an issue? By default the .NET code sets all string parameters to nvarchar(max). Ugh!
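    A quick illustration of why that default hurts (dbo.Widget and its varchar name column are made-up stand-ins): an nvarchar parameter compared against a varchar column forces the column side to be converted, which can prevent a straightforward index seek depending on the collation.

    -- nvarchar parameter vs. varchar column: CONVERT_IMPLICIT on the column side
    DECLARE @p_nvar nvarchar(4000) = N'abc';
    SELECT * FROM dbo.Widget WHERE name = @p_nvar;

    -- Matching varchar parameter: a plain index seek is possible
    DECLARE @p_var varchar(100) = 'abc';
    SELECT * FROM dbo.Widget WHERE name = @p_var;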
