Running out of workers

  • Erland

    Our 'max worker threads' is set at the default "0"

    Next time I see this condition I will query the sys.dm_os_workers tables directly to get the count.

    Since the Activity Monitor in SQL Server 2008 is so bad I usually take a quick look at it then run a couple of different queries that I found on this forum to really see what is going on.

    I have wondered if maybe there is a bad join in one of the queries. If you have time and are willing, I could post the two queries that I use for you to see. On the surface they both look correctly formed but there was one table/view that I am not familiar with so I took the poster's word for the correctness. For "normal" processes the number of records returned is accurate so it gave me the feeling that the number for the massive records could be correct.

  • ELLEN-610393 (10/17/2013)I have wondered if maybe there is a bad join in one of the queries. If you have time and are willing, I could post the two queries that I use for you to see. On the surface they both look correctly formed but there was one table/view that I am not familiar with so I took the poster's word for the correctness. For "normal" processes the number of records returned is accurate so it gave me the feeling that the number for the massive records could be correct.

    Yes, it sounds like there is a join problem. Please post your queries, although it may take some days before I get the time to look at them. But maybe someone who is not at PASS can step in.

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • Erland Sommarskog (10/16/2013)


    dwilliscp (10/16/2013)


    So if you have one SPID, but 700 rows, that means the server is trying to do parallel processing? They all show CXPacket.. So how do I track down whay this is going on? (about twice per week)?

    You and Ellen are completely unrelated, aren'y you? I am just wondering if we are trying to to solve one or two problems here.

    Where do you see 700 rows? CXPacket waits token of a parallel query, where some threads have completed their job and are waiting for some other thread to complete. Typically this happens when the optimizer makes a misestimation on how to partition the data over the threads.

    To see what is going on, the first step is to find which querythe process is running. My own favourite tool is beta_lockinfo, but then again I wrote it. You find it on http://www.sommarskog.se/sqlutil/beta_lockinfo.html. This gives you a snapshot on what is going on in the system, including current query and plan.

    Unrelated ... The following is one of the queries that I use to show what is going on... when the server is not responding, and unlike the worker query, this is the one that showed me the 700 rows with the same SPID (all generated by the same query), and I could not see one that was not CXPACKET..but maybe I should have filtered out the CXPACKET wait types after I noticed the SPID creating all the issues.

    USE master

    GO

    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

    SELECT

    er.session_Id AS [Spid]

    , sp.ecid

    , er.start_time

    , DATEDIFF(SS,er.start_time,GETDATE()) as [Age Seconds]

    , sp.nt_username

    , er.status

    , er.wait_type

    , SUBSTRING (qt.text, (er.statement_start_offset/2) + 1,

    ((CASE WHEN er.statement_end_offset = -1

    THEN LEN(CONVERT(NVARCHAR(MAX), qt.text)) * 2

    ELSE er.statement_end_offset

    END - er.statement_start_offset)/2) + 1) AS [Individual Query]

    , qt.text AS [Parent Query]

    , sp.program_name

    , sp.Hostname

    , sp.nt_domain

    FROM sys.dm_exec_requests er

    INNER JOIN sys.sysprocesses sp ON er.session_id = sp.spid

    CROSS APPLY sys.dm_exec_sql_text(er.sql_handle)as qt

    WHERE session_Id > 50

    AND session_Id NOT IN (@@SPID)

    ORDER BY session_Id, ecid

  • When in doubt, tinker. Tinker with what you are allowed to change (which doesn't sound like much). Can you set MAXDOP to 4? MAXDOP 0 can be a killer as it'll bulk up on threads for a query on the assumption nothing else is going to run. A few milliseconds later when 100 similar queries have been executed, VOILA! Thousands of extra threads. For my dollar, even with hyperthreading MAXDOP should be set no higher than the number of physical cores - half that in AMD environments or busy Intel environments.

    Usually things are disk bound anyway. Sure would be nice to install a few SSDs and use them for TEMPDB - that can make a huge difference everywhere.

    Oh - and bad statistics. That could indirectly cause a lot of parallelism and extra threads.

  • Hello,

    You can use the query below, execute it every 30 seconds or so .. and whatch the number of connections as well !

    SELECT

    SUM(current_workers_count)current_visible_workers,

    SUM(active_workers_count)active_visible_workers,

    (SELECT COUNT(*) FROM sys.dm_exec_connections) connections

    FROM sys.dm_os_schedulers

    -- Those schedulers that have IDs greater than or equal to 255 are used internally by SQL Server

    -- check BOL !

    WHERE scheduler_id < 255

    Cheers,

    R

  • ..

  • dwilliscp (10/17/2013)


    FROM sys.dm_exec_requests er

    INNER JOIN sys.sysprocesses sp ON er.session_id = sp.spid

    Here is a potential problem. This join is good in most cases, but not if you use MARS (Multiple Active Result Sets) or SOAP requests, in which case there may be multiple requests for the same session id.

    If we ignore that, and assume that a single request is generating 700 worker threads, that is is excessive. You need to look into the query plan for that query, and investigate if you can add indexes or improve the query.

    There has been some other suggestions for diagnostic queries. I can offer beta_lockinfo, whicb you find at http://www.sommarskog.se/sqlutil/beta_lockinfo.html. This gives you both the query, the plan as well as a bunch of other information. And itjoins the various DMVs correctly. :--)

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • ELLEN-610393 (10/17/2013)


    Erland

    Our 'max worker threads' is set at the default "0"

    Next time I see this condition I will query the sys.dm_os_workers tables directly to get the count.

    Since the Activity Monitor in SQL Server 2008 is so bad I usually take a quick look at it then run a couple of different queries that I found on this forum to really see what is going on.

    I have wondered if maybe there is a bad join in one of the queries. If you have time and are willing, I could post the two queries that I use for you to see. On the surface they both look correctly formed but there was one table/view that I am not familiar with so I took the poster's word for the correctness. For "normal" processes the number of records returned is accurate so it gave me the feeling that the number for the massive records could be correct.

    Thank you [anyone] that can look at this query and if there is a flaw I welcome correction.

    The query that I use is:

    SELECT

    [Session ID] = s.session_id,

    [User Process] = CONVERT(CHAR(1), s.is_user_process),

    [Login] = s.login_name,

    [Database] = ISNULL(db_name(p.dbid), N''),

    [Task State] = ISNULL(t.task_state, N''),

    [Command] = ISNULL(r.command, N''),

    [Application] = ISNULL(s.program_name, N''),

    [Wait Time (ms)] = ISNULL(w.wait_duration_ms, 0),

    [Wait Type] = ISNULL(w.wait_type, N''),

    [Wait Resource] = ISNULL(w.resource_description, N''),

    [Blocked By] = ISNULL(CONVERT (varchar, w.blocking_session_id), ''),

    [Head Blocker] =

    CASE

    -- session has an active request, is blocked, but is blocking others or session is idle but has an open tran and is blocking others

    WHEN r2.session_id IS NOT NULL AND (r.blocking_session_id = 0 OR r.session_id IS NULL) THEN '1'

    -- session is either not blocking someone, or is blocking someone but is blocked by another party

    ELSE ''

    END,

    [Total CPU (ms)] = s.cpu_time,

    [Total Physical I/O (MB)] = (s.reads + s.writes) * 8 / 1024,

    [Memory Use (KB)] = s.memory_usage * 8192 / 1024,

    [Open Transactions] = ISNULL(r.open_transaction_count,0),

    [Login Time] = s.login_time,

    [Last Request Start Time] = s.last_request_start_time,

    [Host Name] = ISNULL(s.host_name, N''),

    [Net Address] = ISNULL(c.client_net_address, N''),

    [Execution Context ID] = ISNULL(t.exec_context_id, 0),

    [Request ID] = ISNULL(r.request_id, 0),

    [Workload Group] = ISNULL(g.name, N'')

    FROM sys.dm_exec_sessions s WITH (NOLOCK)

    LEFT OUTER JOIN sys.dm_exec_connections c WITH (NOLOCK) ON (s.session_id = c.session_id)

    LEFT OUTER JOIN sys.dm_exec_requests r WITH (NOLOCK) ON (s.session_id = r.session_id)

    LEFT OUTER JOIN sys.dm_os_tasks t WITH (NOLOCK) ON (r.session_id = t.session_id AND r.request_id = t.request_id)

    LEFT OUTER JOIN

    ( -- In some cases (e.g. parallel queries, also waiting for a worker), one thread can be flagged as

    -- waiting for several different threads. This will cause that thread to show up in multiple rows

    -- in our grid, which we don't want. Use ROW_NUMBER to select the longest wait for each thread,

    -- and use it as representative of the other wait relationships this thread is involved in.

    SELECT *, ROW_NUMBER() OVER (PARTITION BY waiting_task_address ORDER BY wait_duration_ms DESC) AS row_num

    FROM sys.dm_os_waiting_tasks WITH (NOLOCK)

    ) w ON (t.task_address = w.waiting_task_address) AND w.row_num = 1

    LEFT OUTER JOIN sys.dm_exec_requests r2 WITH (NOLOCK) ON (s.session_id = r2.blocking_session_id)

    LEFT OUTER JOIN sys.dm_resource_governor_workload_groups g WITH (NOLOCK) ON (g.group_id = s.group_id)

    LEFT OUTER JOIN sys.sysprocesses p WITH (NOLOCK) ON (s.session_id = p.spid)

    WHERE s.session_id > 50

    ORDER BY s.session_id;

  • If one query is generating several hundred worker threads that really sounds like you have MAXDOP set to 0 (the default). Since you know the problematic query apparently, just take that query and append OPTION (MAXDOP 4) to the end of it and fire it off in a few query windows. If it doesn't cause a couple thousand worker threads you've found your problem.

  • ELLEN-610393 (10/18/2013)


    LEFT OUTER JOIN sys.sysprocesses p WITH (NOLOCK) ON (s.session_id = p.spid)

    Here is the problem. Sysprocess does not have one per process, it has one per process and execution context id. And sometimes there can even be multiple rows for ecid = 0. Since you already have one per task, you get a complete explosion here.

    Does this query come from the Activity Monitor?

    In my beta_lockinfo, I do this to avoid this problem:

    JOIN (SELECT spid, dbid = MIN(dbid), open_tran = MIN(open_tran)

    FROM sys.sysprocesses

    WHERE ecid = 0

    GROUP BY spid) AS sp ON sp.spid = es.session_id

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • Erland:

    Thank you, I will use your replacement in the query.

    and now that you ask, that query may be from the Activity Monitor. I do know that one day I did capture a query that I think was from the connection for the Activity Monitor.

  • I've made some research and the query does indeed come from Activity Monitor. Furthermore, the query is the same in SQL 2012 SP1. And for that matter, SQL 2014 CTP2.

    I found a Connect item from 2009, https://connect.microsoft.com/SQLServer/feedback/details/496162/duplicated-lines-in-ssms-2008-r2-activity-monitor, that reported the problem. However, it was closed as Won't Fix. It was not a priority...

    [font="Times New Roman"]Erland Sommarskog, SQL Server MVP, www.sommarskog.se[/font]

  • Erland ... I so appreciate that you have kept with this issue!

    Personally I think that for them to knowingly continue to use a query that is incorrect is pathetic.

    Many of us are doing our best to learn how to properly manage systems for our employers. We should be able to depend on the product owners to correct mistakes when discovered. You would think at the minimum they would have corrected it in the newer versions.

    I put your fix into the version of the script that I now use.

    Thank you.

Viewing 13 posts - 31 through 42 (of 42 total)

You must be logged in to reply to this topic. Login to reply