High work_queue_count and THREADPOOL WAITS

Question

Post reply

High work_queue_count and THREADPOOL WAITS

steve.smith 42985

SSC Enthusiast

Points: 155
More actions
December 30, 2010 at 7:40 am

#232822

Hi
I am tuning the DB activity for an application that sends a high number of transactions in a very short space of time via a large number of threads.
When activity is highest I am seeing:
"SQL Server failed with error code 0xc0000000 to spawn a thread to process a new login or connection. Check the SQL Server error log and the Windows event logs for information about possible related problems."
and related connection errors on the application. Eventually the DB becomes unresponsive and I need to use the DAC to get a connection. Investigating DMVs I find that the highest WAIT stat is THREADPOOL (at 50% of resource WAITS) and SIGNAL WAITS are only < 8%
This led me to investigate sys.dm_os_schedulers, as it looked like possibe worker starvation, but I am struggling to understand the results I get:
'SELECT max_workers_count
FROM sys.dm_os_sys_info'
is 256
and yet
'SELECT scheduler_id ,
cpu_id ,
Status ,
is_online ,
is_idle ,
current_tasks_count ,
runnable_tasks_count ,
current_workers_count ,
active_workers_count ,
work_queue_count ,
pending_disk_io_count ,
load_factor
FROM sys.dm_os_schedulers
WHERE scheduler_id < 255
AND runnable_tasks_count > 0'
Shows only about 120 active workers, but 78 queued (see below)
0255VISIBLE ONLINE 109796161360188
1255VISIBLE ONLINE 10100205855422189
in fact 'select count(0) from sys.dm_os_schedulers' is never more than 130
Can anyone explain why the DB seems to become starved of workers when only half of the workers are allocated?
Regards, Steve
This environment is a 2 vCPU VM running Windows Server 2003 and SQL 2005 SP3, it is running on a HOST with 4 cores.

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

colin.Leversuch-Roberts SSC Guru Points: 52643 More actions · Answer 1

well it's a bit of a guess but I'd say the combination of VMware and 2 cores just doesn't provide enough workers.

There's some things you could try: disable parallelism, enable fibres.

I'd actually suggest you try to test on a physical box and/or get more cpus.

The whole thing about performance is that it's rerely one golden bullit. For example if your transactions/batches can't finish quick enough then you'll get queuing/blocking etc. this could be that your transaction log drive can't write fast enough ( quite likely on VMware which I've found absolute rubbish under load ) what you might be seeing is the after effect rather than the cause - blocking will ramp up threads, serialisation of stored procedures ( serialised execution plans ) can cause queuing the same. You may not have enough memory, disks may be too slow .. really tricky to know for sure.

[font="Comic Sans MS"]The GrumpyOldDBA[/font]
www.grumpyolddba.co.uk
http://sqlblogcasts.com/blogs/grumpyolddba/

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 2

I would agree that turning down parallelism should help with the worker starvation.

One other thing about VM boxes that I recently ran into is that you will want to talk to the VM Administrator and make sure your drives are RAW. If they're not RAW, VM gets involved in every read and write to the drives, also slowing things down.

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Jonathan Kehayias One Orange Chip Points: 26778 More actions · Answer 3

colin.Leversuch-Roberts (1/6/2011)
well it's a bit of a guess but I'd say the combination of VMware and 2 cores just doesn't provide enough workers.
There's some things you could try: disable parallelism, enable fibres.
I'd actually suggest you try to test on a physical box and/or get more cpus.
The whole thing about performance is that it's rerely one golden bullit. For example if your transactions/batches can't finish quick enough then you'll get queuing/blocking etc. this could be that your transaction log drive can't write fast enough ( quite likely on VMware which I've found absolute rubbish under load ) what you might be seeing is the after effect rather than the cause - blocking will ramp up threads, serialisation of stored procedures ( serialised execution plans ) can cause queuing the same. You may not have enough memory, disks may be too slow .. really tricky to know for sure.

I know that this is a really old post at this point, but I happened upon it while doing some research and I have to say that the above information is absolute rubish, even at the time that this actually posted. SQL Server under VMware scales just fine, as far back as ESX 3.0 circa the year 2000 when things are done correctly. There are a number of markers in the data presented that would point to other issues that are non-VM related here that would have required further investigation to determine the root cause of the issue here.

To start off with, this is SQL Server 2005, do the CPUID value of 255 for each scheduler tells that the schedulers are affinity masked in SQL Server to specific CPUs in the system. The 2003 nature would make me assume that this is 32 bit so you have a max worker count of 255, which is purely stipulation for the data presented, but that would make the worker thread pool saturation make a lot more sense here.

I am not trying to stir up a debate about past information here, but I think there is too little information in this thread to determine that the VM aspect is overall a problem here, and it is unlikely that this could ever be substatiated at this point. The reason for this comment is simply that I don't want an incorrect assumption to be made based on the merits of this discussion on more modern and improved virtualization platforms, and it's important to take a note of the overall lake of substantiating information that would lead to applying credibility to the information above. A lot more information would be required to actually ascertain that VM overhead was an issue here.

It is incredibly unfortunate, but I've seen a lot of old threads, like this one, that have been referenced recently for why SQL Server doesn't scale under VM configurations, so I wanted to leave this note here based on that experience. At this point, I doubt seriously that this will ever be something that gets another look from the original poster since it is so old, but something for future readers to consider. Technology changes so rapidly that you have to gauge what was said in the past against current technology and understand.