Long wait times for parallel queries

Question

Post reply

Long wait times for parallel queries

donald.mcneill

SSC-Addicted

Points: 473
More actions
January 6, 2017 at 3:41 am

#333553

I have an old SQL 2005 server we are currently migrating away form that has suddenly started under performing. I have tracked the issue down to queries running in parallel. There is currently only 2 cores allocated to the server (virtualised) and if un a run a query where the execution plan suggests it should be run in parallel I get one thread in a continuous CXPACKET wait (hours if left). If I use maxdop 1 option on the query it starts running but still takes a very long time to complete and will give small PAGEIOLATCH_SH waits with high reads and tiny amount of writes. Does anyone have an idea what the issue could be?

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 1

With just that information, no.

Can you give us more detail on the query, its execution plan at the very least?

Keep in mind that when a query parallels, one thread WILL sit with a CXPacket wait the entire time. It's the controller thread, it coordinates the others, the other threads do the work.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

donald.mcneill SSC-Addicted Points: 473 More actions · Answer 2

I've uploaded the xml query plan. Not sure how useful it will be as pretty up every query that requires parallelism is having an issue.

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 3

Before we go on, what exactly are you declaring to be the issue?

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

donald.mcneill SSC-Addicted Points: 473 More actions · Answer 4

We have overnight jobs on this server that update a data warehouse. I noticed the job time had jumped from around 4 hours to 10 hours without anything changing as far as the amount of data being processed or any configuration on the server. The same job has been running stable for over 5 years without an issue like this.

On investigating the dm_exec_requests view I saw I had a query with a CXPACKET wait time of 3 hours when that query would normally complete in 5 - 10 minutes. I ran the query using MAXDOP 1 and it started executing but, looking at the statistics it was doing a lot of reading but no writing with intermittent PAGEIOLATCH_SH waits before returning to runnable status but never see running. The server is virtualised so I'm trying to determine if something has changed on the VM side rather than on the database instance.

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 5

The CXPacket wait is not a problem. sys.dm_exec_requests only shows thread 0 of a parallel query and, as I explained above, thread 0 is the controller thread. It will always post a CXPacket wait for the entire of a parallel query. This is normal. It's not that the query isn't doing anything. It is, the other threads are running the query.

So, the problem here is that the query has gone from 4 hours to 10 hours without anything changing?

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

donald.mcneill SSC-Addicted Points: 473 More actions · Answer 6

Yes - the job has gone from 4 - 10 hours (there are number of update queries called via a stored procedure). Queries that normally completed in a short period of time are now taking ages - looking like something is stopping them completing. I've never have lengthy wait times on this server up to a couple of days ago. I tried rebuilding the indexes on the database to see if that helped, however, I was even getting a problem doing that as the long wait time occurred again. This is an old windows 3003 32 bit server so I do not have much memory to play with (usual 4Gb) so there is a lot of paging going on in large queries so I was trying to see if there was potentially an IO issue. Pagefile is currently over 2gb in size - and remoting on to the server is very slow. Is there any metric within dm_exec_requests that will help identify an IO issue? I've looked at the reads and writes metrics and can see a lot of reading going on but little writes - 250083 reads, 105828 logical reads but only 5 writes?

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 7

It's probably an execution plan change, possibly triggered by data volume changes.

Do you have any record of what the queries were running like before this?

Do you have any historical record of execution plans?

Did all procedures suddenly go slow at one point, or has it been intermittent?

Can you try running a statistics update, WITH FULLSCAN on every single table involved in one of the slow queries?

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Jeff Moden SSC Guru Points: 1003863 More actions · Answer 8

but, looking at the statistics it was doing a lot of reading but no writing with intermittent PAGEIOLATCH_SH waits before returning to runnable status but never see running.

I don't know if it'll fix your problem but we went through the same thing on a large table recently. They inserted more than 200 thousand rows in the table and what normally ran almost instantly now wouldn't return even after an hour and it had the same symptoms as above. It turned out to be index/column statistics. We rebuilt them on the big table and everything went back to being snappy.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeffrey Williams SSC Guru Points: 90252 More actions · Answer 9

donald.mcneill (1/6/2017)
Yes - the job has gone from 4 - 10 hours (there are number of update queries called via a stored procedure). Queries that normally completed in a short period of time are now taking ages - looking like something is stopping them completing. I've never have lengthy wait times on this server up to a couple of days ago. I tried rebuilding the indexes on the database to see if that helped, however, I was even getting a problem doing that as the long wait time occurred again. This is an old windows 3003 32 bit server so I do not have much memory to play with (usual 4Gb) so there is a lot of paging going on in large queries so I was trying to see if there was potentially an IO issue. Pagefile is currently over 2gb in size - and remoting on to the server is very slow. Is there any metric within dm_exec_requests that will help identify an IO issue? I've looked at the reads and writes metrics and can see a lot of reading going on but little writes - 250083 reads, 105828 logical reads but only 5 writes?

If you are sure there hasn't been a huge data load - or that statistics are stale - I would then focus on what changed in the environment. Since this server is a VM I would engage your server and storage teams to investigate and identify any changes that have occurred on the host or the SAN.

Have them monitor your server while these processes are running to see if there are any IO issues.

I would also recommend that your open perfmon on the server and monitor the physical disk and the disk seconds per read\write\throughput during your processing. You should be no higher than 10ms for read and ideally less than 6ms per write. If you see higher sustained values (not spikes) then it could indicate an issue with the IO subsystem.

Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

― Charles R. Swindoll

How to post questions to get better answers faster
Managing Transaction Logs

donald.mcneill SSC-Addicted Points: 473 More actions · Answer 10

Thanks. I carried out a full index rebuild and updated statistics for the main database on the server and that seems to have helped the issue a bit. During Christmas and new year our data volumes decrease due to our client's centre being closed for holidays so a week with lower or no data coming through each day may have triggered the execution plans to change. We have also had some issues with the VM storage which didn't help.