Massive Blocking

Question

Post reply

Massive Blocking

Orlando Colamatteo

SSC Guru

Points: 182293
More actions
August 25, 2011 at 7:37 pm

#242276

I am trying to diagnose the root cause of a situation where all apps connecting to a particular database start hour-glassing waiting for the database to return information. The database serves data to several different desktop applications, all part of a third-party commercial suite. The app vendor provides support when the database "locks up" (as the client says it). The vendor will log in, "do some magic", and everything goes back to normal until the next time the DB locks up. Currently the DB locks up on a daily basis, sometimes multiple times per day. The vendor will not divulge any details about said magic.
I was asked to find out what the vendor is doing, and in doing so hopefully determine the root cause so it can be corrected. I am offsite when the situation arises and the client needs to move fast to continue operations (i.e. call the vendor immediately). I provided a script for the in-house IT folks to run when the database locks up, and before they call the vendor for support, that:
1. Starts a trace that captures various columns for the SP:Starting and SQL:BatchStarting events.
2. Captures to a table a snapshot of all sessions, the SQL executing on each SPID and any blocking chains they are participating in.
After the vendor does their magic I have the IT folks a run a second script that stops the trace. The issue has occurred four times in the last two days and I have narrowed it down to extensive blocking issues within the database. The "magic" has been identified as the vendor logging in, opening Activity Monitor, and killing the SPID at the head of the blocking chain.
The first snapshot shows two head blockers, both from the same client host. That has me imagining they may have kicked off two of the same report inadvertently or something of the like, but that's only in the back of my mind (or wherever imagination happens).
I used DBCC PAGE to lookup the Wait Resources on the blocked SPIDs with LCK_M_S (the rest were CXPACKET waits) and they all relate back to two heaps, each with an insane number of columns and an even more insane number of multi-column non-clustered non-unique indexes. I am thinking I need to focus on converting these two heaps into clustered tables and look to reduce the number of indexes. I have not dug in but I will be shocked if any less than half of the indexes are redundant.
I have some joins happening and my snapshot query is pretty lengthy...this is more or less how I am capturing the SQL:
SELECT [text]
FROM sys.dm_exec_sql_text(COALESCE(sys.dm_exec_requests.sql_handle, sys.dm_exec_connections.most_recent_sql_handle))
One problem I am having is that in all cases the snapshot is showing that the head blocker SPIDs are executing the following SQL:
SET TEXTSIZE 2147483647
That's it! Has anyone seen this type of thing?
Here is my snapshot code in case you see a deficiency that might be causing me to miss the SQL text for these blocker threads:
SELECT DISTINCT
[SnapshotDateTime] = GETDATE(),
[Session ID] = s.session_id,
[Wait Time (ms)] = ISNULL(w.wait_duration_ms, 0),
[Wait Type] = ISNULL(w.wait_type, N''),
[Wait Resource] = ISNULL(w.resource_description, N''),
[Total CPU (ms)] = s.cpu_time,
[Total Physical I/O (MB)] = ( s.reads + s.writes ) * 8 / 1024,
[Memory Use (KB)] = s.memory_usage * 8192 / 1024,
[User Process] = CONVERT(CHAR(1), s.is_user_process),
[Login] = s.login_name,
[Database] = ISNULL(DB_NAME(r.database_id), N''),
[Task State] = ISNULL(t.task_state, N''),
[Command] = ISNULL(r.command, N''),
[Application] = ISNULL(s.program_name, N''),
[Blocked By] = ISNULL(CONVERT (VARCHAR, w.blocking_session_id), ''),
[Head Blocker] = CASE
-- session has an active request, is blocked, but is blocking others
WHEN r2.session_id IS NOT NULL
AND r.blocking_session_id = 0 THEN '1'
-- session is idle but has an open tran and is blocking others
WHEN r.session_id IS NULL THEN '1'
ELSE ''
END,
[Open Transactions] = ISNULL(r.open_transaction_count, 0),
[Login Time] = s.login_time,
[Last Request Start Time] = s.last_request_start_time,
[Host Name] = ISNULL(s.host_name, N''),
[Net Address] = ISNULL(c.client_net_address, N''),
[Execution Context ID] = ISNULL(t.exec_context_id, 0),
[Request ID] = ISNULL(r.request_id, 0),
[Workload Group] = N'',
[Handle] = COALESCE(r.[sql_handle], c.[most_recent_sql_handle]),
[ExecSQL] = (
SELECT [text]
FROM sys.dm_exec_sql_text(COALESCE(r.[sql_handle], c.[most_recent_sql_handle]))
)
FROM sys.dm_exec_sessions s
LEFT OUTER JOIN sys.dm_exec_connections c ON ( s.session_id = c.session_id )
LEFT OUTER JOIN sys.dm_exec_requests r ON ( s.session_id = r.session_id )
LEFT OUTER JOIN sys.dm_os_tasks t ON (
r.session_id = t.session_id
AND r.request_id = t.request_id
)
LEFT OUTER JOIN (
-- In some cases (e.g. parallel queries, also waiting for a worker), one thread can be flagged as
-- waiting for several different threads. This will cause that thread to show up in multiple rows
-- in our grid, which we don't want. Use ROW_NUMBER to select the longest wait for each thread,
-- and use it as representative of the other wait relationships this thread is involved in.
SELECT *,
ROW_NUMBER() OVER ( PARTITION BY waiting_task_address ORDER BY wait_duration_ms DESC ) AS row_num
FROM sys.dm_os_waiting_tasks
) w ON ( t.task_address = w.waiting_task_address )
AND w.row_num = 1
LEFT OUTER JOIN sys.dm_exec_requests r2 ON ( r.session_id = r2.blocking_session_id ) ;
Thanks for reading 🙂
There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply

AlexSQLForums SSChampion Points: 14277 More actions · Answer 1

If you have sql statements from that vendor app drop all those NC indexes and recreate only the ones you need.

Alex S

Jeff Moden SSC Guru Points: 1004704 More actions · Answer 2

Hi Orlando,

I know you know this but it'll make me feel a lot better if I say it out loud... I don't allow vendors such access. The "rule" is get on the phone with me and tell me what to look for. If it turns out to be a stupid code problem, be prepared to fix it or eat some high velocity pork chops. 😀

Ok... I got that out of my system. 😛

I've not seen a SET TEXTSIZE ever cause a problem although I'd certainly wonder why they need to do such a thing. Although it may not be the cause of a problem, it could be an indication of a larger problem which may be the cause of the blocking.

Shifting gears, I've seen such a thing happen when one SPID is doing an insert where low cardinality indexes are involved especially if the index doesn't reflect the correct insert order (and, they usually don't). The problem will, of course, be exacerbated if the problem index is a clustered index.

If it were me, I'd check for deadlocks in the log and I'd check for low cardinality indexes in the area of concern before I did much else on this problem. One other thing that I'd check for is to see if they had built a bloody "NextID" sequence table. Most people just don't know the right way to do such a thing.

I don't envy you on this problem. Problems like this are just not my idea of fun. As a bit of a side bar, my first inclination would be to tell the vendor you want a patch to fix the problem or you'll start searching for someone who knows what they're doing.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Evil Kraig F SSC Guru Points: 100851 More actions · Answer 3

Set Textsize shouldn't cause the blocking.

What it indicates WILL cause blocking. Especially if they fubared the design.

For those playing at home since OPC knows this because he uses google too... 🙂 : Set Textsize is a limiter on blob sizes that can be returned in a SELECT statement.

That means that you've got BLOBs in there. My guess is they're NOT in their own filegroup, for starters. Add that to the massiveness of their bloat and dealing with dueling locking mechanics and you're ripe for pain.

LCK_M_S is merely a shared lock wait. My curiousity was it on a table, page, or row? At a guess... table.

Put enough of these things together concurrently... yeaaaah. I personally would start with a search of sysmodules for that statement (SET TEXTSIZE 2147483647) and see what procs pop. If there aren't any... go beat the vendor to death with pork chops for improperly using dynamic SQL.

From there, I'd dig into the schemas looking for your LOB columns, and then do some directed review on their physical construction.

We know vendors are never wrong until you boot them in the head, so you're going to need something to use for a boot. If you're running into BLOB locks, that's a pretty solid boot. :w00t:

- Craig Farrell

Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

For better assistance in answering your questions[/url] | Forum Netiquette
For index/tuning help, follow these directions.[/url] |Tally Tables[/url]
Twitter: @AnyWayDBA

Orlando Colamatteo SSC Guru Points: 182293 More actions · Answer 4

Jeff Moden (8/26/2011)
Hi Orlando,
I know you know this but it'll make me feel a lot better if I say it out loud... I don't allow vendors such access. The "rule" is get on the phone with me and tell me what to look for. If it turns out to be a stupid code problem, be prepared to fix it or eat some high velocity pork chops. 😀
Ok... I got that out of my system. 😛

No doubt! I would insist on the same. However I am not onsite when the issues arise, and am not the front-line support when the system has to be brought back to working state. I am more of the after-the-fact see what happened and make recommendations on what we can do to mitigate the problems. The folks that do the front-line support onsite are not DBA resources so it has become a ritual for them to simply call the vendor and wait for the vendor support team to say all systems are go again.

My client has spoken to other owners of this particular software package and they do not have these same issues, but then again he is managing 10x the amount of data the next closest installation so the comparison is not apples to apples.

I've not seen a SET TEXTSIZE ever cause a problem although I'd certainly wonder why they need to do such a thing. Although it may not be the cause of a problem, it could be an indication of a larger problem which may be the cause of the blocking.

I have only seen this used by non-Microsoft drivers (e.g. PHP PDO) to set a few options to what they consider "standard" as an initial batch after making every connection to the database. WHy it would be the only thing in the buffer for that session though...I have no idea.

Shifting gears, I've seen such a thing happen when one SPID is doing an insert where low cardinality indexes are involved especially if the index doesn't reflect the correct insert order (and, they usually don't). The problem will, of course, be exacerbated if the problem index is a clustered index.

This is where I am focusing at the moment. I have defragmented the heaps I have found to be listed as Wait Resources in the data collected surrounding the blocking incidents so far. I am hoping for a reduction in blocking incidents as a result, and if things improve I'll be delving into that more.

If it were me, I'd check for deadlocks in the log and I'd check for low cardinality indexes in the area of concern before I did much else on this problem.

Thanks. I turned on trace flag 1222. Re: indexes, I have noticed many of the indexes on these heaps I have visited so far have the same two leading columns. As for whether the vendor really needs a covering index on all their queries, who knows, but for now it appears that the heaps are over-indexed. Fundamentally speaking I will be recommending that all be converted to clustered tables. All I have examined so far have no PK, but do have a unique non-clustered index on an identity column...why they did not use the identity column as a clustered primary key and left the table as a heap I have no idea. Does anyone know of barriers to why one would not make that the clustered primary key? I cannot think of any reason not to, and cannot think of any logical behavior changes brought about as a result of doing so, i.e. that change should be transparent.

One other thing that I'd check for is to see if they had built a bloody "NextID" sequence table. Most people just don't know the right way to do such a thing.

Luckily, no, they have not.

I don't envy you on this problem. Problems like this are just not my idea of fun.

Fun, that would be a stretch...good experience, definitely.

As a bit of a side bar, my first inclination would be to tell the vendor you want a patch to fix the problem or you'll start searching for someone who knows what they're doing.

That's being discussed. The client is looking to put a ton more data into this system and at the moment everything is on hold.

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Orlando Colamatteo SSC Guru Points: 182293 More actions · Answer 5

Evil Kraig F (8/26/2011)
Set Textsize shouldn't cause the blocking.
What it indicates WILL cause blocking. Especially if they fubared the design.
For those playing at home since OPC knows this because he uses google too... 🙂 : Set Textsize is a limiter on blob sizes that can be returned in a SELECT statement.
That means that you've got BLOBs in there. My guess is they're NOT in their own filegroup, for starters. Add that to the massiveness of their bloat and dealing with dueling locking mechanics and you're ripe for pain.

I think I get where you're going with seeing TEXTSIZE. I looked for LOB typed columns in the DB and came up empty and all physical index stats show IN_ROW_DATA so it may be a red herring in this particular case. Your line of thinking has been noted for future reference. Thank you for that.

LCK_M_S is merely a shared lock wait. My curiousity was it on a table, page, or row? At a guess... table.

On all incidents I am seeing page locks, with the exception of one case where I saw a keylock. All wait resources traced back to an index on a heap.

Put enough of these things together concurrently... yeaaaah. I personally would start with a search of sysmodules for that statement (SET TEXTSIZE 2147483647) and see what procs pop. If there aren't any... go beat the vendor to death with pork chops for improperly using dynamic SQL.

That search came up empty. I hope I am not dealing with dynamic sql here...

From there, I'd dig into the schemas looking for your LOB columns, and then do some directed review on their physical construction.
We know vendors are never wrong until you boot them in the head, so you're going to need something to use for a boot. If you're running into BLOB locks, that's a pretty solid boot. :w00t:

I'll keep you posted as this unfolds.

Thanks to both of you for your comments!

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

Orlando Colamatteo SSC Guru Points: 182293 More actions · Answer 6

After monitoring this system for the last 2+ weeks I have concluded that this is a projection of everyone's worst fears regarding violating, and possibly the interesting combination of, these best practices:

1. Do have a unique clustered index on every table.

2. Do not over-index tables with a high volume of inserts, updates and deletes.

3. Do not set fill factor to 100% by default (I like 90) and choose carefully on high-volume transactional tables.

4. Do regular index maintenance (and in this case heap maintenance too).

I was able to get the green light from the vendor to convert a unique non-clustered index on an identity column present on all of the heaps (at least they were consistent) I am monitoring to a unique clustered index. That will promote the heaps into the domain where a standard index reorg/rebuild routine will handle process.

My next goal is to get the vendor to agree to let me take over the non-clustered index tuning, i.e. to allow me to consolidate or drop redundant indexes and add new indexes as I see fit. They are checking to make sure they do not use table hints with named indexes in their code before signing off on that.

I finally have the system purring..."system locks" went from 15-20/day down to 0 😀

...but it was not without a cost...I am having to rebuild stats twice per day, am running an index maintenance job nightly and a heap maintenance job manually as needed until we make the clustered index changes outlined above.

Thanks for the input gents! It's awesome to know there are other like-minded people out there to lean on for guidance and share my pain with 🙂

There are no special teachers of virtue, because virtue is taught by the whole community.
--Plato

aureolin Hall of Fame Points: 3521 More actions · Answer 7

In my third party app, I typically see the "Set Textsize XXXXXX" as simply the first statement in a batch. It's the second or third statement in the batch that's causing the problem.