Mind-blowing parallelism issue

  • To be clear, I'm not posting this looking for anyone to tell me exactly how I should set MAXDOP or the cost threshold. I realize this is more voodoo than exact science. What I am after is any information or suggestions about a particular issue we had yesterday.

    We have an OLTP application with its own database server (SQL Server 2008 SP2; I know it needs an update, but downtime is hard to come by), and generally things run very well. However, calls started coming in from users saying they could open "these" records but not "those" records, and the application was timing out while trying to look up the problematic records. The app has decent logging, so I was able to capture the statements it claimed were timing out and run them manually. They came back in a snap, so I have no idea how a timeout could occur on them. I also ran a trace of all app activity related to these lookups, and all of those statements ran fast as well.

    Of course, before any of this I looked for the obvious, such as locking/blocking (the kind of check sketched below) and anything else under the sun. As far as I could tell, nothing on the server was hurting performance.
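
    For reference, this is the sort of locking/blocking check I mean: a minimal sketch over the standard 2008 DMVs, not necessarily the exact query I ran.

    -- Quick look for blocked or waiting user requests (SQL Server 2008 DMVs).
    -- A non-zero blocking_session_id would point straight at a blocker.
    SELECT r.session_id,
           r.blocking_session_id,
           r.wait_type,
           r.wait_time,
           r.status,
           t.text AS current_statement
    FROM   sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE  r.session_id > 50            -- skip system sessions
      AND  r.session_id <> @@SPID;      -- skip this query itself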

    This server is also used by a set of piggyback applications written in-house that add functionality to the third-party app mentioned above. At times those apps execute queries that go parallel, and over the last eight months or so I have tweaked the MAXDOP setting trying to find a balance. When I set it low, the users of the third-party app are happy but the add-on apps are not, and if I swing it the other way the situation reverses.

    While we were having the lookup issue I checked how much parallelism was going on, and it was nothing out of the ordinary. The server is a dual four-core Xeon with hyper-threading enabled, so there are sixteen logical processors, and I had MAXDOP set to eight. Nothing I had seen gave me a reason to change that value, but I decided to take a chance and dropped it to four.

    What happened? All of the lookup problems went away. I could understand this if every record had the issue, but it was only some of them. These records are all the same as far as table structure, data types, etc., and the lookup method is identical for the good and the problematic records. So the shot-in-the-dark MAXDOP change resolved the issue, but why? Of course we were happy to get past it, but we do not like having absolutely no idea why this seemingly arbitrary change was so effective. (The change itself is sketched below, for reference.)
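
    For the record, the change was just the standard server-level setting, along these lines (a sketch, not the exact script I ran):

    -- Drop the server-wide 'max degree of parallelism' from 8 to 4.
    -- Requires 'show advanced options'; takes effect as soon as RECONFIGURE runs.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max degree of parallelism', 4;
    RECONFIGURE;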

    Cheers

  • Do you have indexing on any of these tables? I ask because of the memory-allocation requirements at higher degrees of parallelism.

    From: http://technet.microsoft.com/en-us/library/ms190787.aspx

    For both aligned and nonaligned indexes, the memory requirement can be greater if SQL Server is applying degrees of parallelism to the build operation on a multiprocessor computer. This is because the greater the degrees of parallelism, the greater the memory requirement. For example, if SQL Server sets degrees of parallelism to 4, a nonaligned partitioned index with 100 partitions requires sufficient memory for four processors to sort 4,000 pages at the same time, or 16,000 pages. If the partitioned index is aligned, the memory requirement is reduced to four processors sorting 40 pages, or 160 (4 * 40) pages. You can use the MAXDOP index option to manually reduce the degrees of parallelism.
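
    If it does come down to index-build memory, the MAXDOP index option the article mentions looks something like this (illustrative only; the table and index names are made up):

    -- Cap parallelism for a single index build instead of server-wide.
    -- dbo.Tickets and IX_Tickets_Lookup are placeholder names.
    CREATE NONCLUSTERED INDEX IX_Tickets_Lookup
        ON dbo.Tickets (TicketNumber)
        WITH (MAXDOP = 4);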

    Sorry if this doesn't apply to your situation, just rolling ideas in my head.

  • I would think that comparing the execution plans at DOP 4 and DOP 8 would be the starting point (see the sketch below for one way to do that without touching the server-wide setting).

    Any chance that your tempdb setup is not optimised for MAXDOP 8?
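
    Something along these lines would get you both plans for one of the problem lookups (the table, column and value are placeholders; substitute one of the actual statements from your trace):

    -- Capture the actual plan for the same lookup at DOP 4 and at DOP 8.
    SET STATISTICS XML ON;

    DECLARE @TicketNumber int = 12345;   -- placeholder value

    SELECT *
    FROM   dbo.Tickets
    WHERE  TicketNumber = @TicketNumber
    OPTION (MAXDOP 4);

    SELECT *
    FROM   dbo.Tickets
    WHERE  TicketNumber = @TicketNumber
    OPTION (MAXDOP 8);

    SET STATISTICS XML OFF;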

    MM

  • There are indexes on these tables, but nothing "exotic": the typical clustered primary key and a few other indexes on relevant columns. On this server tempdb is not on its own array and can't be, as the server is maxed out; that will change when it is replaced. I did look at tempdb, but there was no file growth and the read/write metrics were where they typically are (something like the check sketched below).
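
    For completeness, the file side of that tempdb check is roughly this (a sketch, not the exact query):

    -- tempdb file layout: count, size and autogrowth settings per file.
    SELECT name,
           type_desc,
           physical_name,
           size / 128 AS size_mb,        -- size is stored in 8 KB pages
           growth,
           is_percent_growth
    FROM   sys.master_files
    WHERE  database_id = DB_ID('tempdb');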

    The queries that trigger parallelism are not generated by the main application (the one with the lookup issues), and I've even had the programmer add the MAXDOP 1 query hint to some of that code to control this (an example of the hint is below). The kicker is that nothing, as far as we know, has changed in how anything is done. No application or database changes were made, with the exception of the MAXDOP setting yesterday.
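
    The hint itself is just the standard statement-level one; a made-up example of the kind of thing added to the in-house queries:

    -- Placeholder query; the point is the OPTION clause on the statement.
    SELECT CustomerId,
           SUM(LineTotal) AS Total
    FROM   dbo.AddOnDetail
    GROUP BY CustomerId
    OPTION (MAXDOP 1);   -- force a serial plan for this statement only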

    What really bothers me is that this setting change corrected the issue with looking up certain tickets. As mentioned, the lookup method, how the tickets are stored, and so on are exactly the same for the good and the problematic ones. As far as the application, storage, and other components go, these *should* be a one-to-one kind of thing. I could see it if everything were slow, meaning every ticket had a lookup issue, but that wasn't the case.

    Cheers

  • That could be data skew combined with some form of parameter sniffing, and changing MAXDOP may just have had the lucky effect of compiling a stable plan that is good enough for all sets of parameters (one way to test that theory is sketched below).
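
    A quick way to test the sniffing theory on one of the problem tickets is to force a fresh compile for just that statement and see whether the plan changes (placeholder names and value again):

    -- Compile this statement with the "bad" ticket's value and compare
    -- the resulting plan against the one the app normally gets.
    DECLARE @TicketNumber int = 12345;   -- one of the problematic values

    SELECT *
    FROM   dbo.Tickets
    WHERE  TicketNumber = @TicketNumber
    OPTION (RECOMPILE);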

    Generally I'd play with the Cost Threshold for Parallelism setting (sketched below) before messing around too much with MAXDOP.
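
    That setting is also just sp_configure; something like this, where the value is only a common starting point rather than a recommendation for this particular server:

    -- Raise 'cost threshold for parallelism' from the default of 5 so that
    -- only genuinely expensive queries are considered for parallel plans.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'cost threshold for parallelism', 25;
    RECONFIGURE;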
