August 24, 2011 at 2:11 pm
Hi Techies,
I am an ETL developer, so every day I work with millions and millions of rows moving from source to target. One of the key things I look for to get good performance is a query that uses parallelism. The reason is that my production server is configured with 128 CPUs, and there is an amazing difference in execution time between an ETL run whose query uses parallelism and one whose query does not.
From my previous thread, I learned not to use row-level functions when expecting parallelism. In my recent code, I am facing a new situation where my query is not being picked for parallelism.
Note: there are no row-level functions and no views.
Sample code 1
-- Similar to the actual code; this query is NOT using parallelism
SELECT t1.Col1, t1.Col2, t3.Col3, t4.Col4
FROM table1 t1
JOIN table2 t2 ON t1.Col1 = t2.Col1
JOIN table3 t3 ON t2.Col2 = t3.Col1
JOIN table4 t4 ON t3.Col3 = t4.Col1
JOIN table5 t5 ON t4.Col4 = t5.Col1
WHERE t1.Col3 = @Variable
-- Just modified the SELECT clause and now the query IS using parallelism
-- (even though I didn't care about the output)
SELECT t1.* -- just modified the SELECT clause
FROM table1 t1
JOIN table2 t2 ON t1.Col1 = t2.Col1
JOIN table3 t3 ON t2.Col2 = t3.Col1
JOIN table4 t4 ON t3.Col3 = t4.Col1
JOIN table5 t5 ON t4.Col4 = t5.Col1
WHERE t1.Col3 = @Variable
Sample code 2
-- Similar to the actual code; this query is NOT using parallelism
SELECT t1.*
FROM table1 t1
JOIN table2 t2 ON t1.Col1 = t2.Col1
JOIN table3 t3 ON t2.Col2 = t3.Col1 AND t1.Col2 = t3.Col3
JOIN table4 t4 ON t3.Col3 = t4.Col1
JOIN table5 t5 ON t4.Col4 = t5.Col1
WHERE t1.Col3 = @Variable
-- Just removed the extra join condition and now the query IS using parallelism
-- (even though I didn't care about the output)
SELECT t1.* -- removed the additional join condition
FROM table1 t1
JOIN table2 t2 ON t1.Col1 = t2.Col1
JOIN table3 t3 ON t2.Col2 = t3.Col1 /*AND t1.Col2 = t3.Col3*/
JOIN table4 t4 ON t3.Col3 = t4.Col1
JOIN table5 t5 ON t4.Col4 = t5.Col1
WHERE t1.Col3 = @Variable
I am very curious to know the reason for this behavior! If all of the above issues are related to some common factor that I am missing, please give me a pointer to it.
I believe there are a few other things that can lead SQL Server not to use parallelism. In this post, I would like to build a consolidated list of them. I am sure most of you have come across many interesting situations and challenges related to parallelism. If possible, please share your challenges and the workarounds that made you win; it would be highly appreciated.
Here is my previous thread on "No Parallelism When Using a Row-Level Function": http://www.sqlservercentral.com/Forums/Topic1152095-360-1.aspx
August 24, 2011 at 3:53 pm
If I had to guess, the modifications you are making completely change the execution plans, and more importantly, the estimates.
I can't say for sure without seeing the actual execution plan plus your current cost threshold for parallelism setting.
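As a general sketch (not specific to the poster's server), the two server settings mentioned here can be inspected with sp_configure:

```sql
-- Show advanced options so the parallelism settings become visible
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Current cost threshold for parallelism (default is 5)
EXEC sp_configure 'cost threshold for parallelism';

-- Current server-wide max degree of parallelism (0 = use all available CPUs)
EXEC sp_configure 'max degree of parallelism';
```

Comparing the threshold value against the estimated subtree cost in the actual execution plan shows whether a query even qualifies for a parallel plan.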
August 25, 2011 at 12:36 am
Based on Remi's reply:
Scenario 1:
I'd expect table1 has a covering index on Col1 through Col4, but the table has more columns. Changing the select list to t1.* forces either a table scan or a bookmark (key) lookup to get the other values. This, in turn, makes the query more expensive and, depending on the threshold value, tips it into parallelism.
Scenario 2:
There seems to be a covering index on t3 that includes Col3 (probably as the leading column).
When the extra join condition is removed, a table scan on t3 is forced.
But that's just guessing based on the description... Actual execution plans will tell the truth 😀
Speaking of Parallelism:
If SQL Server "decides" not to use parallelism, it's usually because the query's estimated cost is below the threshold. If in reality the query takes significant time, it might be because of bad estimates (e.g. due to outdated statistics) or a threshold value that is set too high.
Edit: no need for parallelism is usually a sign of a well-designed query 🙂
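To illustrate the "outdated statistics" point, statistics on the tables involved could be refreshed before re-checking the plan (table names below are taken from the sample queries; FULLSCAN can be expensive on large tables):

```sql
-- Refresh statistics with a full scan so the optimizer's
-- row-count estimates are as accurate as possible
UPDATE STATISTICS table1 WITH FULLSCAN;
UPDATE STATISTICS table3 WITH FULLSCAN;

-- Or refresh statistics for all tables in the database
-- (uses the default sampling rate, can still take a while)
EXEC sp_updatestats;
```

If the estimated and actual row counts in the plan still diverge widely after this, the cause lies elsewhere (e.g. the local variable preventing accurate cardinality estimation).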
August 25, 2011 at 6:45 am
Parallelism is not something that simply occurs in SQL Server. It's based on two things: the cost threshold you have set for parallelism, and the estimated cost of the execution plan. By default the threshold value is 5. Personally, I usually change it to a much higher number, say 40. The cost is based on estimates of the cost of the operations inside an execution plan, accumulated for the entire plan. That cost is compared to the threshold, and if a query's cost exceeds the threshold, it might get a parallel plan. It's still not completely automatic; there are decisions made by the query engine that determine when and if you get parallel queries.
The issue is that the estimates are just that: estimates. They can be wrong. You can then end up with a parallel plan where the cost of splitting the work across multiple threads and then marshaling them back together outweighs the benefits conferred by having multiple threads operating on the data. There is no hard and fast rule for when parallelism is good and when it is bad, but as a very general rule of thumb, I would not want parallelism on queries that return fewer than tens of thousands of rows.
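The server-level change described here would look like the following (40 is the example value given above; any change should be tested before being applied to production):

```sql
-- Raise the cost threshold for parallelism from the default of 5 to 40,
-- so only plans with an estimated cost above 40 are considered for parallelism
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 40;
RECONFIGURE;
```

This is an instance-wide setting, so it affects every database on the server, not just the ETL workload.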
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
August 25, 2011 at 6:55 am
How did you come up with that 10K suggestion?
August 25, 2011 at 7:12 am
Ninja's_RGR'us (8/25/2011)
How did you come up with that 10K suggestion?
Basically, wet my finger, stick it in the air, see which direction the wind is blowing...
I've seen benefits with larger data sets, and almost none, ever, with smaller data sets. 10K is a reasonably large number for queries. That's about it. Like I said, a VERY general suggestion.
August 25, 2011 at 7:28 am
Grant Fritchey (8/25/2011)
Ninja's_RGR'us (8/25/2011)
How did you come up with that 10K suggestion?
Basically, wet my finger, stick it in the air, see which direction the wind is blowing...
I've seen benefits with larger data sets, and almost none, ever, with smaller data sets. 10K is a reasonably large number for queries. That's about it. Like I said, a VERY general suggestion.
Hence my term suggestion and NOT recommendation. 😀
I guess this applies especially if you have a dataset with 10K+ rows that gets summarized in some way for a report, even if the report returns far fewer than 10K rows?
August 25, 2011 at 7:32 am
Grant,
do you really refer to the number of rows being returned?
Example: a query returning 100K rows should benefit from parallelism but the very same query expanded by a TOP 1000 ORDER BY should not? (assuming the only change in the execution plan is a SORT and a FILTER added at the end)
August 25, 2011 at 7:36 am
LutzM (8/25/2011)
Grant, do you really refer to the number of rows being returned?
Example: a query returning 100K rows should benefit from parallelism but the very same query expanded by a TOP 1000 ORDER BY should not? (assuming the only change in the execution plan is a SORT and a FILTER added at the end)
True, if the query is manipulating 100K rows, it's manipulating 100K rows and might benefit from parallelism. Like I said, it's a general suggestion (thanks, Ninja). These are not hard and fast rules; please, please, please don't read too much into it. Big sets can benefit. I've never seen small sets benefit (not that it can't happen, just that I haven't seen it). Big and small are very hard to define. I picked, as a SWAG, 10K as the start of big.
August 25, 2011 at 7:48 am
Makes sense.
You basically used "returning less than x thousand rows" to mean processing that much data, not what ends up as the final result. Right?
August 25, 2011 at 7:53 am
LutzM (8/25/2011)
Makes sense.
You basically used "returning less than x thousand rows" to mean processing that much data, not what ends up as the final result. Right?
I was going to say something like "I think the best measure is the estimated cost".
Sometimes I just need to shut it :-D.
For me, I'll just leave it at OLAP = parallel, OLTP = MAXDOP 1.
Works for my system for now. 😉
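For a single OLTP-style query, that effect can also be had per query with a hint rather than server-wide. A sketch using the sample tables from the first post (the variable declaration is illustrative):

```sql
DECLARE @Variable INT = 42; -- illustrative filter value

-- Force a serial plan for this one query without changing server settings
SELECT t1.Col1, t1.Col2
FROM table1 t1
JOIN table2 t2 ON t1.Col1 = t2.Col1
WHERE t1.Col3 = @Variable
OPTION (MAXDOP 1);
```

The reverse (forcing more parallelism for a big ETL query) can be done the same way, e.g. OPTION (MAXDOP 8), without touching the instance-wide setting.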
August 25, 2011 at 8:17 am
Ninja's_RGR'us (8/25/2011)
LutzM (8/25/2011)
Makes sense.
You basically used "returning less than x thousand rows" to mean processing that much data, not what ends up as the final result. Right?
I was going to say something like "I think the best measure is the estimated cost".
Sometimes I just need to shut it :-D.
For me, I'll just leave it at OLAP = parallel, OLTP = MAXDOP 1.
Works for my system for now. 😉
Isn't it great to say "performance matters" in so many ways? 😀
August 25, 2011 at 8:23 am
It depends.
That's pretty much the only thing left to say atm!
:hehe:
August 25, 2011 at 9:10 am
Ninja's_RGR'us (8/25/2011)
It depends.
That's pretty much the only thing left to say atm!
:hehe:
+1