June 19, 2009 at 11:18 am
I'm trying to determine why a query with an aggregated value (SUM()) would take 6- 8 times longer to run when using a virtual table verses using a sub-query. Unfortunately due to the NDA we have with the software company whose system we use I can't post the actual DDL that is often accompanied with these kind of posts. What I have done is recreated (at least tried to recreate) the same scenario with the Northwinds sample DB.
Below are 2 variations on the same query. The first uses a Sub-query to calculate the number of orders for a customer while the second uses a Derived table to obtain the same value.
/*Sub-Query VERSION:*/
SELECT C.ContactName, C.CompanyName,
IsNull(CCD.CustomerTypeID,'N/A'),
(SELECT Sum(O1.OrderID)
FROM Customers C1 Join Orders O1 ON C1.CustomerID = O1.CustomerID
WHERE 1 = 1
AND C1.CustomerID = C.CustomerID) AS 'iNumOrders'
FROM dbo.Customers C Left Outer Join CustomerCustomerDemo CCD ON C.CustomerID = CCD.CustomerID
WHERE 1 = 1
AND C.ContactName = 'Mario Pontes'
/*Derived Table VERSION:*/
SELECT C.ContactName, C.CompanyName,
IsNull(CCD.CustomerTypeID,'N/A'),
ORD.iNumOrders
FROM dbo.Customers C Left Outer Join CustomerCustomerDemo CCD ON C.CustomerID = CCD.CustomerID
Join (SELECT O1.CustomerID AS 'CustomerID',
Sum(O1.OrderID) AS 'iNumOrders'
FROM Customers C1 Join Orders O1 ON C1.CustomerID = O1.CustomerID
WHERE 1 = 1
AND C1.ContactName = 'Mario Pontes'
GROUP BY O1.CustomerID
) ORD ON C.CustomerID = ORD.CustomerID
WHERE 1 = 1
AND C.ContactName = 'Mario Pontes'
Because the Northwinds DB is small the difference between the duration for these 2 queries is negligable.
In our live DB, which is greater then 100GB, this same aprpoach results in a much larger difference between these 2 queries. The sub-query version runs in 7-9 seconds while the Derived table version takes between 45 & 50 seconds to complete.
There may not be enough here to answer my specific situation but in general, isn't it true that when all conditions are equal, a sub-query version of a query should take longer to run? My take is the sub-query version executes the subquery once for every Customer listed in the WHERE clause. The Derived Table version executes the query only once, returning a row or record for every customer listed in the WHERE clause.
Thoughts?
Kindest Regards,
Just say No to Facebook!June 19, 2009 at 11:55 am
MORE ON THE SUB-QUERY vs DERIVED TABLE
Here are the results of STATISTICS IO for the 2 methods. I have changed the table names (except for Worktable) to be able to share this and still honor our NDA.
SUB-QUERY METHOD
----------------
Table 'ClientTranHistory'. Scan count 56758, logical reads 405520, physical reads 519, read-ahead reads 181,
Table 'Worktable'. Scan count 366, logical reads 785307, physical reads 0, read-ahead reads 0,
Table 'Client'. Scan count 1, logical reads 54590, physical reads 0, read-ahead reads 0,
Table 'ChargeType'. Scan count 183, logical reads 1095, physical reads 0, read-ahead reads 0,
Table 'Properties'. Scan count 3, logical reads 14, physical reads 0, read-ahead reads 0,
Table 'Attribute'. Scan count 1, logical reads 30, physical reads 0, read-ahead reads 0,
Table 'PropertyGroup'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0,
DERIVED TABLE METHOD
--------------------
Table 'Attribute'. Scan count 1, logical reads 30, physical reads 0, read-ahead reads 0,
Table 'PropertyGroup'. Scan count 2, logical reads 6, physical reads 0, read-ahead reads 0,
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0,
Table 'Properties'. Scan count 3, logical reads 13, physical reads 0, read-ahead reads 0,
Table 'ClientTranHistory'. Scan count 56758, logical reads 1542971, physical reads 967, read-ahead reads 2945,
Table 'Client'. Scan count 3, logical reads 1412, physical reads 0, read-ahead reads 152,
Table 'ChargeType'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0,
It's interesting to see that the Derived Table method has much larger values then the SUb-Query method for the ClientTranHistory table which is the table with the column that we are aggregating. It seems that the Derived Table does reduce the number of times the outer tables are acessed by increases by a factor of 4 or more the amount of access activity for the table whose data is being aggregated.
Thoughts?
Kindest Regards,
Just say No to Facebook!June 19, 2009 at 11:59 am
YSLGuru (6/19/2009)
There may not be enough here to answer my specific situation but in general, isn't it true that when all conditions are equal, a sub-query version of a query should take longer to run? My take is the sub-query version executes the subquery once for every Customer listed in the WHERE clause.
In general, no. That may have been true in a much earlier version of SQL (pre 7), but the current optimiser can convert one into the other fairly well in most cases. The cases where it can't is when the correlation expression (what joins a subquery with the outer query) is an inequality or when there's a TOP 1 in the subquery. In those two cases (and I believe only those cases), a correlated subquery runs once per row of the outer query.
Unrelated maybe, but puzzling. WHERE 1=1?
Can you maybe post the exec plan of the two queries?
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
June 19, 2009 at 12:06 pm
GilaMonster (6/19/2009)
YSLGuru (6/19/2009)
There may not be enough here to answer my specific situation but in general, isn't it true that when all conditions are equal, a sub-query version of a query should take longer to run? My take is the sub-query version executes the subquery once for every Customer listed in the WHERE clause.In general, no. That may have been true in a much earlier version of SQL (pre 7), but the current optimiser can convert one into the other fairly well in most cases. The cases where it can't is when the correlation expression (what joins a subquery with the outer query) is an inequality or when there's a TOP 1 in the subquery. In those two cases (and I believe only those cases), a correlated subquery runs once per row of the outer query.
Unrelated maybe, but puzzling. WHERE 1=1?
Can you maybe post the exec plan of the two queries?
If this difference is no longer ture then I wonder why the DERIVED TABLE method is not at least on par with the SUBQUERY method instead of being far slower.
The WHERE 1 = 1 bit is just a shortcut I picked up from another deveoper. Adding it to the WHERE caluse makes it easy to add/remove additional crietria. With the 1 = 1 condition the results aren't changed (assuming no other conditions) and when you want to add a condition the syntax is:
AND|OR (Some_Condition)
This consistent syntax allowd us to more eaisly use the automated features in or text editor we use.
Kindest Regards,
Just say No to Facebook!June 19, 2009 at 12:14 pm
YSLGuru (6/19/2009)
If this difference is no longer ture then I wonder why the DERIVED TABLE method is not at least on par with the SUBQUERY method instead of being far slower.
Without seeing the exec plan, it's really hard to say why. If you can't post them, take a look at the two plans, see what the differences are, see what indexes they're using.
Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability
June 19, 2009 at 12:17 pm
Unrelated maybe, but puzzling. WHERE 1=1?
Allows additional conditions to be added, commented, etc while writing the query. Looks like a testing method so you don't have to worry about hang or missing conditionals. Just my observation.
June 19, 2009 at 12:53 pm
Ran your two SQL statements and there are differences in both the statistics io and query plans. But the two SQL statements are not logical equivalent. The Sub-Query will always return a row for the customer even when there are no Orders for the customer but the Derived table will only return rows when the customer has orders. Changing to a "left outer join" to the derived table caused the IO and query plans to be identical.
Your two SQL statements may get some performance improvement with two changes:
1. The customer table is referenced twice in each of the two alternative SQL statement but only one reference is needed.
2. In the Derived table SQL, restrict condition "Customers.ContactName = 'Mario Pontes'" appears twice but only needs to appear once.
SELECT C.ContactName
, C.CompanyName
,COALESCE(CCD.CustomerTypeID,'N/A')
, (SELECT Sum(O1.OrderID)
FROM Orders O1
WHERE 1 = 1
AND O1.CustomerID = C.CustomerID
) AS 'iNumOrders'
FROM dbo.Customers C
Left Outer Join
CustomerCustomerDemo CCD
ON C.CustomerID = CCD.CustomerID
WHERE 1 = 1
AND C.ContactName = 'Mario Pontes'
SELECT Customers.ContactName
, Customers.CompanyName
,COALESCE(CCD.CustomerTypeID,'N/A')
,Orders.iNumOrders
FROM dbo.Customers
Left Outer Join
CustomerCustomerDemo CCD
ON Customers.CustomerID = CCD.CustomerID
Left Outer Join
(SELECT Orders.CustomerID
,Sum(Orders.OrderID) as iNumOrders
FROM Orders
group by Orders.CustomerID
) as Orders
on Orders.CustomerID = Customers.CustomerID
WHERE 1 = 1
AND Customers.ContactName = 'Mario Pontes'
SQL = Scarcely Qualifies as a Language
June 19, 2009 at 1:53 pm
Carl Federl (6/19/2009)
Ran your two SQL statements and there are differences in both the statistics io and query plans. But the two SQL statements are not logical equivalent. The Sub-Query will always return a row for the customer even when there are no Orders for the customer but the Derived table will only return rows when the customer has orders. Changing to a "left outer join" to the derived table caused ....[/code]
Yeah when I put this example together I was going more for conveying the issue without providing the actual T-SQL code we use because of our NDA (I hate that thing) and so a few things like the Left Outer Join vs Join got over looked but you are correct. The actual query I'm working on returns the same date on the left side regardless of whether there is data on the right side fo a Join in both the Subquery verison and the Derived table version.
Thanks for replying
Kindest Regards,
Just say No to Facebook!June 19, 2009 at 2:00 pm
UPDATE: More on this ....
Just as an FYI I have discovered thru testing that the differences between the 2 methods greatly lessens with each increase of the number of pieces of data included in the SELECT. My example has just 1 aggregation in it but when I add (in my real query and nothe sample from Adventureworks) more field aggregations to the query, the subquery version takes longer to run and results in more reads where as the Derived Table method has little to no noticeable change.
Perhaps the lesson here is that when it comes to Dervied Table vs Subquery, the Derived Table method is better over the larger number of fields aggregated where as the subquery method is better when looking at 1 or 2 fields aggregated??
I also asked Itzik Ben-Gan his 2 cents on this and in short the answer is it just depends. The number of variables are 2 extensive/great to have a general rule of which one of these is better. One thing that did come form him that I'm gonna test with is looking into use of the OVER Clause; it might add some kick since it's something new. At a minimum it will allow me to reduce the number of steps taken in the rpeort over all that uses this query I'm working on.
Thanks again to all for chimming in.
Kindest Regards,
Just say No to Facebook!Viewing 9 posts - 1 through 9 (of 9 total)
You must be logged in to reply to this topic. Login to reply