Sub query expressions vs. Joins.

Question

Post reply

Sub query expressions vs. Joins.

brian118

Hall of Fame

Points: 3506
More actions
July 21, 2010 at 9:14 pm

#225240

Comments posted to this topic are about the item Sub query expressions vs. Joins.
Regards
Brian Ellul
----------------------------------------------------------------------------------
Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination.
- Albert Einstein -

Viewing 15 posts - 1 through 15 (of 56 total)

You must be logged in to reply to this topic. Login to reply

miroslav 33997 SSC Rookie Points: 33 More actions · Answer 1

miroslav 33997

SSC Rookie

Points: 33

July 21, 2010 at 9:33 pm

#1196617

[x] Reside

tim.napier SSC Rookie Points: 37 More actions · Answer 2

What happens if you use a left join in the second of the two queries. The fact they both returned 40000 rows is by coincidence, cos the sub select would returns if their isn't an equi joiin

crackbridge SSC Enthusiast Points: 185 More actions · Answer 3

Its interesting your article, one thing its not clear enought to me is why do you open your article with a GROUP and COUNT issue but later you develop a SELECT JOIN for 1-m topic ?

The first, taking case of GROUP BY, it would be iteresting to know how it works if you want to calculate for each ManagerId, two clumns, COUNT of workers with SALARY>1500 and COUNT of workers with SALARY<=1500, not using a CASE, but using a SUBQUERY to count vs. any other approach.

Related to the analysis of the execution plant its very helpful your comments and the bibliography you have referenced.

One last advice, not to use classic SELF linked case of EMPLOYEE-MANAGER as data sample since its not normalized as two tables but one table with references to it self.

Thanks,

L.

paul-876346 SSC Rookie Points: 35 More actions · Answer 4

Thank you - I also wonder how the sub-query or join would compare to Cross (or Outer) Apply - as in the following query?

SELECT EmployeeID, Salary, RActive, CAM.ManagerName FROM Employees

CROSS APPLY (SELECT ManagerName FROM Managers WHERE Managers.ManagerID = Employees.ManagerID) CAM

WHERE Employees.ManagerID = 1

bayudw SSC Journeyman Points: 84 More actions · Answer 5

What about vice versa condition?

For example, I have two tables. First, a table for list of accounts. Second, a table for list of account's transactions.

I want to get the last transaction amount from second table for each of account in the first table.

How about that? Which one of those queries I might choose?

Thanks.

Arjun Sivadasan SSCertifiable Points: 5902 More actions · Answer 6

Brian, Thanks for the article. I have a question here. I am not gonna post the DDL scripts as I have used your sample tables.

Consider the following statement that you had posted:

Select COUNT(*),ManagerID

From Employees

Group By ManagerID

Order By ManagerID

Now say, I need a third column which shows all the subordinates (IDs in this case) of a manager as csv. Can this be done in a better way than this?

select

e.ManagerID,

COUNT(*) [No of Subs],

(select(stuff((select ',' + cast(EmployeeID as varchar)

from Employees where managerid = e.managerid

for xml path('')),1,1,''))) [SubIDs]

from Employees e

group By e.ManagerID

order By e.ManagerID

I know the requirement (showing IDs as csv) is a little weird but I had to get similar output in some reports. Please comment.

- arjun

https://sqlroadie.com/

Arjun Sivadasan SSCertifiable Points: 5902 More actions · Answer 7

bayudw (7/22/2010)
What about vice versa condition?
For example, I have two tables. First, a table for list of accounts. Second, a table for list of account's transactions.
I want to get the last transaction amount from second table for each of account in the first table.
How about that? Which one of those queries I might choose?
Thanks.

I have assumed this schema. AccountID and TransactionID should be primary keys. AccountID should be a foreign key in #Transaction. You can create non-clustered indexes on #Transaction.AccountID and #Transaction.TransactionDate. Alternately, you can also use TransactionID column for orderby clause of row_number.

You should have posted this as a separate thread. I can't see any relation to the topic.

--I have two tables. First, a table for list of accounts. Second, a table for list of accounts transactions.

--I want to get the last transaction amount from second table for each of account in the first table.

create table #Account(AccountID int identity(1,1), AccountName char(2))

create table #Transaction(TransactionID int identity(1,1), AccountID int, TransactionAmount float, TransactionDate datetime)

insert into #Account

select 'A1' union

select 'A2' union

select 'A3'

insert into #Transaction

select 1, 145.50, dateadd(mi,-10,getdate())

insert into #Transaction

select 1, 125.50, getdate()

insert into #Transaction

select 2, 245.50, dateadd(mi,-10,getdate())

insert into #Transaction

select 2, 225.50, dateadd(mi,-5,getdate())

insert into #Transaction

select 2, 205.50, getdate()

insert into #Transaction

select 3, 345.50, dateadd(mi,-10,getdate())

select * from #Account

select * from #Transaction

;with cte as

(

select *,

row_number() over(partition by AccountID order by TransactionDate desc) [RN]

from #Transaction

)

select * from cte where RN = 1

drop table #Account

drop table #Transaction

- arjun

https://sqlroadie.com/

jainraj_2000 Grasshopper Points: 15 More actions · Answer 8

The Script is not working in SQL SERVER 2005 the populate tables procedure is having an insert into managers with multiple values.

In order to run it correctly it should be changed to

INSERT INTO Managers (ManagerName)VALUES ('Manager1');

INSERT INTO Managers (ManagerName) VALUES('Manager2');

INSERT INTO Managers (ManagerName)VALUES('Manager3');

Regards

Rajendra

bayudw SSC Journeyman Points: 84 More actions · Answer 9

Thanks for your replay.

I use this query below :

Select A.AccountID, A.OpenDate,

'Amount'=(Select Top 1 TransactionAmount From Transaction Where AccountID=A.AccountID order by TransactionDate desc)

From Account A

or

any idea else ?

Arjun Sivadasan SSCertifiable Points: 5902 More actions · Answer 10

bayudw, i think the row_number() solution will be faster than the query you currently use. check it out and share the execution time/plan.

- arjun

https://sqlroadie.com/

Johannes Krackowizer Grasshopper Points: 19 More actions · Answer 11

There is a big flaw in the article, you take two different statements with the same result but only by accident. The subselect ist quite the same as an outer join. The second statement uses an inner join which is in this example the same because ther is no employe without manager. But if there would be an employe without manager the statments would have different results. so it depends on what you try to accomplish. If you have two tables that could be inner joined, the inner join will be better than a subselect. But if you have to do an outer join, than the subselect will be mostly equal to the join. A third scenario could bethat you need an outer join AND more than one column from the subselect (multiple subselects), than the outer join will be much better than the subselect.

So there is a big difference in which scenario you are using subselects. in some scenarios there will be virtually no difference between both approches.

And if you are using oracle databases than all of this is the exact opposite than in mssql. 😉 I tried this example on my oracle db and found out that an subselect has only half the costs than the join even if i do multiple subselects.

glen.Gorman Valued Member Points: 58 More actions · Answer 12

I was thinking something is missing from this article, and whats missing is the reasoning why joins are generally better than sub queries. The reasoning give you more knowledge of why and when you should use joins as opposed sub queries. It would be erroneous otherwise.

Having an Oracle background the answer comes to me quite fast. They are better because they can be used by the optimizing compiler to reduce the amount of data that the database has to go though to get the answer.

Thus if the solution requires that the database go though all of the data then a sub-query might be better because a join may force the use of an index, and in that case you are traversing the index and all of the data, such as joining a large and small data set in addition to requiring all of the data from the larger data set.

Mike Carey-249880 SSC Journeyman Points: 97 More actions · Answer 13

would like to see how these queries perform using CTE instead

Jeff Moden SSC Guru Points: 1003873 More actions · Answer 14

glen.Gorman (7/22/2010)
I was thinking something is missing from this article, and whats missing is the reasoning why joins are generally better than sub queries. The reasoning give you more knowledge of why and when you should use joins as opposed sub queries. It would be erroneous otherwise.
Having an Oracle background the answer comes to me quite fast. They are better because they can be used by the optimizing compiler to reduce the amount of data that the database has to go though to get the answer.
Thus if the solution requires that the database go though all of the data then a sub-query might be better because a join may force the use of an index, and in that case you are traversing the index and all of the data, such as joining a large and small data set in addition to requiring all of the data from the larger data set.

Maybe that's true in Oracle but it's not true in SQL Server. Oracle is mostly rule based and SQL Server is cost based. The other thing is that the Execution Plan in SQL Server lies like a rug when it comes to real cost and Grant Fritchey used one of my examples in his Apress book to demonstrate. You should only use the execution plan as a guide, not as gospel when it comes to performance. Here's an example similar to what Grant used in his book...

/****************************************************************************************

Purpose:

This code demonstrates that the estimated and actual execution plans in SQL Server can

be 100% INCORRECT and that the execution plan should only be relied on to provide hints

as to what may be wrong with a query rather than an absolute indication. This code runs

in SQL Server 2005 only.

The code creates a temp table for 10 years worth of dates starting with 2000-01-01 using

two different methods. The first method uses a recursive CTE and the second method uses

a "Tally" table. The output of each method is directed to a "throw-away" variable to

take display delays out of the picture.

Please check both the actual and estimated execution plans and compare the % of batch.

Please see the following article on how to build a Tally table and how they can be used

to replace certain While Loops.

http://www.sqlservercentral.com/articles/T-SQL/62867/

****************************************************************************************/

SET NOCOUNT ON

--=======================================================================================

-- Recursive method shown by (Name with-held)

--=======================================================================================

PRINT '========== Recursive method =========='

--===== Turn on some performance counters ===============================================

SET STATISTICS IO ON

SET STATISTICS TIME ON

DECLARE @Bitbucket DATETIME --Holds display output so display times aren't measured.

--===== Execute the code being tested ===================================================

DECLARE @DateVal DATETIME

SET @DateVal = '2000-01-01'

;with mycte as

(

select @DateVal AS DateVal

union all

select DateVal + 1

from mycte

where DateVal + 1 < DATEADD(yy, 10, @DateVal)

)

select @Bitbucket = d.dateval

from mycte d

OPTION (MAXRECURSION 0)

--===== Turn off the performance counters and print a separator =========================

SET STATISTICS TIME OFF

SET STATISTICS IO OFF

PRINT REPLICATE('=',90)

GO

--=======================================================================================

-- Tally table method by Jeff Moden

--=======================================================================================

PRINT '========== Tally table method =========='

--===== Turn on some performance counters ===============================================

SET STATISTICS IO ON

SET STATISTICS TIME ON

DECLARE @Bitbucket DATETIME --Holds display output so display times aren't measured.

--===== Execute the code being tested ===================================================

DECLARE @StartDate AS DATETIME

SET @StartDate = '2000-01-01'

SELECT TOP (DATEDIFF(dd,@StartDate,DATEADD(yy,10,@StartDate)))

@Bitbucket = @StartDate-1+t.N

FROM dbo.Tally t

ORDER BY N

--===== Turn off the performance counters and print a separator =========================

SET STATISTICS TIME OFF

SET STATISTICS IO OFF

PRINT REPLICATE('=',90)

GO

In the case of this article, a similar problem occurs simply because the author didn't test identically functioning code.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)