Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

SELECT, Deconstructed

Today let’s expand on the logical processing order of SELECT that I mentioned in last week’s N Things Worth Knowing About SELECT blog.

We’re looking at the SELECT statement clauses in the order that the SQL engine logically processes them…we’ll even write it that way – it’ll look weird, but we’ll be reading it like the SQL engine does.

You can find more on SELECT’s logical processing order in BOL, and in Itzik Ben-Gan’s T-SQL Fundamentals and T-SQL Programming books.

Here’s our basic T-SQL query (using AdventureWorks):

SELECT P.Name ,
  P.ProductNumber ,
  P.Color ,
  P.StandardCost ,
  SC.Name [Category]
FROM Production.Product P
LEFT OUTER JOIN Production.ProductSubCategory SC ON P.ProductSubCategoryID = SC.ProductSubCategoryID
WHERE P.ProductSubCategoryID IS NOT NULL

If we get all pseudo-mathematical on this, we can say that there’s some theoretical set of rows that we’ll get back out of this query; each logical step – in order – further refines that set of rows until we get back the actual rowset we want.

FROM

The SQL engine starts with the FROM clause first, to see where the data’s coming from. In this case we’re pulling from the Product table…easy. So, the theoretical rowset right now is everything in the Product table. (You can think about each step passing on that theoretical rowset to the next step for further refinement.)

-- Logical order:

FROM Production.Product P

ON/JOIN

Next we look at the ON and JOIN clauses: JOIN tells us what data structure (remember, it doesn’t have to be a table) we’re hooking up with, and ON give us the criteria for hooking our result sets together. We’re still building our theoretical resultset here; a JOIN can either expand the resultset, or limit it.

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC

Notice that the ON clause is evaluated before the WHERE clause; so we have to understand that where we put a limiting clause (like Color=’black’) can make a huge difference in our resultset. In this case, including the search condition “color=’black’” in the ON clause makes no difference, because it’s a LEFT OUTER JOIN…it’s going to return all the rows from the left side – black, red, blue, etc. – regardless of the join condition. Go on, try it:

SELECT P.Name ,
  P.ProductNumber ,
  P.Color ,
  P.StandardCost ,
  SC.Name [Category]
FROM Production.Product P
LEFT OUTER JOIN Production.ProductSubCategory SC ON P.ProductSubCategoryID = SC.ProductSubCategoryID
AND P.color='black'
WHERE P.ProductSubCategoryID IS NOT NULL

Toldya. In the ON clause, we’re comparing resultsets to make a match in a specific way. In the WHERE clause, we’re limiting the resultset that we got. Now, if this were an inner join, we could include the color=’black’ in the ON clause; it would limit the resultset just as if we’d said it IN WHERE. But I’m still more likely to put that condition in the WHERE clause; it’s more “proper”, in the sense that if that query ever needs to change to an outer join, the condition belongs in WHERE.

WHERE

Speaking of where…this is clearly the place where we’d want to whittle our resultset down to just the rows that we really want.

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC
WHERE P.ProductSubCategoryID IS NOT NULL AND Color='black'

In this statement, it’s the last refinement we make to our resultset, so next the SQL engine would evaluate the SELECT clause to see what columns to pull, and our logically-ordered query would look like this:

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC
WHERE P.ProductSubCategoryID IS NOT NULL AND Color='black'
SELECT P.Name ,
  P.ProductNumber ,
  P.Color ,
  P.StandardCost ,
  SC.Name [Category]

But, there’s a lot more we can do in a SELECT statement, so let’s continue with a more complex query that includes GROUP BY, HAVING, and ORDER BY:

SELECT P.Color ,
  COUNT(*) ItemCount,
  MIN(P.StandardCost) MinCost ,
  MAX(P.StandardCost) MaxCost ,
  AVG(P.StandardCost) AvgCost ,
  SC.Name [Category]
FROM Production.Product P
LEFT OUTER JOIN Production.ProductSubCategory SC ON P.ProductSubCategoryID = SC.ProductSubCategoryID
WHERE P.ProductSubCategoryID IS NOT NULL
AND P.color='black'
AND AVG(P.StandardCost) > 100
GROUP BY SC.Name, P.color
HAVING AvgCost > 100

GROUP BY

After SQL gets the base rowset down (by evaluating FROM, JOIN/ON, and WHERE), it looks at the GROUP clause to see if we’re going to lump some rows together. Notice, by the way, that we STILL haven’t looked at the actual SELECT clause:

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC
WHERE P.ProductSubCategoryID IS NOT NULL AND Color='black'
GROUP BY SC.Name, P.color

Today we want some aggregate data about each category, so we’ll GROUP BY the category name and color – when grouping, anything that’s not an aggregate (SUM, MIN, MAX, COUNT, etc) in the select list, MUST to be in the GROUP BY clause.

HAVING

A quick bit of trivia: the HAVING clause doesn’t require a GROUP BY clause; in that case, it just behaves like a WHERE. Oh, and speaking of WHERE…why couldn’t we just put the “AND AVG(P.StandardCost) > 100″ in the WHERE clause? Isn’t it a limiting factor, just like Color and subcategoryID?

Well, yeah it is, but remember: SQL evaluated WHERE, then GROUP BY (the aggregation clause), and THEN the HAVING clause. The aggregate isn’t allowed in the WHERE clause, because at that point, SQL hadn’t heard anything about grouping data…all the rows were still autonomous. So, aggregates go in HAVING:

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC
WHERE P.ProductSubCategoryID IS NOT NULL AND Color='black'
GROUP BY SC.Name, P.color
HAVING AvgCost > 100

SELECT

After HAVING, SQL finally takes a look at the SELECT clause itself, and our theoretical resultset becomes an actual resultset.

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC
WHERE P.ProductSubCategoryID IS NOT NULL AND Color='black'
GROUP BY SC.Name, P.color
HAVING AvgCost > 100
SELECT P.Color ,
COUNT(*) ItemCount,
MIN(P.StandardCost) MinCost ,
MAX(P.StandardCost) MaxCost ,
AVG(P.StandardCost) AvgCost ,
SC.Name [Category]

DISTINCT, ORDER BY, TOP

From here, SQL will process DISTINCT (if we included it), then ORDER BY (if we included it), then TOP (ditto). Let’s go ahead and include an ORDER BY:

-- Logical order:

FROM Production.Product P
ON P.ProductSubCategoryID = SC.ProductSubCategoryID
LEFT OUTER JOIN Production.ProductSubCategory SC
WHERE P.ProductSubCategoryID IS NOT NULL AND Color='black'
GROUP BY SC.Name, P.color
HAVING AvgCost > 100
SELECT P.Color ,
  COUNT(*) ItemCount,
  MIN(P.StandardCost) MinCost ,
  MAX(P.StandardCost) MaxCost ,
  AVG(P.StandardCost) AvgCost ,
  SC.Name [Category]
ORDER BY AvgCost

Here’s a big mystery revealed: THIS is why the ORDER BY clause is the ONLY clause that can refer to column aliases: ORDER BY is the only full clause that’s evaluated after the aliases are! Put it this way: we read the query top down, while SQL saves the top for nearly last. When you shout an alias name at it in your WHERE, SQL has NO IDEA what on earth you’re talking about.

And….there you go. Happy days,

Jen

http://www.MidnightDBA.com/Jen

Comments

Posted by Dr. Diana Dee on 4 September 2010

In your "more complex" query, because HAVING is evaluated before SELECT, column aliases such as AvgCost cannot be used.  You need:  HAVING AVG(p.StandardCost) > 100.  I know you knew that, because, as you said, ORDER BY is the only clause in which you can use column aliases.

Posted by amit.kmr on 5 September 2010

The same as Dr. Diana Dee said, how can be a column alias used with HAVING clause.

Posted by Charles Kincaid on 5 September 2010

I have seen explanations about JOIN processing and they have improved my query writing quite a bit.  On INNER JOIN moving the predicate to the JOIN out of the WHERE can produce astounding performance gains.

This post puts it all together in an easy to understand way.  Excellent job.  I can use this at work as a teaching tool.  Thanks.

Posted by Megistal on 7 September 2010

For more details why it is touchy having an AND clause inside a OUTER JOIN:

www.sqlskills.com/.../Determining-the-position-of-search-arguments-in-a-join.aspx

Posted by Jen McCown on 7 September 2010

Diana, Amit:

You're right! I noticed that after publishing. That'll teach me not to retest code between edit and publish :)

Posted by Jen McCown on 7 September 2010

Charles, thanks!  And good point...I have seen the same thing when using INNER JOINs.

Posted by gitmo on 7 September 2010

Charles Kincaid:

Can you give an example where " On INNER JOIN moving the predicate to the JOIN out of the WHERE can produce astounding performance gains."

I'd be interested in investigating that.

Posted by gitmo on 10 September 2010

Good article. It explains a lot of seemingly arbitrary rules around constructing SQL statements.

If ON is evaluated before the JOIN, how does it resolve table aliases in the ON statement.  They aren't defined until the JOIN statement.

Posted by helmut.schuster on 13 September 2010

When I read the article I was reminded of the syntax that LINQ for ADO.NET 3.5 uses.  Below is an example.  See the sequence of the FROM, WHERE, ORDERBY and SELECT clauses. The sequence is apparently required for Intellisense support.

var invoices = from invoice in invoiceList

              where invoice.InvoiceTotal > 20000

              orderby invoice.InvoiceTotal descending

              select invoice;

Posted by Jen on 17 September 2010

Gitmo,

Thanks for your comments. I'll get back to you on the performance gains (it's Friday!), but you sparked my interest in the ON/JOIN precedence issue.  I've answered that question in a new blog: "SELECT: If ON preceeds JOIN, then how…" (www.midnightdba.com/.../select-if-on-preceeds-join-then-how)

Leave a Comment

Please register or log in to leave a comment.