Common Table Expressions

  • Tao:

    The clarity thing is totally a personal preference. To me, being able to look at the query at the end, especially in things like complex updates, and have that query be only a couple of lines, is more clear than even a well-formatted derived table. But that's for me. I can read either one, I just find the CTE easier. And, as you mention, that's mainly because I've been using them for a while now.

    The other point, using them for consistency, is where I think it really matters. If you use them at all, use them as exclusively as possible. Same as any other standard.

    fenomenales:

    In the example you give, no, there isn't a clear advantage that I'm aware of for the CTE over the temp table.

    As with derived tables, the real difference is when you want one that's recursive (like a hierarchy or bill of materials).

    Temp tables, like CTEs and unlike derived tables, can be referenced more than once in a query. Temp tables also have the advantage that they can be referenced in more than one query, unlike both CTEs and derived tables. That's significant in many cases.

    CTEs, like temp tables, table variables, and derived tables, all reside in memory till they get too big, then they get dumped into the tempdb. (I've tested all of these and they do. You should have seen the 200 Gig tempdb I ended up with from one runaway test. Which is why I do that testing on a desktop box instead of a production server.:) ) So there's no advantage of one over the other in that regard.

    Temp tables can also generate statistics, and can be indexed. Again, that can be an important advantage.

    Temp tables, on the other hand, aren't recursive. They also require more code to set up, and can easily result in procs that have to recompile every time they are run. (In either the case of mixing DDL and DML, or the case of running a cursor on a temp table, the proc will have to recompile every time it's called. This results in a compile lock for every run, which can result in serious delays for user queries.) CTEs have the advantage there.

    It's all a matter of knowing which tool to use for which job.

    Use a temp table if more than one query in a proc/script will reference the data. (A table variable works the same way, but table variable vs temp table is a whole discussion all by itself.) Use a CTE if it will be recursive, or only referenced once. Use a CTE to populate a temp table if you need both recursion and multiple references. And so on.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Tao Klerks (3/18/2008)


    Thanks GSquared - this makes sense to me.

    I don't think I agree about the clarity of using CTEs for derived tables, but I guess that might be because I still live in a SQL Server 2000 world (with very strict formatting standards).

    I think I'd rather have:

    SELECT Column1

    FROM (

    SELECT Column1

    FROM (

    SELECT Column1, Column2

    FROM SomeTable

    WHERE Column2 = 3

    ) AS Derived1

    WHERE Column1 = 2

    ) AS Derived2

    WHERE 1 = 1

    than

    WITH Derived1 AS (

    SELECT Column1, Column2

    FROM SomeTable

    WHERE Column2 = 4

    ),

    Derived2 AS (

    SELECT Column1

    FROM Derived1

    WHERE Column1 = 3

    )

    SELECT Column1

    FROM Derived2

    WHERE 1 = 1

    but if I understand correctly that is really a matter of taste / formatting rather than any logical difference (and I just checked the query plan for good measure :))

    I know they're just examples, but a simple query would solve the both.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • For me a CTE is a lot easier to follow. And the CTE is reusable within the same statement, where a derived table is not. (I love CTE btw) I also have seen indexes being used in a CTE when they were not used in a derived table (but that could have been a side effect of rewriting the query better I suppose?).

    Changing the simple example given to use a CTE shows an identical execution plan for me.

    ;with Data as (

    SELECT dept_id, count(*) AS n

    FROM employee

    GROUP BY dept_id

    )

    select top 1 dept_id, n

    from Data

    order by n DESC

    And one little (overlooked?) thing about a multiple CTE, if a resultset is not referenced it appears to not be ran (according to the execution plan). Could I get some feedback on this?

    Example, with this I only see TableB being shown in the execution plan :

    ;with

    DataA as ( select a from TableA )

    ,DataB as ( select b from TableB )

    ,DataC as ( select c from TableC )

    select

    b

    from DataB

    Based on what I've coded, I would strongly urge the move to CTE instead of derived tables. 🙂

    Tao Klerks (3/18/2008)


    I see the value of CTEs, but I'm not sure I agree with the example... wouldn't it be better written (more efficient?) using a single aggregate query and a TOP clause?

    SELECT TOP 1 dept_id, n

    FROM (

    SELECT dept_id, count(*) AS n

    FROM employee

    GROUP BY dept_id

    ) AS a

    ORDER BY n DESC

    I guess the question for me is: while CTEs provide a nice syntax for repeated and recursive expressions, is the SQL optimizer actually able to use them as efficiently as a statement designed to avoid repetitive expressions in the first place?

    I tried this on a very simple dataset (comparing all three statements), and found that the total cost of the queries, in all three cases, was the same. The query plans were slightly more complicated for the original statement and CTE statement, and slightly simpler for the statement above (single Sort/TopN Sort instead of Sort -> Filter -> Top) - unfortunately I don't have a large dataset to test on!

    My instinct would be: Try to rewrite your statement to use joins and derived tables rather than subqueries, before moving to CTEs as a way of simplifying your subqueries.

    Does anyone know better one way or the other?

  • Excellent article!

    Will put to use immediately!

    BTW, in the sentence "I have found them particularly handy when cleaning up demoralized tables. ",

    did you mean "I have found them particularly handy when cleaning up denormalized tables."?

  • Robert (3/18/2008)


    As you can see CTEs can be very useful. I have found them particularly handy when cleaning up demoralized tables

    What are "demoralized tables"? I didn't know they can have such character. 😉

    Could it have been a Freudian slip ? 😛

  • I have to agree with GSquared and Jeff. I like CTE's as well. when you have a complex query that uses a derived table, it makes the code cleaner by putting the derived table in a CTE (especially true when the derived table is also complex).

    I have used CTE's to help me build queries fro the bottom up. I know I need a "derived table", so I build that query, test it to be sure I am getting what I need. Turn that into a CTE, then build up from there.

    By building a complex query this way, I can incrementally reach my destination. Then its a matter of tuning the query from there, and yes, I have had to make changes, but at least I have a query that returns the correct result set to compare to as I tune the query. Helps reduce errors.

    😎

  • I guess I am just looking for a better example. I see this problem and it makes me think the best solution is:

    SELECT Top 1 dept_id, count(*) AS n

    FROM employee

    GROUP BY dept_id

    ORDER BY count(*) DESC

  • Is it possible to put an index on the CTE? I ask because we're moving to SQL Server 2005 in 4Q 08, and I have more than a few sp that reference and self-link a temp table multiple times that we currently index. (Moved from derived table to reduce processing time)

    The most recent instance for which we used this solution was in identifying all ER visits for members who had 2 visits within 90 days with a particular diagnosis range. While we could do this with a CTE, I'm not clear whether it would be able to do the required links efficiently without using a temp table with index.

    SELECT e.memberid

    , e.firstservicedate AS DOS

    , e.providerid AS Prov

    FROM (SELECT DISTINCT e1.memberid

    FROM #tempERSubsetDiab e1 JOIN #tempERSubsetDiab AS e2

    ON e1.memberid=e2.memberid

    AND (e2.firstservicedate BETWEEN e1.firstservicedate+1 AND e1.firstservicedate+90

    or (e1.firstservicedate=e2.firstservicedate AND e1.providerid>e2.providerid))) t

    JOIN #tempERSubsetDiab e

    ON t.memberid=e.memberid

    #tempERSubsetDiab contains is indexed on memberid, resulting in time savings relative to the derived table method.


    Puto me cogitare, ergo puto me esse.
    I think that I think, therefore I think that I am.

  • You can't index a CTE, just like you can't index a derived table. It should, however, use the appropriate index(es) of the table(s) referenced in the CTE.

    😎

  • Like others, I love CTEs and use them a lot as an alternative to derived tables. As stated, CTEs are great when the dataset is needed more than once within the same query. But they are also great when the dataset is needed only once for the reasons stated in other posts.

    I think I can give a more concrete reason for why CTEs might give cleaner/easier to read code (once you get used to the syntax). I believe it is a very good practice to proceed column names with a table alias. It helps to know where a column is coming from, but it is not necessary to repeat the entire table name. A good alias is cleaner. For example, I like this code:

    [font="Courier New"]SELECT mt.Column1, ot.Column2

    FROM MyTable mt

    JOIN OtherTable ot ON mt.JoinCol = ot.JoinCol[/font]

    When doing a derived table, you can give the table a name, but if you give it a fully expressive name and you want to clarify where your columns come from, you have to use a lot of text that is not so helpful in my opinion. When there are a lot of columns from a derived table, the reference to the table gets in the way. Here's just a simple example to illustrate basically what I am talking about:

    [font="Courier New"]SELECT bt.Column6, tblSummaryTable.SumCol

    FROM --this derived table gives us the summary data we are looking for.

    --table name is very clear, but is long and has to be used everywhere.

    (SELECT ft.Column3, max(ft.Column4) AS SumCol

    FROM FunTable ft

    GROUP BY ft.Column3) AS tblSummaryTable

    JOIN BoringTable bt ON tblSummaryTable.Column3 = bt.Column3[/font]

    Yes, I could have given the derived table a shorter name, but it would not have been as descriptive. Now, here is the same query as a CTE. In this case, I get to use both a descriptive name and a good table alias.

    [font="Courier New"] --This CTE gives us the summary data we are looking for.

    --Nice long descriptive name tells us what CTE does.

    ;With cteSummaryTable (Column3, SumCol) AS (

    SELECT ft.Column3, max(ft.Column4) AS SumCol

    FROM FunTable ft

    GROUP BY ft.Column3

    )

    SELECT bt.Column6, st.SumCol

    FROM cteSummaryTable st --nice short, but helpful alias

    JOIN BoringTable bt ON st.Column3 = bt.Column3[/font]

    Note: this is a point about style in syntax when using a derived table compared to a CTE. It is not an argument to say that the above example is best done using a CTE vs a temporary table or correlated subquery, etc. That's a different discussion than this point.

  • I'll stick to indexed temp tables for self-links then. (The base table used in this process has around 300k rows)


    Puto me cogitare, ergo puto me esse.
    I think that I think, therefore I think that I am.

  • CTE's are not for every situation. Since you are already creating and populating one or more temp tables, using CTE's in those sprocs may not be appropriate. You would have to try it out and see.

    😎

  • srienstr (3/18/2008)


    I'll stick to indexed temp tables for self-links then. (The base table used in this process has around 300k rows)

    If the underlying tables are correctly indexed, the CTE or Derived Table will use it... creation of the index on the temp table may take longer than using the CTE or Derived table in those cases.

    In the absence of the correct indexes, the Temp table may blow the doors off the CTE or Derived table... or not... it depends. If you really want performance, try both and pick one.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/18/2008)


    If the underlying tables are correctly indexed, the CTE or Derived Table will use it... creation of the index on the temp table may take longer than using the CTE or Derived table in those cases.

    So a CTE or Derived Table will inherit the indices from the parent table(s)? It hadn't seemed to me that would be the case from what I understood about how those are constructed... Then again, I have a bit less SQL experience than you. 😀 I'll give it a test run.


    Puto me cogitare, ergo puto me esse.
    I think that I think, therefore I think that I am.

  • Both are nothing more than "in-line" views... views can use indexes just like any query can. Same goes for CTE's and Derived tables... "Have Index, Will Compute". 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 16 through 30 (of 41 total)

You must be logged in to reply to this topic. Login to reply