Introducing the MySQL common table expression

This article is part of Robert Sheldon's continuing series on Learning MySQL. To see all of the items in the series, click here.

As with many relational database management systems, MySQL provides a variety of methods for combining data in a data manipulation language (DML) statement. You can join multiple tables in a single query or add subqueries that pull data in from other tables. You can also access views and temporary tables from within a statement, often along with permanent tables.

MySQL also offers another valuable tool for working with data—the common table expression (CTE). A CTE is a named result set that you define in a WITH clause. The WITH clause is associated with a single DML statement but is created outside the statement. However, only that statement can access the result set.

In some cases, you can include a CTE with a SELECT statement that is embedded in another statement, as in the case of a subquery or a DELETE…SELECT statement. But even then, the WITH clause is defined outside of that SELECT statement, and only that SELECT statement can access the result set.

One way to think of a CTE is as a type of view with a very limited scope (one statement). Another way to think of a CTE is as a type of named subquery that is defined in a clause separate from the main query. However, a CTE is neither of these, and in this article, I explain how the CTE works and walk you through a number of examples that demonstrate the different ways you can use them to retrieve data.

Note: the examples use the tables and data that is created in the last section of the article titled: “Appendix: Preparing the demo objects and data”.

Getting started with common table expressions

A common table expression is defined inside a WITH clause. The clause precedes the main DML statement, which is sometimes referred to as the top-level statement. In addition, the clause can contain one or more CTE definitions, as shown in the following syntax:

If the WITH clause contains more than one CTE, you must separate them with commas, and you must assign a unique name to each CTE, although this applies only within the context of the WITH clause. For example, two SELECT statements can include CTEs with the same name because a CTE is limited to the scope of its associated top-level statement.

The CTE name is followed by one or more optional column names, then the AS keyword, and finally a SELECT query enclosed in parentheses. If you specify column names, their number must match the number of columns returned by the SELECT query. If you don’t specify column names, the column names returned by the SELECT query are used.

Common table expressions are typically used with SELECT statements. However, you can also use them with UPDATE and DELETE statements, following the same syntax as shown above. In addition, you can include CTEs with your subqueries when passing them into your outer statements. You can also use CTEs in statements that support the use of SELECT as part of the statement definition. For example, you can add a WITH clause to the SELECT query in an INSERT…SELECT statement or in a CREATE TABLE…SELECT statement.

For this article, I focus primarily on creating CTEs that use SELECT statements as their top-level statements because this is the most common way to use a CTE. This approach is also a good way to start learning about CTEs without putting any data at risk. You can then apply the fundamental principles you learn here to other types of statements as you become more comfortable with how CTEs work.

With that in mind, let’s start with a simple example. The following SELECT statement includes a WITH clause that defines a CTE named planes:

The SELECT query in the CTE retrieves all the planes in the airplanes table that have jet as the engine_type. The CTE’s result set is made up of the data returned by the SELECT query and can be accessed by the top-level SELECT statement.

The top-level statement retrieves the data directly from the CTE, similar to how a statement might retrieve data from a view. The main difference is that the view definition is persisted to the database and can be used by anyone with adequate privileges. The CTE, on the other hand, has a very limited scope and can be accessed only within the context of the top-level statement.

In this case, the top-level SELECT statement retrieves only the plane and max_weight columns from the CTE and orders the results by the max_weight column, in descending order. The following figure shows the results returned by the statement.

Of course, you can easily achieve the same results without the use of a CTE by querying the airplanes table directly:

However, I wanted to demonstrate the basic components that go into a CTE and how you can access that CTE from within the top-level SELECT statement. Both the CTE and top-level statement can certainly be much more complex—and usually are—but the principles remain the same.

Working with CTEs in the top-level SELECT statement

As mentioned earlier, a CTE is basically a named result set. When you query the CTE from within the top-level statement, the data is returned in a tabular format, similar to what you get when you query a view, permanent table, temporary table, or derived table (such as that produced by a subquery in a SELECT statement’s FROM clause). This means that you can work with the CTE in much the same way as you can those other object types. For example, one common approach to referencing a CTE within the top-level query is to join it with another table, as in the following example:

The WITH clause defines a single CTE named mfcs. The CTE’s SELECT query returns the manufacturer_id and manufacturer values from the manufacturers table. The top-level SELECT statement then joins the airplanes table to the mfcs CTE, based on the manufacturer_id column in each one. The following figure shows the statement’s results.

As this example demonstrates, you can treat the CTE much like any other table structure in your top-level query. However, just as you saw earlier, you can also recast this statement without a CTE by joining the airplanes table directly to the manufacturers table:

Because there is so little data, the difference in performance between the two statements is negligible, but that might not be the case with a much larger data set. However, as is often the case with MySQL, it can be difficult to know which approach is best without running both statements against a realistic data set. Even then there might not be a significant difference in performance, in which case, it might come down to developer preference.

As noted earlier, MySQL often supports multiple ways to achieve the same results, as the preceding examples demonstrate. Common table expressions can sometimes help simplify code and make it more readable, which is a big point in its favor, but performance should generally be the overriding consideration.

Comparing different approaches usually requires that you test them on a valid data set, in part, because it can be difficult to find specific recommendations when comparing approaches. Not many database developers, for example, would be willing to say that you should always use CTEs rather than inner joins in all circumstances, or vice versa.

That said, you might come across recommendations that are less rigid and are perhaps worth considering, such as when comparing CTEs with subqueries. For instance, a CTE is often considered to be a better option if your SELECT statement includes multiple subqueries retrieving the same data, as in the following example:

If this statement looks familiar to you, that’s because I pulled it from my previous article in this series, which covers subqueries. As you can see, the statement includes three subqueries, all of them the same, which can result in a fair amount of redundant processing effort, depending on how the database engine chooses to handle the query. The following figure shows the results returned by this statement.

Instead of using subqueries, you can achieve the same results by defining a CTE that retrieves the average max_weight value for each manufacturer. Then, in your top-level query, you can join the airplanes table to the CTE, basing the join on the manufacturer IDs, as shown in the following example:

In this case, the CTE specifies the column names to use for the result set, so the manufacturer_id values are returned as the id column. Additionally, the CTE groups the data in the manufacturers table by the manufacturer_id values and provides the average avg_weight value for each one.

The top-level query then joins the airplanes table to the CTE but limits the results to those airplanes with a max_weight value greater than the average weight returned by the CTE. Notice that, in place of the subqueries, the statement now uses the avg_weight column from the CTE.

Once again, the performance difference between these two approaches is negligible because we’re working with such a small data set. Only by running the statements against a more realist data set can we get a true picture of their performance differences. In my opinion, however, the CTE makes the code more readable, that is, it’s easier to follow the statement’s logic.

Defining multiple CTEs in one WITH clause

Up to this point, the examples in this article included only one CTE per WITH clause, but you can define multiple CTEs and reference any of them in your top-level statement. Just make sure that you assign different names to the CTEs and separate them with commas. For example, the following WITH clause defines three CTEs, which are all referenced in the top-level SELECT statement:

The three CTEs are similar except that they each pull data based on a different engine_type value. The top-level SELECT query then uses the UNION operator to join them altogether. (The UNION ALL operator combines the results from multiple SELECT statements into a single result set.) The following figures shows part of the results returned by this statement.

This is a fairly basic example, but it demonstrates the concept of defining multiple CTEs and referencing them in the top-level query. In this case, the three CTEs operate independently of each other, but you don’t always have to take this approach. For instance, the WITH clause in the following example also includes three CTEs, but in this case, the second CTE (mfc_avg) references the first CTE (mfcs), while the third CTE (pl_avg) stands alone:

As this example demonstrates, a CTE can reference a CTE that comes before it. However, this works in one direction only; a CTE cannot reference one that comes after it. In this case, the mfc_avg CTE joins the airplanes table to the mfcs CTE and groups the data based on the manufacturer_id value. The top-level query then retrieves data from this CTE, but returns only those rows with an avg_parking value greater than the average returned by the pl_avg CTE. The following figure shows the results returned by this statement.

Something worth emphasizing is that the WHERE clause in the top-level query includes a subquery that retrieves data from the pl_avg CTE. Not only does this point to the inherent flexibility of CTEs, but also to the fact that CTEs and subqueries are not mutually exclusive.

Working with recursive CTEs

One of the most useful aspects of the CTE is its ability to perform recursive queries. This type of CTE—known as the recursive CTE—is one that references itself within the CTE’s query. The WITH clause in a recursive CTE must include the RECURSIVE keyword, and the CTE’s query must include two parts that are separated by the UNION operator. The first (nonrecursive) part populates an initial row of data, and the second (recursive) part carries out the actual recursion based on that first row. Only the recursive part can refer to the CTE itself.

To help understand how this work, consider the following example, which generates a list of even numbers up to and including 20:

The name of the CTE is counter, and it returns only the val column. The nonrecursive part of the CTE’s query sets the value of the first row as 2, which is assigned to the val column. The recursive part of the query retrieves the data from the CTE but increments the val column by 2 with each iteration. The query continues to increment the val column by 2 as long as val is less than 20. The top-level SELECT statement then retrieves the data from the CTE, returning the results shown in the following figure.

Note: Since the recursive query says < 20, you might be inclined to think it would not return 20 in the output. But the iteration that returns 20 does occur, but it stops iterating because the value is not less than 20 anymore.

When building a recursive CTE, be aware that MySQL places several limitations on the recursive part. The second SELECT statement cannot contain aggregate functions, window functions, the DISTINCT keyword, or the GROUP BY or ORDER BY clause.

Performing a recursive query can be useful when working with hierarchical data. To demonstrate how this works, I used the following code to create and populate a table named airline_emps, which stores the IDs and job titles for a group of fictitious employees, along with the ID of the person each one reports to:

Everyone except for emp_id 1 (big boss) reports to another individual. For example, the overseer reports to the top exec, who in turn reports to the big boss. I included a SELECT statement along with the INSERT statement so you can verify the data after it’s been added to the table.

With this data in place, you can now create the following recursive CTE, which finds each person’s position level in the company and how that position fits into the reporting hierarchy:

Because this is a recursive CTE, it is separated in two parts that are connected with the UNION ALL operator. The nonrecursive part populates the first row, based on the NULL value in the reports_to column. This row is for the big boss, who is at the top of the hierarchy. The nonrecursive part also assigns the value 1 to the emp_tier column and assigns the emp_id value to the emp_path column, converting the value to the CHAR data type. The first row returned by the CTE looks similar to that shown in the following figure.

The recursive part of the CTE uses an inner join to match the airline_emps table to the emps CTE. The join is based on the reports_to column in the airline_emps table and the emp_id column in the emps CTE. The join condition makes it possible to recurse through each level of the reporting hierarchy, based on the reports_to value. The recursive part then increments the emp_tier column by 1 with each new level in the hierarchy.

For example, emp_id 2 (divisional boss) and emp_id 6 (top exec) both report directly to emp_id 1 (big boss), so the emp_tier column for these two rows is incremented by 1, resulting in a value of 2 for each row. This means that they’re both in the second tier of the employee hierarchy. The next layer in the hierarchy are those individuals who report to the divisional boss or top exec, so the emp_tier column for these rows is set to 3. This process continues until there are no tiers left.

During this process, the emp_path column is also updated in each row by concatenating the emp_id vales to provide a representation of the reporting hierarchy. For instance, the reports_to column for emp_id 9 will show that the organizer reports to emp_id 8 (team leader), who reports to emp_id 6 (top exec), who reports to emp_id 1 (big boss), with each layer separated by a forward slash. The following figure shows the data returned by the query.

The top-level SELECT statement retrieves data only from the CTE, without joining to any other tables. The statement also includes an ORDER BY clause that sorts the results first by the emp_tier column and then by the emp_id column.

Using CTEs with DML statements

Earlier in the article, I mentioned that you can use CTEs with statements other than SELECT. I also stated that my focus in this article was primarily on how the CTE is implemented with the SELECT statement. However, I want to show you at least one of the alternative forms so you get a sense of what that might look like (and to whet your appetite a bit).

The following example shows a CTE used with an UPDATE statement to modify the data in the airline_emps table created in the previous section:

The WITH clause and CTE work just like you saw in other examples. The clause includes a single CTE named rpts that retrieves the emp_id values for those employees who report to emp_id 8. The query returns the values 9 and 10.

The top-level UPDATE statement uses the data returned by the CTE to update the reports_to column to 7 for those two employees. The UPDATE statement’s WHERE clause includes a subquery that retrieves the data from the CTE, so the statement will update only those two rows.

After you run this update statement, you can rerun the SELECT statement from the previous section to verify the changes. I’ve included the statement here for your convenience:

The statement returns the results shown in the following figure.

Notice that employees 9 and 10 now show a reports_to value of 7. In addition, the emp_path value for each of the two rows has been updated to reflect the new reporting hierarchy.

Working with MySQL common table expressions

The CTE can be a powerful tool when querying and modifying data in your MySQL databases. Recursive CTEs can be particularly useful when working with self-referencing data, as the earlier examples demonstrated. But CTEs are not always intuitive, and you should have a good understanding of how they work before you start adding them to your current database code, especially if you want to use them to modify data. For this reason, I recommend that you also review the MySQL documentation on CTEs, along with any other resources you have available. The more time you invest upfront in learning about CTEs, the more effectively you’ll be able to utilize them in your MySQL queries.

Appendix: Preparing the demo objects and data

For the examples in this article, I created the travel database and added the manufacturers and airplanes to the database. If you want to try out these examples for yourself, start by running the following script on your MySQL instance:

After you’ve created the database, you can add sample data to the manufacturers table by running the following INSERT statement, which inserts seven rows:

I included a SELECT statement after the INSERT statement so you can confirm that the seven rows have been added to the table. The first row in the table has a manufacturer_id value of 101, and the subsequent rows are incremented by one. After you’ve populated the manufacturers table, you can run the following INSERT statement to add data to the airplanes table:

The INSERT statement uses the manufacturer_id values from the manufacturers table. These values provide the foreign key values needed for the manufacturer_id column in the airplanes table. In addition, the first row is assigned 1001 for the plane_id value, with the plane_id values for the other rows incremented accordingly. As with the previous INSERT statement, I’ve included a SELECT statement for confirming that the data has been properly added.