This is the first article in a new stairway series that will explore more advanced features of Transact SQL (TSQL). This stairway will contain a series of articles that will expand on the TSQL foundation that you learned in the prior two TSQL stairways:
This "Advanced Transact SQL" stairway will cover the following TSQL topics:
- Using the CROSS JOIN operator
- Using the APPLY Operator
- Understanding Common Table Expressions (CTE's)
- Record Level Processing using Transact-SQL Cursors
- Turning data on its Side by using PIVOT
- Turning Columns into Rows using UNPIVOT
- Ordering your Data Using Ranking Functions
- Managing Dates and Times with Functions
- Understanding the Variations of the OVER Clause
Readers of this stairway should already have a good understanding of how to query, update, insert and delete data from SQL Server tables. Additionally they should have a working knowledge of methods that can be used to control the flow of their TSQL code, as well has being able to test and manipulate data.
This stairway should help readers prepare for passing the Microsoft Certification exam 70-461: Querying Microsoft SQL Server 2012.
For the first installment of this new stairway series I will be discussing the CROSS JOIN operator.
Introduction to CROSS JOIN operator
The CROSS JOIN operator can be used to combine all the records in one data set to all the records in another data set. By using the CROSS JOIN operator between two sets of records you are creating what is known as a Cartesian product.
Here is a simple example of using CROSS JOIN operator to join two tables A and B:
SELECT * FROM A CROSS JOIN B
Note that when using a CROSS JOIN operator there is no join clause connecting the two tables like you would use when performing an INNER and OUTER JOIN operation between two tables.
You need to be aware that using the CROSS JOIN can produce a large record set. To explore this behavior let's look at two different examples of how big the resulting set will be from a CROSS JOIN operation. For the first example suppose you are cross joining two tables where table A has 10 rows and table B has 3 rows. The resulti set of a CROSS JOIN will be 10 times 3 or 30 rows. For the second example assume that table A has 10 million rows and table B has 3 million rows. How many rows will be in a CROSS JOIN result set between table A and B? That would be a whopping 30,000,000,000,000 rows. That is a lot of rows and it will take SQL Server a lot of time and a lot of resources to create that result set. Therefore you need to be careful when using the CROSS JOIN operator on large records sets.
Let's take a little closer look at using the CROSS JOIN operator by exploring a couple of examples.
Basic Example of Using CROSS JOIN
For the first couple of examples we will be joining two sample tables. The code in Listing 1 will be used to create these two sample tables. Make sure you run these scripts in a user data database and not in master.
CREATE TABLE Product (ID int, ProductName varchar(100), Cost money); CREATE TABLE SalesItem (ID int, SalesDate datetime, ProductID int, Qty int, TotalSalesAmt money); INSERT INTO Product VALUES (1,'Widget',21.99), (2,'Thingamajig',5.38), (3,'Watchamacallit',1.96); INSERT INTO SalesItem VALUES (1,'2014-10-1',1,1,21.99), (2,'2014-10-2',3,1,1.96), (3,'2014-10-3',3,10,19.60), (4,'2014-10-3',1,2,43.98), (5,'2014-10-3',1,2,43.98);
Listing 1: Sample tables for CROSS JOIN
For the first CROSS JOIN example, I will run the code in Listing 2.
SELECT * FROM Product CROSS JOIN SalesItem;
Listing 2: Simple CROSS JOIN Example
When I run the code in Listing 2 in a SQL Server Management Studio window, with my session setup to output the results in text I get the output in Report 1:
ID ProductName Cost ID SalesDate ProductID Qty TotalSalesAmt --- --------------------- -------- ---- ----------------------- --------- ---- --------------- 1 Widget 21.99 1 2014-10-01 00:00:00.000 1 1 21.99 1 Widget 21.99 2 2014-10-02 00:00:00.000 3 1 1.96 1 Widget 21.99 3 2014-10-03 00:00:00.000 3 10 19.60 1 Widget 21.99 4 2014-10-03 00:00:00.000 1 2 43.98 1 Widget 21.99 5 2014-10-03 00:00:00.000 1 2 43.98 2 Thingamajig 5.38 1 2014-10-01 00:00:00.000 1 1 21.99 2 Thingamajig 5.38 2 2014-10-02 00:00:00.000 3 1 1.96 2 Thingamajig 5.38 3 2014-10-03 00:00:00.000 3 10 19.60 2 Thingamajig 5.38 4 2014-10-03 00:00:00.000 1 2 43.98 2 Thingamajig 5.38 5 2014-10-03 00:00:00.000 1 2 43.98 3 Watchamacallit 1.96 1 2014-10-01 00:00:00.000 1 1 21.99 3 Watchamacallit 1.96 2 2014-10-02 00:00:00.000 3 1 1.96 3 Watchamacallit 1.96 3 2014-10-03 00:00:00.000 3 10 19.60 3 Watchamacallit 1.96 4 2014-10-03 00:00:00.000 1 2 43.98 3 Watchamacallit 1.96 5 2014-10-03 00:00:00.000 1 2 43.98
Report 1: Results when running Listing 2
If you review the results in Report 1 you can see that there are 15 different records. These first 5 records contain the column values from the first row of the Product table joined with the 5 different rows in the SalesItem table. The same is true for the 2 second and 3 row of the Product table. The total number of rows returned is the number of row in the Product table times the number of rows in the SalesItem table, which is 15 rows.
One reason why creating a Cartesian product might be useful is to generate test data. Suppose I wanted to generate a number of different Products using the date in my Product and SalesItem table. I could do that using a CROSS JOIN, as I have done in Listing 3:
SELECT ROW_NUMBER() OVER(ORDER BY ProductName DESC) AS ID, Product.ProductName + CAST(SalesItem.ID as varchar(2)) AS ProductName, (Product.Cost / SalesItem.ID) * 100 AS Cost FROM Product CROSS JOIN SalesItem;
Listing 3: Simple CROSS JOIN Example
When I run the code in Listing 3 I get the output in Report 2.
ID ProductName Cost ----- ----------------------------------------------------------- --------------------- 1 Widget1 2199.00 2 Widget2 1099.50 3 Widget3 733.00 4 Widget4 549.75 5 Widget5 439.80 6 Watchamacallit1 196.00 7 Watchamacallit2 98.00 8 Watchamacallit3 65.33 9 Watchamacallit4 49.00 10 Watchamacallit5 39.20 11 Thingamajig1 538.00 12 Thingamajig2 269.00 13 Thingamajig3 179.33 14 Thingamajig4 134.50 15 Thingamajig5 107.60
Report 2: Results when running Listing 3
As you can see by reviewing my code in Listing 3 I generated a number of rows which contain data similar to the data in my Product table. By using the ROW_NUMBER function I was able to generate a unique ID column on every row. Additionally I used the ID column from my SalesItem table to create the unique ProductName, and Cost column values. The number of rows produced are equal to the number of rows in the Product table times the number of rows in the SalesItem table.
The examples in the section so far have only performed a CROSS JOIN across two tables. You can use the CROSS JOIN operator to perform a CROSS JOIN operation across multiple tables. The example in Listing 4 creates a Cartesian product across three tables.
SELECT * FROM sys.tables CROSS JOIN sys.objects CROSS JOIN sys.sysusers;
Listing 4: Using CROSS JOIN operator to create a Cartesian product of three tables
The output from running Listing 4 has two different CROSS_JOIN operations. The Cartesian product created from this code will produce a result set that will have a total row count equal to the number of rows in sys.tables times the number of rows in sys.objects times the number of rows in sys.sysusers.
When a CROSS JOIN performs like an INNER JOIN
In the previous section I mentioned that when you use a CROSS JOIN operator it will produce a Cartesian product. This is not true all the time. When you use a WHERE clause that constrains the joining of the tables involved in a CROSS JOIN operation SQL Server doesn't create a Cartesian product. Instead it functions like a normal JOIN operation. To demonstrate this behavior, review the code in Listing 5.
SELECT * FROM Product P CROSS JOIN SalesItem S WHERE P.ID = S.ProductID; SELECT * FROM Product P INNER JOIN SalesItem S ON P.ID = S.ProductID;
Listing 5: Two SELECT statements that are equivalent.
The code in Listing 5 contains two SELECT statements. The first SELECT statement uses the CROSS JOIN operator and then uses a WHERE clause to defined how to join the two tables involved in the CROSS JOIN operation. The second SELECT statement uses a normal INNER JOIN operator with an ON clause to join the two tables. SQL Server's Query optimizer is smart enough to know that the first SELECT statement in Listing 5 can be re-written as an INNER JOIN. The optimizer knows it can re-write the query when a CROSS JOIN operation is used in conjunction with a WHERE clause that provides a join predicate between the two tables involved in the CROSS JOIN. Therefore the SQL Server engine generates the same execution plan for both SELECT statements in Listing 5. When you do not provide a WHERE constraint SQL Server doesn't know how to join the two tables involving in a CROSS JOIN operation so it creates a Cartesian product between the two sets associated with the CROSS JOIN operation.
Using CROSS JOIN to find products not sold
The examples found in the previous sections were to help you understand the CROSS JOIN operator and how to use it. One of the powers of using the CROSS JOIN operator is to use it to help find items in one table that don't have matching records in another table. For example suppose I want to report on the total quantity and total sales amount for each ProductName in my Product table for each date any one of my product items was sold. Since in my example every ProductName is not sold every day there is a sale, my reporting requirements mean I need to show a quantity of 0 and total sales amount of $0 for those products that have not been sold on a given day. This is where the CROSS JOIN operator in conjunction with a LEFT OUTER JOIN operation will help me identify those Items that have not been sold for a given day. The code to meet these report requirements can be found in listing 6:
SELECT S1.SalesDate, ProductName , ISNULL(Sum(S2.Qty),0) AS TotalQty , ISNULL(SUM(S2.TotalSalesAmt),0) AS TotalSales FROM Product P CROSS JOIN ( SELECT DISTINCT SalesDate FROM SalesItem ) S1 LEFT OUTER JOIN SalesItem S2 ON P.ID = S2.ProductID AND S1.SalesDate = S2.SalesDate GROUP BY S1.SalesDate, P.ProductName ORDER BY S1.SalesDate;
Listing 6: Find Products not sold using CROSS JOIN
Let me walk you through this code. I create a subquery that selects all the distinct SalesDate values. This subquery gives me all the dates in which there was a sale. I then CROSS JOIN that with my Product table. This allows me to create a Cartesian product between each SalesDate and each Product row. The set returned from the CROSS JOIN will have every value I need in the final result set except the sum of the Qty and TotalSalesAmt for each product sold. To get those summary values I perform a LEFT OUTER JOIN against the SalesItem table joining it with the Cartesian product I created with the CROSS JOIN operation. I performed this join based on the ProductID and SalesDate column. By using the LEFT OUTER JOIN each row in my Cartesian product would be returned and if there was a matching SalesDate record for the ProductID and SalesDate, the Qty and TotalSalesAmt values would be associate with the appropriate rows. The last thing this query did was to use the GROUP BY clause to summarize the Qty and TotalSalesAmount based on SalesDate and ProductName.
The CROSS JOIN operator which produces a Cartesian product has some performance aspects to consider. Because the SQL Engine needs to join every row in one set with every row in another set the result set can be quite large. If I do a CROSS JOIN one table that has 1,000,000 rows with another table that has 100,000 rows then my result set will have 1,000,000 X 100,000 rows, or 100,000,000,000 rows. That is a big result set and it is going to take SQL Server a lot of time to create it.
The CROSS JOIN operator can be a great solution for identifying a result set across all the possible combinations of two sets, like all the sales for all customers for each month, even when for some months some customers have no sales. When using the CROSS JOIN operator, you should try to minimize the size of the sets being cross joined if you want to optimize performance. For example, suppose I have a table that contains sales data for the last 2 months. If I want to produce a report that shows the customers that didn't have any sales for a month, then the way to identify the number of days in a month can drastically change the performance of my query. To demonstrate this let me first create a set of sales records for 1,000 customers for a two month period. I will do this using the code in Listing 7.
CREATE TABLE Cust (Id int, CustName varchar(20)); CREATE TABLE Sales (Id int identity ,CustID int ,SaleDate date ,SalesAmt money); SET NOCOUNT ON; DECLARE @I int = 0; DECLARE @Date date; WHILE @I < 1000 BEGIN SET @I = @I + 1; SET @Date = DATEADD(mm, -2, '2014-11-01'); INSERT INTO Cust VALUES (@I, 'Customer #' + right(cast(@I+100000 as varchar(6)),5)); WHILE @Date < '2014-11-01' BEGIN IF @I%7 > 0 INSERT INTO Sales (CustID, SaleDate, SalesAmt) VALUES (@I, @Date, 10.00); SET @Date = DATEADD(DD, 1, @Date); END END
Listing 7: TSQL to create sample data for performance test
The code in Listing 7 creates 2 months' worth of data for 1,000 different customers. This code adds no sales data for every 7th customer. This code produces 1,000 Cust table records and 52,338 Sales table records.
To demonstrate how using the CROSS JOIN operator performs differently depending on the size of the sets being used in the CROSS JOIN input sets let me run the code in Listing 8 and Listing 9. For each test I will record the time it takes to return the results.
SELECT CONVERT(CHAR(6),S1.SaleDate,112) AS SalesMonth, C.CustName, ISNULL(SUM(S2.SalesAmt),0) AS TotalSales FROM Cust C CROSS JOIN ( SELECT SaleDate FROM Sales ) AS S1 LEFT OUTER JOIN Sales S2 ON C.ID = S2.CustID AND S1.SaleDate = S2.SaleDate GROUP BY CONVERT(CHAR(6),S1.SaleDate,112),C.CustName HAVING ISNULL(SUM(S2.SalesAmt),0) = 0 ORDER BY CONVERT(CHAR(6),S1.SaleDate,112),C.CustName
Listing 8: CROSS JOIN against all Sales records
SELECT CONVERT(CHAR(6),S1.SaleDate,112) AS SalesMonth, C.CustName, ISNULL(SUM(S2.SalesAmt),0) AS TotalSales FROM Cust C CROSS JOIN ( SELECT DISTINCT SaleDate FROM Sales ) AS S1 LEFT OUTER JOIN Sales S2 ON C.ID = S2.CustID AND S1.SaleDate = S2.SaleDate GROUP BY CONVERT(CHAR(6),S1.SaleDate,112),C.CustName HAVING ISNULL(SUM(S2.SalesAmt),0) = 0 ORDER BY CONVERT(CHAR(6),S1.SaleDate,112),C.CustName
Listing 9: CROSS JOIN against a distinct list of Sales Dates
In Listing 8 the CROSS JOIN operator joins 1,000 Cust records with 52,338 Sales records to produce a record set of 52,338,000 rows, which is then used to determine the customers who had zero sales in a month. In Listing 9, I changed my selection criteria from my Sales table to only return a distinct set of SalesDate values. This distinct set only produces 61 different SalesDate values so the result of the CROSS JOIN operation in Listing 9 only produces 61,000 records. By reducing the result set of the CROSS JOIN operation my query in Listing 9 ran in less than 1 second, whereas the code in Listing 8 runs in 19 seconds on my machine. The main reason for this performance difference is the sheer volume of records SQL Server needs to process for the different operations performed for each query. If you look at the execution plans for both listings, you will see that the plans are slightly different. But if you look at the estimated number of records generated from the Nested Loops (Inner Join) operation, on the right side of the graphical plan, you will see that Listing 8 estimates 52,338,000 records, whereas the same operation in Listing 9 only estimates 61,000 records. This large record set that Listing 8's query plan generates from the CROSS JOIN Nested Loops operation is then passed through to several additional operations. Because all these operations in Listing 8 had to work against 52 million records. Listing 8 is substantially slower than Listing 9.
As you can see the number of records used in the CROSS JOIN operation can dramatically affect the length of time that a query runs. Therefore if you can write your query to minimize the number of records involved in the CROSS JOIN operation, your query will perform much more efficiently.
The CROSS JOIN operator produces a Cartesian product between two record sets. This operator is useful in helping identify items in one table that don't have matching records in another table. Care should be taken to minimize the size of the record sets used with the CROSS JOIN operator. By making sure the result set of the CROSS JOIN is as small as possible, you will ensure your code runs as quickly as possible.
Question and Answer
In this section you can review how well you have understood using the CROSS JOIN operator by answering the following questions.
The CROSS JOIN operator creates a results set by matching two record sets based on the columns specified in the ON clause. (TRUE or FALSE)?
Which formula can be used to identify the number of rows that will be returned from an unconstrained CROSS JOIN between two tables A and B, when table A and B contain duplicate rows?
- Number of rows in table A times the number of rows in table B
- Number of rows in table A times the number of unique rows in table B
- Number of unique rows in table A times the number of rows in table B
- Number of unique rows in table A times the number of unique rows in table B
Which approach provides the best chance at reducing the size of the Cartesian product produced by the CROSS JOIN operation?
- Make sure the two sets being joined have as many rows as possible
- Make sure the two sets being joined have as few as rows as possible
- Make sure the set to the left of the CROSS JOIN operation has as few rows as possible
- Make sure the set to the right of the CROSS JOIN operation has as few rows as possible
The correct answer is b. The CROSS JOIN operator does not use an ON clause to perform the CROSS JOIN operation. It joins each row in one table to every row in the other table. The CROSS JOIN creates a Cartesian product when it joins two sets.
The correct answer is a. b, c, and d are incorrect because if there are duplicate rows in table A or B each duplicate row is join when creating the Cartesian product for the CROSS JOIN operation.
The correct answer is b. By reducing the size of both sets involved in the CROSS JOIN operation minimizes the size of the final set created by the CROSS JOI operation. c and d also help reduce the size of the final set created by the CROSS JOIN operation, but are not as optimal as making sure both sets involved in the CROSS JOIN operation have the fewest rows possible.