RE: complicated sql syntax - looking for explanation for working query

SSC Guru

Points: 1003875

May 4, 2014 at 9:59 pm

Lynn Pettis (5/4/2014)
Expect a much more detailed explanation from Jeff...

You know me all too well, ol' friend. 😀

Ok... here we go, KoldCoffee...

No insult intended… I'm going to start at the very bottom because I don't know what you do or do not know and I want to make sure that you understand, especially since you asked.

First, what does this do?

SELECT 1

UNION ALL

SELECT 1

;

That's right. It returns two rows with a "1" on each row.

-----------

1

(2 row(s) affected)

The reason why it returned 2 rows instead of just one unique "1" is because a UNION ALL was used instead of just a UNION which has an implicit "distinct" I it.

The problem is, there's no title on the column for us to refer to the data by. Let's fix that right now. We could add the column name in all sorts of ugly ways but let's treat the query as if it were a table by adding it to the FROM clause of a new SELECT, giving it an alias as if it were a table, and then externally declare the column name. Like this…

SELECT d.N

FROM (

SELECT 1

UNION ALL

SELECT 1

) d (N)

;

See the "d (N)" thingy? The "d" is an alias for the result set that we're treating as if it were a table and the "(N) defines what we want to call the column. Here's the result…

N

-----------

1

Now, fully functional queries that return their own result sets are called "derived tables" (that's why I used the "d" alias here… anything could have been used but "d" reminds me that it's a "derived table" for this demo) or "inline views". What's really cool as of 2005 is that we can move such things into a "top down" ordering to make them more readable, a little easier to use, and a bit more flexible for other things. We can move them into a structure known as a "Common Table Expression" or "CTE", which is also classified as a "derived table" and "inline view". We basically just have to move the query from the FROM clause to the CTE and then reference the CTE from the outer SELECT. Like this…

WITH

d (N) AS

(

SELECT 1

UNION ALL

SELECT 1

)

SELECT N

FROM d

;

The next thing to do is to add more rows. Let's start with just 10. We'll also use those rows in the outer query to start counting using ROW_NUMBER(). Like this…

WITH

d (N) AS

( --=== This returns 10 rows of all 1's

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1

)

SELECT N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -1

FROM d

;

Here's the result…

N

--------------------

0

1

2

3

4

5

6

7

8

9

A couple of things to notice here. Although ROW_NUMBER() requires an ORDER BY clause, we don't actually care what the ORDER BY is actually based on especially since all the rows from the "d" CTE have a "1" in them. So it's customary (and sometimes faster) to order by a constant. ROW_NUMBER() doesn't allow a constant in the ORDER BY so we trick it by embedding a SELECT statement that has the constant of NULL.

ROW_NUMBER() always starts at the number "1". If we want this sequence of 10 numbers to start at "0" instead of "1", one way to pull it off is to subtract 1 from every number that ROW_NUMBER() makes. That's what the purpose of the "-1" is (although I'd have done it in a different manner).

As a bit of a side bar, the 10 SELECT UNION ALLs can be replaced with a single VALUES function like that found in the code you posted on SQL SERVER 2008 and above. It doesn't make things any faster but it does cut down on the clutter in the query.

Time for another question.. What does this do?

SELECT 1

FROM (SELECT 1 UNION ALL SELECT 1) d1 (N)

CROSS JOIN (SELECT 1 UNION ALL SELECT 1) d2 (N)

;

If we look at the code, it has two "derived tables" in it and each "derived table" has two rows in it. We're doing a CROSS JOIN (also known as a "Cartesian Product") between the two "derived tables". That means for every row in one of the "derived tables", we'll return all of the rows of the other "derived table". Think of it as us multiplying the number of rows in the "derived tables" to come up with a "Product" or answer to the multiplication problem.

Since there are 2 rows in each of the "derived tables", our result set will contain 2*2 or 4 rows of "1's". Like this…

-----------

1

If we had 10 rows in each "derived table", we would have ended up with 10*10 or 100 rows each containing a "1".

Note the we can use the NON-ANSI version of a CROSS JOIN, which is nothing more than listing the 2 derived tables in the FROM clause separated only by a comma and no criteria. Like this…

SELECT 1

FROM (SELECT 1 UNION ALL SELECT 1) d1 (N),

(SELECT 1 UNION ALL SELECT 1) d2 (N)

;

Shifting gears a bit, the "engineering notation" for "10" is 1*10^1. The "engineering notation" for "100" is 1*10^2. The exponent tells how many times "10" has been multiplied by itself.

Let's apply all of that to our CTE so we can count to 100 using CTEs.

WITH

E1(N) AS

( --=== This returns 10 rows of all 1's

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1

),

E2(N) AS (SELECT 1 FROM E1 a, E1 b) --CROSS JOIN gives us 10*10 or 100 rows

SELECT N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -1

FROM E2

;

Here, we change the name of the first CTE to represent the power of 10 that it produces. The shorthand for "engineering notation" for 1*10^1 is 1E1. We just dropped the leading 1 because it's superfluous for us.

The neat thing about CTEs is that, unlike "derived tables" in a FROM clause, we can refer to them in other "derived tables". So, our second CTE (E2), creates another "derived table" by doing a CROSS JOIN on the first (E1) "derived table" (CTE) against itself and that will return 100 rows. The "a" and "b" are just different aliases that don't mean anything but are necessary or SQL Server will complain about referencing the same "derived table" more than once.

Another name for one CTE referring to another CTE is called "Cascading CTEs". You probably won't find that particular name in a book anywhere. We made up that name right here on SSC.

The final SELECT was modified to take its input from the E2 CTE and returns the values of 0 to 99.

What if we only wanted to count to 53 and didn't want to change the code for either the E1 or E2 CTEs? That's where the TOP function comes in. Like this…

WITH

E1(N) AS

( --=== This returns 10 rows of all 1's

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1

),

E2(N) AS (SELECT 1 FROM E1 a, E1 b) --Cross join gives us 10*10 or 100 rows

SELECT TOP (53) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -1

FROM E2

;

Which 53 numbers will it return? The ORDER BY of the ROW_NUMBER() takes care of that. It'll return the numbers 1 through 53. Since we're subtracting 1 from each number, the final result set will be 0 through 52, in this case. We could use a variable instead of a hard-coded "53" to make this cascading, counting CTE even more useful.

Now… what if we wanted a really big number like the number 6307204 that was used in the original code you posted? The answer is that we just need more CROSS JOINs in the form of additional cascading CTEs. The really cool part is that if we multiply the number of rows in E2 (which is 100 rows) times itself, we don't just multiply by 10. No, we're multiplying by 100 now. A CROSS JOIN on E2 will give use E4 or 100*100 which is the same as 1*10^4 or 10000. Why E4? If you add exponents (E2+E2), you're actually multiplying. E2 is 100 with 2 zeros after a 1. 10000 is a 1 with 4 zeros after it. We just doubled the number of zeros.

Once we have E4, it's simple to see how to get to E8, which will return E4*E4 or 10000*10000 or a 1 with 8 zeros after it. That’s 100,000,000 or a 100 MILLION!

Here's the code for that. Looks almost identical to the original code.

WITH

E1(N) AS

( --=== This returns 10 rows of all 1's

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1

),

E2(N) AS (SELECT 1 FROM E1 a, E1 b), --CROSS JOIN gives us 10*10 or 100 rows

E4(N) AS (SELECT 1 FROM E2 a, E2 b), --CROSS JOIN gives us 100*100 or 10000 rows

E8(N) AS (SELECT 1 FROM E4 a, E4 b) --CROSS JOIN gives us 10000*10000 or 100000000 rows

SELECT TOP (53) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -1

FROM E8

;

Notice that each new CTE refers to the previous CTE in the FROM clause and that the final SELECT refers to the last CTE.

What the ROW_NUMBER() thing is doing is that it is using the "presence of rows" from the last CTE as a "cursor" or "loop". It's not even using the "1's" from the CTE's, just the "presence of rows". This is what R. Barry Young refers to as a "pseudo cursor". It uses the very, very fast machine language "loops" that SQL Server does behind the scenes instead of an explicit loop. ROW_NUMBER() simply counts by "1's" for each row that it encounters.

Like I said, there are some optimizations that could be made here. For one (no pun intended) we can UNION ALL an explicit "0" to make counting faster by not having to subtract 1 for each count. For larger numbers, this can add up to quite a time savings.

Another performance improvement is to limit the number of "nested loops" behind the scenes. We can do that simply by doing more CROSS JOINs in the E2 CTE and cascading that effect "down". Notice that the E2 CTE has been updated to 4 CROSS JOINs and renamed to be E4.

Like this…

WITH

E1(N) AS

( --=== This returns 10 rows of all 1's

SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL

SELECT 1

),

E4(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d), -- 10*10*10*10 or 10000 rows

E8(N) AS (SELECT 1 FROM E4 a, E4 b) -- 10000*10000 or 100000000 rows

SELECT 0 UNION ALL --Removes the need to subtract 1 from every count

SELECT TOP (53-1) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM E8

;

Last but certainly not least, the cascading CTEs are "cascading" in appearance only. They all become part of the overall query. What that means is that they don't count up to 100 Million (in this case) and then apply the TOP (). The TOP is expressed across ALL of the CTE's until it's satisfied as to the number of rows. If the TOP() was only "3", then only the first 3 SELECT UNION ALLs would execute. E4 and E8 would be present but they would do "0" rows.

Lemme know if you have any other questions about this wonderful tool first created by Itzik Ben-Gan and improved over time not only by Itizik, but by others on this forum.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)