Jeff Moden - Friday, February 23, 2018 2:19 PM
I wasn't making a detailed performance claim; I was describing a single instance of an effect I observed. But if you want the code I used, simply extend the code already listed one more level, or, if you're too lazy to do that, there's this article ( http://www.sqlservercentral.com/articles/T-SQL/74118/ ) written by some guy which has this algorithm extended to 10e+8. Unfortunately that guy didn't post his results past 8,000 rows, so you'll have to run the test yourself to verify how it slows down.
As for that article you referred to, it actually agrees with my secondary point: The rCTE does get significantly worse as it scales.
However, it doesn't address my primary point at all, which is that the constraints of your environment affect how effective different methods are.
In fact, that article doesn't even mention at all the environment its tests were run in. That's pretty much requirement 1 of any performance test -- especially in the case where you're testing one method which depends primarily on CPU against another method which depends primarily on RAM availability and I/O speed.
So let's run an actual test of my original premise:
1. What am I testing: I'm testing whether, in a low-count scenario where a system is memory-constrained, a rCTE tally CTE may perform better than a classic (ie, persistent on-disk) tally table.
2. Test Environment: I'm running these tests on a Leno Y40-70 laptop with a 2GHz Intel Core I7- 4510 with 16GB RAM and a Seagate 2TB
ST2000LX001. The Operating System is Windows 10 Pro (version 1709 build 16299.192), and the SQL Server instance is Microsoft SQL Server 2017 (RTM) - 14.0.1000.169 (X64) Developer Edition.
3. Test Methodology. I will run two scripts, one selecting the sum of the numbers 1 - 10 twice from a rCTE and the other selecting the sum of the numbers 1-10 twice from a classic tally table. Each script will be run a total of 10 times, 5 times with no other constraints, and then 5 times with DBCC DROPCLEANBUFFERS prefixed, to simulate a memory-constrained environment. Specifically, DBCC DROPCLEANBUFFERS will remove the classic tally table data (among other things) from SQL Server's buffer pool, causing it to reload it from disk (or disk cache) on the next query that uses it. This simulates a case where the system does not have the tally table loaded due to memory constraints (or even simply lack of use). Performance will be measured with SET STATISTICS TIME ON; SET STATISTICS IO ON will also be set to provide illustrative data on I/O usage.
The Classic Tally Table was created using the following schema, and was populated with numbers from 1 to 133448,704
CREATE TABLE [dbo].[Tally](
[TallyID] [int] NOT NULL,
CONSTRAINT [PK_Tally] PRIMARY KEY CLUSTERED
(
[TallyID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
4. Controlling for other variables: I am the only user of this machine and the only user of the SQL Server instance on this machine, so other users will not have an effect. I cannot directly control for the variable resource usage of other software loaded on the machine; however the multiple runs should average out any variable effect caused by that. Any significant outliers from those runs will be noted and removed from final analysis (additional runs will be performed in the event of an outlier to ensure every test is run the same amount of times.)
Scripts:
1. rCTE script:
--rCTE Tally Table
--Comment out DBCC DROPCLEANBUFFERS for first 5 runs; uncomment for second five runs
DBCC DROPCLEANBUFFERS
GO
SET STATISTICS TIME ON
SET STATISTICS IO ON
GO
--Run A (first instance)
WITH cnumbers
AS (
SELECT 1 AS i
UNION ALL
SELECT i + 1
FROM cnumbers
WHERE i < 10
)
SELECT sum(i)
FROM cnumbers
OPTION (MAXRECURSION 0);
GO
--Run B (second instance)
WITH cnumbers
AS (
SELECT 1 AS i
UNION ALL
SELECT i + 1
FROM cnumbers
WHERE i < 10
)
SELECT sum(i)
FROM cnumbers
OPTION (MAXRECURSION 0);
GO
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
2. Classic Tally Table script:
--Classic Tally Table
--Comment out DBCC DROPCLEANBUFFERS for first 5 runs; uncomment for second five runs
DBCC DROPCLEANBUFFERS
GO
SET STATISTICS TIME ON
SET STATISTICS IO ON
GO
--Run A (first instance)
SELECT sum(TallyID)
FROM dbo.Tally
WHERE TallyID <= 10
GO
--Run B (second instance)
SELECT sum(TallyID)
FROM dbo.Tally
WHERE TallyID <= 10
GO
SET STATISTICS TIME OFF
SET STATISTICS IO OFF
Results (all numbers in millisecods):
No Constraints | Simulated RAM Constraint | |||||||
Run | rCTE A | rCTE B | Classic A | Classic B | rCTE A | rCTE B | Classic A | Classic B |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
Average | 0 | 0 | 0 | 0 | 0 | 0 | 2.8 | 0 |
Analysis:
The only queries which ever take any measurable amount of time in these scenarios are the Classic A queries under the RAM constrained scenario.
A quick look at STATISTICS IO shows the reason why:
Classic A: Table 'Tally'. Scan count 1, logical reads 4, physical reads 4...
Classic B: Table 'Tally'. Scan count 1, logical reads 4, physical reads 0...
The table is not in memory in the A instance, so it must be loaded from disk (or more likely in this case, disk cache). These 4 physical reads cost 2-3ms on this system.
For contrast, under all scenarios, the comparable rCTE STATISTICS IO were:
rCTE A: Table 'Worktable'. Scan count 2, logical reads 62, physical reads 0...
rCTE B: Table 'Worktable'. Scan count 2, logical reads 62, physical reads 0...
The table is never persisted to disk in this scenario, so there are no reads (or writes)
Notes:
I used SUM() rather than returning all rows merely to make reading the results easier. I had tested both methods, and switching to SUM() provided no measurable difference in execution times.
The only outlier found was run #3 of the Classic Tally Table A under the simulated RAM Constraint -- that took 131 milliseconds. This was most likely due to some background service happening to perform its own I/O. The last run (run 5) to replace the outlier was the only run to take less than 3ms (2ms).
While not part of the tested hypothesis, I did run the test a few more times using the numbers 1 - 1,000 instead of 1 - 10. As we all expected, the rCTE performed significantly worse than the Classic Tally Table in this case, averaging around 10ms per run as compared to 4ms per run in the Simulated RAM Constraint scenario.