The core of the T-SQL Decommenter is a recursive parser that extracts tokens piece by piece from the T-SQL. The previous article documented the T-SQL elements the Decommenter must handle and matching rules for them, as well as illustrating how the recursive query iterates through a demonstration T-SQL batch. This article explains the details of the code that uses the matching rules to decomment T-SQL.
We'll be walking through the
Decomment Common Table Expresion (CTE) in
Decommenter_Demo.sql. The first chunk is pretty straight forward - it sets up the initial iteration row as described in the previous article
Decomment AS ( SELECT ID, 0 AS Depth, '_' AS TokType, 0 AS Iter, 0 AS IsFinished, CAST(CASE WHEN OrigSQL IS NULL THEN NULL ELSE N'' END AS nvarchar(max)) AS NewSQL, CAST(N'' AS nvarchar(max)) AS ActiveSQL, CAST(OrigSQL AS nvarchar(max)) AS RemSQL, CAST(NULL AS varchar(1)) AS MatTokType, CAST(NULL AS int) AS DepthAdj, CAST(NULL AS bigint) AS MatchIndex, CAST(NULL AS int) AS MatchLen, CAST(NULL AS nvarchar(max)) AS ClosedSQL, OrigSQL, ValidSQL, MaxIter FROM SourceSQL
There a couple of things here to notice:
- The order of the columns is a little strange. The first 8 columns are those required for
Decommentto function. The next 5 columns are valuable for debugging. The final 3 are related to the test suite. They were placed in this order to simplify eliminating the additional columns when creating the production query.
- It uses
Testto identify which T-SQL batch is being decommented (the CTE can decomment multiple T-SQL batches at a time).
- The anchor row initializes
NewSQLto an empty string unless the inbound
NULL, in which cases it initializes
- The purpose of
MaxIteris to be able to preemptively stop recursion without relying on
OPTIONclause for the query. The two issues with
MAXRECURSIONare that it doesn't work with a variable and that when the query reaches the limit, iteration stops and the query fails. By having the logic of the query restrict recursion independently of
MAXRECURSION, it can preemptively stop decommenting the T-SQL and return partially decommented T-SQL. The partial decommenting is done carefully so that the decommenter either completely removes a comment or it returns the entire comment, thus ensuring that it doesn't negatively affect the uncommented portions of the T-SQL that lie beyond an incompletely removed comment due to the recursion limit halting iteration. In the production query,
@MaxIteris used to specify the recursion limit for the query. In the test suite, we need the ability to control the iteration limit on a per-test basis, thus the use of the
From this point on, the recursive CTE builds on the previous iteration to generate the next iteration. We'll skip over the
SELECT in the
UNION ALL and come back to it later after we've covered all the interesting calculations.
The first element in the
FROM clause is to reference
Decomment itself - this is what creates the recursive CTE. In addition, there's a predicate in the
WHERE clause at the bottom that refers to
Old - it ensures that once
1, we stop iterating.
FROM Decomment AS Old ... WHERE Old.IsFinished = 0
The next part of the
FROM clause calculates the length of
Old.RemSQL including trailing spaces.
LEN doesn't include trailing spaces, so
DATALENGTH is put to use, along with dividing by two since the T-SQL is
nvarchar. I really like using
CROSS APPLY to break up complicated calculations into understandable chunks that can be reused in subsequent calculations. When using
CROSS APPLY for this purpose, I put the
SELECT on the same line with the
CROSS APPLY to indicate that this
CROSS APPLY is solely doing calculations and that it does not access additional tables or affect the cardinality of the result.
CROSS APPLY ( SELECT ISNULL(DATALENGTH(Old.RemSQL), 0) / 2 AS RemLen ) AS RLen
TokTypesCTE for the previous record's
JOIN TokTypes AS OldTokType ON Old.TokType = OldTokType.TokType
Next, the query identifies all possible matches in
RemSQL given the matchers permitted for the previous record's
TokType value. The first step is to select from
Matchers AS MR where
Old.TokType = MR.TokType. It does a
CROSS APPLY to calculate the
Old.RemSQL. Note that the special case of
EF (End of File) for
MatchStr is considered to match past the end of
RemSQL. I use the
CROSS APPLY so I can use the
MatchIndex in three places without repeating the logic. The first use is to verify that
MatchIndex is greater than
0 (we don't care about the matcher if it doesn't match) in the
WHERE clause. The second is to return
MatchIndex in the
SELECT. The third is to use both the
MatchLen to order the matches so that the longest earliest match has precedence. This whole clause uses an
OUTER APPLY because when the code gets to the end of the string, there will likely be no match, but we want to ensure we can still carry out the final recursion. Note that it would be simpler to use
SELECT TOP (1) instead of the
ROW_NUMBER() approach to identify the preferred match, but
TOP (1) isn't supported in a recursive CTE, so the query uses the
ROW_NUMBER() approach. Finally, if the iteration limit is reached on this iteration (
Old.Iter + 1 < Old.MaxIter is not
TRUE), no match will be made and the rest of the query will handle this special case similarly to how it handles the last iteration when fully decommenting T-SQL.
OUTER APPLY ( SELECT C1.MatchIndex, MR.MatchLen, MR.NewTokType AS MatTokType, MR.DepthAdj, ROW_NUMBER() OVER (ORDER BY C1.MatchIndex, MR.MatchLen DESC) AS RowN FROM Matchers AS MR CROSS APPLY ( SELECT CASE WHEN MR.MatchStr = N'EF' THEN RLen.RemLen + 1 ELSE CHARINDEX(MR.MatchStr, Old.RemSQL) END AS MatchIndex ) AS C1 WHERE Old.TokType = MR.OldTokType AND Old.Iter + 1 < Old.MaxIter AND C1.MatchIndex > 0 ) AS M
Having generated zero, one, or more matches, the
WHERE clause at the bottom then filters for the preferred match - either filter for
1 or, if we didn't find any matches, the
NULL from the
OUTER APPLY will work.
WHERE Old.IsFinished = 0 AND (M.RowN = 1 OR M.RowN IS NULL)
From this point on, it's just a bunch of calculations in
CROSS APPLY clauses. Multiple
CROSS APPLY clauses are used because subsequent calculations frequently refer to earlier calculations and to break up the calculations in a somewhat logical manner.
Meff calculation block calculates effective values for
MatchLen, handling the
NULL values that result when there is no match in the
CROSS APPLY ( SELECT ISNULL(M.MatchIndex, RLen.RemLen + 1) AS MatchIndex, ISNULL(M.MatchLen, 0) AS MatchLen ) AS MEff
NULL, it considers a match to have occurred just past the end of
NULL, the match that is considered to have occurred just patch the end of
RemSQLis considered to be a 0-character wide match.
New calculation block determines the new
CROSS APPLY ( SELECT Old.Depth + M.DepthAdj AS Depth, CASE WHEN M.DepthAdj = -1 AND Old.Depth = 1 THEN '_' ELSE ISNULL(M.MatTokType, '_') END AS TokType, Old.Iter + 1 AS Iter ) AS New
New.Depthis done by adding
M.DepthAdjto the previous
- There are two alternatives for
1, then this nestable token is terminating and
NULL, in which case
- The code increments
New.Iterso the query can keep track of the number of iterations.
IsF calculation block determines the new IsFinished value:
CROSS APPLY ( SELECT CASE WHEN New.TokType = '_' AND MEff.MatchIndex - 1 + MEff.MatchLen >= RLen.RemLen THEN 1 ELSE 0 END AS IsFinished ) AS IsF
_and the combination of
MatchLentakes us to or past the end of the previous
RemSQL. While it might be a bit more understandable to simply determine
RemSQLcalculated in the next
CROSS APPLY, that would involve a performance penalty. The SQL Server Query Optimizer doesn't appear to actually instantiate intermediate calculations resulting from
CROSS APPLY, but instead recalculates the entire hierarchy for each use. Calculating
IsFinishedwithout actually generating
RemSQLfor that purpose helps minimize the amount of string manipulation done by the query engine, improving performance.
Mat calculation block identifies useful chunks of
CROSS APPLY ( SELECT LEFT(Old.RemSQL, MEff.MatchIndex - 1) AS LeftSQL, LEFT(Old.RemSQL, MEff.MatchIndex - 1 + MEff.MatchLen) AS LeftMatchSQL, SUBSTRING(Old.RemSQL, MEff.MatchIndex, MEff.MatchLen) AS MatchSQL, STUFF(Old.RemSQL, 1, MEff.MatchIndex - 1 + MEff.MatchLen, N'') AS RemSQL ) AS Mat
LeftSQLis everything before the match.
LeftMatchSQLis everything up through and including the match.
MatchSQLis the characters that actually matched.
- While the query could simply concatenate
MatchSQLin the downstream uses instead of using
LeftMatchSQL, it is more efficient to do one string manipulation than to do two and concatenate them.
RemSQLis everything after the match. The use of
STUFFhere to remove the first n characters from a string solves two problems. The first is that it removes the first n characters from a string without computing the length of the string. This is surprisingly useful, and shows up in the classic use of XML to concatenate strings. The second, and more important, is that the commonly used
RIGHT(Input, LEN(Input) - n)approach doesn't work if the string ends with one or more spaces! As specified by the ANSI standard, SQL Server ignores trailing spaces when computing string length, but does count the trailing spaces when returning a string with
RIGHT. An alternative approach is to use
DATALENGTH(Old.RemSQL) / 2to compute the length, but
TokSQL calculation block determines the new
CROSS APPLY ( SELECT CASE WHEN New.TokType = '_' THEN Old.ActiveSQL + Mat.LeftMatchSQL WHEN Old.TokType = '_' THEN Old.ActiveSQL + Mat.LeftSQL END AS ClosedSQL, CASE WHEN New.TokType = '_' THEN N'' WHEN Old.TokType = '_' THEN Mat.MatchSQL ELSE Old.ActiveSQL + Mat.LeftMatchSQL END AS ActiveSQL ) AS TokSQL
- There are three alternatives for
_, then a token is being terminated, so it concatenates
_, then it is starting a new token, so it only concatenates
Mat.MatchSQLbelongs to the token that is starting. Otherwise, the query is continuing to parse an existing token, so
NULL(the default for an unmatched
- There are three alternatives for
ActiveSQLis an empty string. If
ActiveSQLis set to
Mat.MatchSQLbecause it starts this new token. Otherwise, the query is continuing to parse an existing token, so everything up through and including the match in
Old.RemSQLis appended to
NewSQL calculation block determines the new
CROSS APPLY ( SELECT Old.NewSQL + CASE WHEN New.TokType <> '_' AND Old.TokType <> '_' THEN N'' WHEN M.MatTokType IS NULL THEN TokSQL.ClosedSQL WHEN OldTokType.Replacement IS NOT NULL THEN OldTokType.Replacement ELSE TokSQL.ClosedSQL END AS NewSQL ) AS NewSQL
- There are four alternatives for what is appended to
UpdSQL.NewSQL. If neither
_, then an empty string is appropriate. If
M.MatTokType IS NULL, then we are matching to the end since no match was observed (either because there was no match or because we have reached the iteration limit) and should append
OldTokType.Replacementspecifies a value, then we should throw away the contents of
TokSQL.ClosedSQLand substitute the replacement. If none of the previous are true, then append
Once the appropriate calculations have been performed, the
SELECT clause following the
UNION ALL in this recursive CTE is fairly straightforward - it just returns the calculated values in the appropriate positions to match up with the columns of the anchor set.
SELECT clause for the
Decommenter_Demo.sql query does the following:
- It only selects the most important columns for understanding the behavior (although it is easy to add more if you want to observe more of the internal calculations).
- It adds leading and trailing pipe characters to the
*SQLfields to make leading and trailing whitespace characters easily identifiable in the output.
- It aliases the columns to shorter names in order to make the output easier to view on a small screen.
In the next article, we will examine the comprehensive test suite used to develop and validate this T-SQL Decommenter, as well as documenting the transformations from the test suite query to the production query.