The core of the T-SQL Decommenter is a recursive parser that extracts tokens piece by piece from the T-SQL. The previous article documented the T-SQL elements the Decommenter must handle and matching rules for them, as well as illustrating how the recursive query iterates through a demonstration T-SQL batch. This article explains the details of the code that uses the matching rules to decomment T-SQL.
We'll be walking through the Decomment
Common Table Expresion (CTE) in Decommenter_Demo.sql
. The first chunk is pretty straight forward - it sets up the initial iteration row as described in the previous article
Decomment AS ( SELECT ID, 0 AS Depth, '_' AS TokType, 0 AS Iter, 0 AS IsFinished, CAST(CASE WHEN OrigSQL IS NULL THEN NULL ELSE N'' END AS nvarchar(max)) AS NewSQL, CAST(N'' AS nvarchar(max)) AS ActiveSQL, CAST(OrigSQL AS nvarchar(max)) AS RemSQL, CAST(NULL AS varchar(1)) AS MatTokType, CAST(NULL AS int) AS DepthAdj, CAST(NULL AS bigint) AS MatchIndex, CAST(NULL AS int) AS MatchLen, CAST(NULL AS nvarchar(max)) AS ClosedSQL, OrigSQL, ValidSQL, MaxIter FROM SourceSQL
There a couple of things here to notice:
- The order of the columns is a little strange. The first 8 columns are those required for
Decomment
to function. The next 5 columns are valuable for debugging. The final 3 are related to the test suite. They were placed in this order to simplify eliminating the additional columns when creating the production query. - It uses
ID
fromTest
to identify which T-SQL batch is being decommented (the CTE can decomment multiple T-SQL batches at a time). - The anchor row initializes
NewSQL
to an empty string unless the inboundOrigSQL
isNULL
, in which cases it initializesNewSQL
toNULL
. - The purpose of
MaxIter
is to be able to preemptively stop recursion without relying onMAXRECURSION
in theOPTION
clause for the query. The two issues withMAXRECURSION
are that it doesn't work with a variable and that when the query reaches the limit, iteration stops and the query fails. By having the logic of the query restrict recursion independently ofMAXRECURSION
, it can preemptively stop decommenting the T-SQL and return partially decommented T-SQL. The partial decommenting is done carefully so that the decommenter either completely removes a comment or it returns the entire comment, thus ensuring that it doesn't negatively affect the uncommented portions of the T-SQL that lie beyond an incompletely removed comment due to the recursion limit halting iteration. In the production query,@MaxIter
is used to specify the recursion limit for the query. In the test suite, we need the ability to control the iteration limit on a per-test basis, thus the use of theMaxIter
column.
From this point on, the recursive CTE builds on the previous iteration to generate the next iteration. We'll skip over the SELECT
in the UNION ALL
and come back to it later after we've covered all the interesting calculations.
The first element in the FROM
clause is to reference Decomment
itself - this is what creates the recursive CTE. In addition, there's a predicate in the WHERE
clause at the bottom that refers to Old
- it ensures that once IsFinished
becomes 1
, we stop iterating.
FROM Decomment AS Old ... WHERE Old.IsFinished = 0
The next part of the FROM
clause calculates the length of Old.RemSQL
including trailing spaces. LEN
doesn't include trailing spaces, so DATALENGTH
is put to use, along with dividing by two since the T-SQL is nvarchar
. I really like using CROSS APPLY
to break up complicated calculations into understandable chunks that can be reused in subsequent calculations. When using CROSS APPLY
for this purpose, I put the SELECT
on the same line with the CROSS APPLY
to indicate that this CROSS APPLY
is solely doing calculations and that it does not access additional tables or affect the cardinality of the result.
CROSS APPLY ( SELECT ISNULL(DATALENGTH(Old.RemSQL), 0) / 2 AS RemLen ) AS RLen
TokTypes
CTE for the previous record's TokType
value:JOIN TokTypes AS OldTokType ON Old.TokType = OldTokType.TokType
Next, the query identifies all possible matches in RemSQL
given the matchers permitted for the previous record's TokType
value. The first step is to select from Matchers AS MR
where Old.TokType = MR.TokType
. It does a CROSS APPLY
to calculate the CHARINDEX
for P.Pattern
in Old.RemSQL
. Note that the special case of EF
(End of File) for MatchStr
is considered to match past the end of RemSQL
. I use the CROSS APPLY
so I can use the MatchIndex
in three places without repeating the logic. The first use is to verify that MatchIndex
is greater than 0
(we don't care about the matcher if it doesn't match) in the WHERE
clause. The second is to return MatchIndex
in the SELECT
. The third is to use both the MatchIndex
and MatchLen
to order the matches so that the longest earliest match has precedence. This whole clause uses an OUTER APPLY
because when the code gets to the end of the string, there will likely be no match, but we want to ensure we can still carry out the final recursion. Note that it would be simpler to use SELECT TOP (1)
instead of the ROW_NUMBER()
approach to identify the preferred match, but TOP (1)
isn't supported in a recursive CTE, so the query uses the ROW_NUMBER()
approach. Finally, if the iteration limit is reached on this iteration (Old.Iter + 1 < Old.MaxIter
is not TRUE
), no match will be made and the rest of the query will handle this special case similarly to how it handles the last iteration when fully decommenting T-SQL.
OUTER APPLY ( SELECT C1.MatchIndex, MR.MatchLen, MR.NewTokType AS MatTokType, MR.DepthAdj, ROW_NUMBER() OVER (ORDER BY C1.MatchIndex, MR.MatchLen DESC) AS RowN FROM Matchers AS MR CROSS APPLY ( SELECT CASE WHEN MR.MatchStr = N'EF' THEN RLen.RemLen + 1 ELSE CHARINDEX(MR.MatchStr, Old.RemSQL) END AS MatchIndex ) AS C1 WHERE Old.TokType = MR.OldTokType AND Old.Iter + 1 < Old.MaxIter AND C1.MatchIndex > 0 ) AS M
Having generated zero, one, or more matches, the WHERE
clause at the bottom then filters for the preferred match - either filter for RowN
is 1
or, if we didn't find any matches, the NULL
from the OUTER APPLY
will work.
WHERE Old.IsFinished = 0 AND (M.RowN = 1 OR M.RowN IS NULL)
From this point on, it's just a bunch of calculations in CROSS APPLY
clauses. Multiple CROSS APPLY
clauses are used because subsequent calculations frequently refer to earlier calculations and to break up the calculations in a somewhat logical manner.
The Meff
calculation block calculates effective values for MatchIndex
and MatchLen
, handling the NULL
values that result when there is no match in the OUTER APPLY
:
CROSS APPLY ( SELECT ISNULL(M.MatchIndex, RLen.RemLen + 1) AS MatchIndex, ISNULL(M.MatchLen, 0) AS MatchLen ) AS MEff
- When
MatchIndex
isNULL
, it considers a match to have occurred just past the end ofRemSQL
. - When
MatchLen
isNULL
, the match that is considered to have occurred just patch the end ofRemSQL
is considered to be a 0-character wide match.
The New
calculation block determines the new Depth
, TokType
, and Iter
:
CROSS APPLY ( SELECT Old.Depth + M.DepthAdj AS Depth, CASE WHEN M.DepthAdj = -1 AND Old.Depth = 1 THEN '_' ELSE ISNULL(M.MatTokType, '_') END AS TokType, Old.Iter + 1 AS Iter ) AS New
- Calculating
New.Depth
is done by addingM.DepthAdj
to the previousDepth
. - There are two alternatives for
New.TokType
. WhenM.DepthAdj
is-1
and andOld.Depth
is1
, then this nestable token is terminating andNew.TokType
should be_
. Otherwise,New.TokType
isM.NewTokType
, unlessM.NewTokType
isNULL
, in which caseNew.TokType
is_
. - The code increments
Old.Iter
to generateNew.Iter
so the query can keep track of the number of iterations.
The IsF
calculation block determines the new IsFinished value:
CROSS APPLY ( SELECT CASE WHEN New.TokType = '_' AND MEff.MatchIndex - 1 + MEff.MatchLen >= RLen.RemLen THEN 1 ELSE 0 END AS IsFinished ) AS IsF
-
IsFinished
is1
only ifNew.TokType
is_
and the combination ofMatchIndex
andMatchLen
takes us to or past the end of the previousRemSQL
. While it might be a bit more understandable to simply determineIsFinished
using theRemSQL
calculated in the nextCROSS APPLY
, that would involve a performance penalty. The SQL Server Query Optimizer doesn't appear to actually instantiate intermediate calculations resulting fromCROSS APPLY
, but instead recalculates the entire hierarchy for each use. CalculatingIsFinished
without actually generatingRemSQL
for that purpose helps minimize the amount of string manipulation done by the query engine, improving performance.
The Mat
calculation block identifies useful chunks of Old.RemSQL
:
CROSS APPLY ( SELECT LEFT(Old.RemSQL, MEff.MatchIndex - 1) AS LeftSQL, LEFT(Old.RemSQL, MEff.MatchIndex - 1 + MEff.MatchLen) AS LeftMatchSQL, SUBSTRING(Old.RemSQL, MEff.MatchIndex, MEff.MatchLen) AS MatchSQL, STUFF(Old.RemSQL, 1, MEff.MatchIndex - 1 + MEff.MatchLen, N'') AS RemSQL ) AS Mat
-
LeftSQL
is everything before the match. -
LeftMatchSQL
is everything up through and including the match. -
MatchSQL
is the characters that actually matched. - While the query could simply concatenate
LeftSQL
andMatchSQL
in the downstream uses instead of usingLeftMatchSQL
, it is more efficient to do one string manipulation than to do two and concatenate them. -
RemSQL
is everything after the match. The use ofSTUFF
here to remove the first n characters from a string solves two problems. The first is that it removes the first n characters from a string without computing the length of the string. This is surprisingly useful, and shows up in the classic use of XML to concatenate strings. The second, and more important, is that the commonly usedRIGHT(Input, LEN(Input) - n)
approach doesn't work if the string ends with one or more spaces! As specified by the ANSI standard, SQL Server ignores trailing spaces when computing string length, but does count the trailing spaces when returning a string withRIGHT
. An alternative approach is to useDATALENGTH(Old.RemSQL) / 2
to compute the length, butSTUFF
is easier.
The TokSQL
calculation block determines the new ClosedSQL
and ActiveSQL
values:
CROSS APPLY ( SELECT CASE WHEN New.TokType = '_' THEN Old.ActiveSQL + Mat.LeftMatchSQL WHEN Old.TokType = '_' THEN Old.ActiveSQL + Mat.LeftSQL END AS ClosedSQL, CASE WHEN New.TokType = '_' THEN N'' WHEN Old.TokType = '_' THEN Mat.MatchSQL ELSE Old.ActiveSQL + Mat.LeftMatchSQL END AS ActiveSQL ) AS TokSQL
- There are three alternatives for
ClosedSQL
. IfNew.TokType
is_
, then a token is being terminated, so it concatenatesOld.ActiveSQL
andMat.LeftMatchSQL
. IfOld.TokType
is_
, then it is starting a new token, so it only concatenatesOld.ActiveSQL
andMat.LeftSQL
, sinceMat.MatchSQL
belongs to the token that is starting. Otherwise, the query is continuing to parse an existing token, soClosedSQL
isNULL
(the default for an unmatchedCASE
without anELSE
clause). - There are three alternatives for
ActiveSQL
. IfNew.TokType
is_
, thenActiveSQL
is an empty string. IfOld.TokType
is_
, thenActiveSQL
is set toMat.MatchSQL
because it starts this new token. Otherwise, the query is continuing to parse an existing token, so everything up through and including the match inOld.RemSQL
is appended toOld.ActiveSQL
.
The NewSQL
calculation block determines the new NewSQL
:
CROSS APPLY ( SELECT Old.NewSQL + CASE WHEN New.TokType <> '_' AND Old.TokType <> '_' THEN N'' WHEN M.MatTokType IS NULL THEN TokSQL.ClosedSQL WHEN OldTokType.Replacement IS NOT NULL THEN OldTokType.Replacement ELSE TokSQL.ClosedSQL END AS NewSQL ) AS NewSQL
- There are four alternatives for what is appended to
Old.NewSQL
to createUpdSQL.NewSQL
. If neitherNew.TokType
norOld.TokType
are_
, then an empty string is appropriate. IfM.MatTokType IS NULL
, then we are matching to the end since no match was observed (either because there was no match or because we have reached the iteration limit) and should appendClosedSQL
. IfOldTokType.Replacement
specifies a value, then we should throw away the contents ofTokSQL.ClosedSQL
and substitute the replacement. If none of the previous are true, then appendTokSQL.ClosedSQL
.
Once the appropriate calculations have been performed, the SELECT
clause following the UNION ALL
in this recursive CTE is fairly straightforward - it just returns the calculated values in the appropriate positions to match up with the columns of the anchor set.
The actual SELECT
clause for the Decommenter_Demo.sql
query does the following:
- It only selects the most important columns for understanding the behavior (although it is easy to add more if you want to observe more of the internal calculations).
- It adds leading and trailing pipe characters to the
*SQL
fields to make leading and trailing whitespace characters easily identifiable in the output. - It aliases the columns to shorter names in order to make the output easier to view on a small screen.
In the next article, we will examine the comprehensive test suite used to develop and validate this T-SQL Decommenter, as well as documenting the transformations from the test suite query to the production query.