The core of the T-SQL Decommenter is a recursive parser that extracts tokens piece by piece from the T-SQL. The previous article documented the T-SQL elements the Decommenter must handle and matching rules for them, as well as illustrating how the recursive query iterates through a demonstration T-SQL batch. This article explains the details of the code that uses the matching rules to decomment T-SQL.
We'll be walking through the Decomment Common Table Expresion (CTE) in Decommenter_Demo.sql. The first chunk is pretty straight forward - it sets up the initial iteration row as described in the previous article
Decomment AS (
SELECT
ID,
0 AS Depth,
'_' AS TokType,
0 AS Iter,
0 AS IsFinished,
CAST(CASE WHEN OrigSQL IS NULL THEN NULL ELSE N'' END AS nvarchar(max))
AS NewSQL,
CAST(N'' AS nvarchar(max)) AS ActiveSQL,
CAST(OrigSQL AS nvarchar(max)) AS RemSQL,
CAST(NULL AS varchar(1)) AS MatTokType,
CAST(NULL AS int) AS DepthAdj,
CAST(NULL AS bigint) AS MatchIndex,
CAST(NULL AS int) AS MatchLen,
CAST(NULL AS nvarchar(max)) AS ClosedSQL,
OrigSQL,
ValidSQL,
MaxIter
FROM SourceSQL
There a couple of things here to notice:
- The order of the columns is a little strange. The first 8 columns are those required for
Decommentto function. The next 5 columns are valuable for debugging. The final 3 are related to the test suite. They were placed in this order to simplify eliminating the additional columns when creating the production query. - It uses
IDfromTestto identify which T-SQL batch is being decommented (the CTE can decomment multiple T-SQL batches at a time). - The anchor row initializes
NewSQLto an empty string unless the inboundOrigSQLisNULL, in which cases it initializesNewSQLtoNULL. - The purpose of
MaxIteris to be able to preemptively stop recursion without relying onMAXRECURSIONin theOPTIONclause for the query. The two issues withMAXRECURSIONare that it doesn't work with a variable and that when the query reaches the limit, iteration stops and the query fails. By having the logic of the query restrict recursion independently ofMAXRECURSION, it can preemptively stop decommenting the T-SQL and return partially decommented T-SQL. The partial decommenting is done carefully so that the decommenter either completely removes a comment or it returns the entire comment, thus ensuring that it doesn't negatively affect the uncommented portions of the T-SQL that lie beyond an incompletely removed comment due to the recursion limit halting iteration. In the production query,@MaxIteris used to specify the recursion limit for the query. In the test suite, we need the ability to control the iteration limit on a per-test basis, thus the use of theMaxItercolumn.
From this point on, the recursive CTE builds on the previous iteration to generate the next iteration. We'll skip over the SELECT in the UNION ALL and come back to it later after we've covered all the interesting calculations.
The first element in the FROM clause is to reference Decomment itself - this is what creates the recursive CTE. In addition, there's a predicate in the WHERE clause at the bottom that refers to Old - it ensures that once IsFinished becomes 1, we stop iterating.
FROM Decomment AS Old ... WHERE Old.IsFinished = 0
The next part of the FROM clause calculates the length of Old.RemSQL including trailing spaces. LEN doesn't include trailing spaces, so DATALENGTH is put to use, along with dividing by two since the T-SQL is nvarchar. I really like using CROSS APPLY to break up complicated calculations into understandable chunks that can be reused in subsequent calculations. When using CROSS APPLY for this purpose, I put the SELECT on the same line with the CROSS APPLY to indicate that this CROSS APPLY is solely doing calculations and that it does not access additional tables or affect the cardinality of the result.
CROSS APPLY ( SELECT ISNULL(DATALENGTH(Old.RemSQL), 0) / 2 AS RemLen ) AS RLen
TokTypes CTE for the previous record's TokType value:JOIN TokTypes AS OldTokType ON Old.TokType = OldTokType.TokType
Next, the query identifies all possible matches in RemSQL given the matchers permitted for the previous record's TokType value. The first step is to select from Matchers AS MR where Old.TokType = MR.TokType. It does a CROSS APPLY to calculate the CHARINDEX for P.Pattern in Old.RemSQL. Note that the special case of EF (End of File) for MatchStr is considered to match past the end of RemSQL. I use the CROSS APPLY so I can use the MatchIndex in three places without repeating the logic. The first use is to verify that MatchIndex is greater than 0 (we don't care about the matcher if it doesn't match) in the WHERE clause. The second is to return MatchIndex in the SELECT. The third is to use both the MatchIndex and MatchLen to order the matches so that the longest earliest match has precedence. This whole clause uses an OUTER APPLY because when the code gets to the end of the string, there will likely be no match, but we want to ensure we can still carry out the final recursion. Note that it would be simpler to use SELECT TOP (1) instead of the ROW_NUMBER() approach to identify the preferred match, but TOP (1) isn't supported in a recursive CTE, so the query uses the ROW_NUMBER() approach. Finally, if the iteration limit is reached on this iteration (Old.Iter + 1 < Old.MaxIter is not TRUE), no match will be made and the rest of the query will handle this special case similarly to how it handles the last iteration when fully decommenting T-SQL.
OUTER APPLY (
SELECT
C1.MatchIndex,
MR.MatchLen,
MR.NewTokType AS MatTokType,
MR.DepthAdj,
ROW_NUMBER() OVER (ORDER BY C1.MatchIndex, MR.MatchLen DESC) AS RowN
FROM Matchers AS MR
CROSS APPLY ( SELECT
CASE WHEN MR.MatchStr = N'EF' THEN RLen.RemLen + 1
ELSE CHARINDEX(MR.MatchStr, Old.RemSQL)
END AS MatchIndex
) AS C1
WHERE Old.TokType = MR.OldTokType
AND Old.Iter + 1 < Old.MaxIter
AND C1.MatchIndex > 0
) AS M
Having generated zero, one, or more matches, the WHERE clause at the bottom then filters for the preferred match - either filter for RowN is 1 or, if we didn't find any matches, the NULL from the OUTER APPLY will work.
WHERE Old.IsFinished = 0 AND (M.RowN = 1 OR M.RowN IS NULL)
From this point on, it's just a bunch of calculations in CROSS APPLY clauses. Multiple CROSS APPLY clauses are used because subsequent calculations frequently refer to earlier calculations and to break up the calculations in a somewhat logical manner.
The Meff calculation block calculates effective values for MatchIndex and MatchLen, handling the NULL values that result when there is no match in the OUTER APPLY:
CROSS APPLY ( SELECT ISNULL(M.MatchIndex, RLen.RemLen + 1) AS MatchIndex, ISNULL(M.MatchLen, 0) AS MatchLen ) AS MEff
- When
MatchIndexisNULL, it considers a match to have occurred just past the end ofRemSQL. - When
MatchLenisNULL, the match that is considered to have occurred just patch the end ofRemSQLis considered to be a 0-character wide match.
The New calculation block determines the new Depth, TokType, and Iter:
CROSS APPLY ( SELECT
Old.Depth + M.DepthAdj AS Depth,
CASE WHEN M.DepthAdj = -1 AND Old.Depth = 1 THEN '_'
ELSE ISNULL(M.MatTokType, '_')
END AS TokType,
Old.Iter + 1 AS Iter
) AS New - Calculating
New.Depthis done by addingM.DepthAdjto the previousDepth. - There are two alternatives for
New.TokType. WhenM.DepthAdjis-1and andOld.Depthis1, then this nestable token is terminating andNew.TokTypeshould be_. Otherwise,New.TokTypeisM.NewTokType, unlessM.NewTokTypeisNULL, in which caseNew.TokTypeis_. - The code increments
Old.Iterto generateNew.Iterso the query can keep track of the number of iterations.
The IsF calculation block determines the new IsFinished value:
CROSS APPLY ( SELECT
CASE WHEN New.TokType = '_'
AND MEff.MatchIndex - 1 + MEff.MatchLen >= RLen.RemLen THEN 1
ELSE 0 END AS IsFinished
) AS IsF
-
IsFinishedis1only ifNew.TokTypeis_and the combination ofMatchIndexandMatchLentakes us to or past the end of the previousRemSQL. While it might be a bit more understandable to simply determineIsFinishedusing theRemSQLcalculated in the nextCROSS APPLY, that would involve a performance penalty. The SQL Server Query Optimizer doesn't appear to actually instantiate intermediate calculations resulting fromCROSS APPLY, but instead recalculates the entire hierarchy for each use. CalculatingIsFinishedwithout actually generatingRemSQLfor that purpose helps minimize the amount of string manipulation done by the query engine, improving performance.
The Mat calculation block identifies useful chunks of Old.RemSQL:
CROSS APPLY ( SELECT LEFT(Old.RemSQL, MEff.MatchIndex - 1) AS LeftSQL, LEFT(Old.RemSQL, MEff.MatchIndex - 1 + MEff.MatchLen) AS LeftMatchSQL, SUBSTRING(Old.RemSQL, MEff.MatchIndex, MEff.MatchLen) AS MatchSQL, STUFF(Old.RemSQL, 1, MEff.MatchIndex - 1 + MEff.MatchLen, N'') AS RemSQL ) AS Mat
-
LeftSQLis everything before the match. -
LeftMatchSQLis everything up through and including the match. -
MatchSQLis the characters that actually matched. - While the query could simply concatenate
LeftSQLandMatchSQLin the downstream uses instead of usingLeftMatchSQL, it is more efficient to do one string manipulation than to do two and concatenate them. -
RemSQLis everything after the match. The use ofSTUFFhere to remove the first n characters from a string solves two problems. The first is that it removes the first n characters from a string without computing the length of the string. This is surprisingly useful, and shows up in the classic use of XML to concatenate strings. The second, and more important, is that the commonly usedRIGHT(Input, LEN(Input) - n)approach doesn't work if the string ends with one or more spaces! As specified by the ANSI standard, SQL Server ignores trailing spaces when computing string length, but does count the trailing spaces when returning a string withRIGHT. An alternative approach is to useDATALENGTH(Old.RemSQL) / 2to compute the length, butSTUFFis easier.
The TokSQL calculation block determines the new ClosedSQL and ActiveSQL values:
CROSS APPLY ( SELECT
CASE WHEN New.TokType = '_' THEN Old.ActiveSQL + Mat.LeftMatchSQL
WHEN Old.TokType = '_' THEN Old.ActiveSQL + Mat.LeftSQL
END AS ClosedSQL,
CASE WHEN New.TokType = '_' THEN N''
WHEN Old.TokType = '_' THEN Mat.MatchSQL
ELSE Old.ActiveSQL + Mat.LeftMatchSQL
END AS ActiveSQL
) AS TokSQL
- There are three alternatives for
ClosedSQL. IfNew.TokTypeis_, then a token is being terminated, so it concatenatesOld.ActiveSQLandMat.LeftMatchSQL. IfOld.TokTypeis_, then it is starting a new token, so it only concatenatesOld.ActiveSQLandMat.LeftSQL, sinceMat.MatchSQLbelongs to the token that is starting. Otherwise, the query is continuing to parse an existing token, soClosedSQLisNULL(the default for an unmatchedCASEwithout anELSEclause). - There are three alternatives for
ActiveSQL. IfNew.TokTypeis_, thenActiveSQLis an empty string. IfOld.TokTypeis_, thenActiveSQLis set toMat.MatchSQLbecause it starts this new token. Otherwise, the query is continuing to parse an existing token, so everything up through and including the match inOld.RemSQLis appended toOld.ActiveSQL.
The NewSQL calculation block determines the new NewSQL:
CROSS APPLY ( SELECT
Old.NewSQL + CASE
WHEN New.TokType <> '_' AND Old.TokType <> '_' THEN N''
WHEN M.MatTokType IS NULL THEN TokSQL.ClosedSQL
WHEN OldTokType.Replacement IS NOT NULL THEN OldTokType.Replacement
ELSE TokSQL.ClosedSQL
END AS NewSQL
) AS NewSQL
- There are four alternatives for what is appended to
Old.NewSQLto createUpdSQL.NewSQL. If neitherNew.TokTypenorOld.TokTypeare_, then an empty string is appropriate. IfM.MatTokType IS NULL, then we are matching to the end since no match was observed (either because there was no match or because we have reached the iteration limit) and should appendClosedSQL. IfOldTokType.Replacementspecifies a value, then we should throw away the contents ofTokSQL.ClosedSQLand substitute the replacement. If none of the previous are true, then appendTokSQL.ClosedSQL.
Once the appropriate calculations have been performed, the SELECT clause following the UNION ALL in this recursive CTE is fairly straightforward - it just returns the calculated values in the appropriate positions to match up with the columns of the anchor set.
The actual SELECT clause for the Decommenter_Demo.sql query does the following:
- It only selects the most important columns for understanding the behavior (although it is easy to add more if you want to observe more of the internal calculations).
- It adds leading and trailing pipe characters to the
*SQLfields to make leading and trailing whitespace characters easily identifiable in the output. - It aliases the columns to shorter names in order to make the output easier to view on a small screen.
In the next article, we will examine the comprehensive test suite used to develop and validate this T-SQL Decommenter, as well as documenting the transformations from the test suite query to the production query.