T-SQL Decommenter Part III

The core of the T-SQL Decommenter is a recursive parser that extracts tokens piece by piece from the T-SQL. The previous article documented the T-SQL elements the Decommenter must handle and matching rules for them, as well as illustrating how the recursive query iterates through a demonstration T-SQL batch. This article explains the details of the code that uses the matching rules to decomment T-SQL.

We'll be walking through the Decomment Common Table Expresion (CTE) in Decommenter_Demo.sql. The first chunk is pretty straight forward - it sets up the initial iteration row as described in the previous article

Decomment AS (
  SELECT
    ID,
    0       AS Depth,
    '_'       AS TokType,
    0       AS Iter,
    0       AS IsFinished,
    CAST(CASE WHEN OrigSQL IS NULL THEN NULL ELSE N'' END AS nvarchar(max))
            AS NewSQL,
    CAST(N'' AS nvarchar(max))  AS ActiveSQL,
    CAST(OrigSQL AS nvarchar(max))  AS RemSQL,
    CAST(NULL AS varchar(1))  AS MatTokType,
    CAST(NULL AS int)   AS DepthAdj,
    CAST(NULL AS bigint)    AS MatchIndex,
    CAST(NULL AS int)   AS MatchLen,
    CAST(NULL AS nvarchar(max)) AS ClosedSQL,
    OrigSQL,
    ValidSQL,
    MaxIter
  FROM SourceSQL

There a couple of things here to notice:

The order of the columns is a little strange. The first 8 columns are those required for Decomment to function. The next 5 columns are valuable for debugging. The final 3 are related to the test suite. They were placed in this order to simplify eliminating the additional columns when creating the production query.
It uses ID from Test to identify which T-SQL batch is being decommented (the CTE can decomment multiple T-SQL batches at a time).
The anchor row initializes NewSQL to an empty string unless the inbound OrigSQL is NULL, in which cases it initializes NewSQL to NULL.
The purpose of MaxIter is to be able to preemptively stop recursion without relying on MAXRECURSION in the OPTION clause for the query. The two issues with MAXRECURSION are that it doesn't work with a variable and that when the query reaches the limit, iteration stops and the query fails. By having the logic of the query restrict recursion independently of MAXRECURSION, it can preemptively stop decommenting the T-SQL and return partially decommented T-SQL. The partial decommenting is done carefully so that the decommenter either completely removes a comment or it returns the entire comment, thus ensuring that it doesn't negatively affect the uncommented portions of the T-SQL that lie beyond an incompletely removed comment due to the recursion limit halting iteration. In the production query, @MaxIter is used to specify the recursion limit for the query. In the test suite, we need the ability to control the iteration limit on a per-test basis, thus the use of the MaxIter column.

From this point on, the recursive CTE builds on the previous iteration to generate the next iteration. We'll skip over the SELECT in the UNION ALL and come back to it later after we've covered all the interesting calculations.

The first element in the FROM clause is to reference Decomment itself - this is what creates the recursive CTE. In addition, there's a predicate in the WHERE clause at the bottom that refers to Old - it ensures that once IsFinished becomes 1, we stop iterating.

FROM Decomment AS Old
...
WHERE Old.IsFinished = 0

The next part of the FROM clause calculates the length of Old.RemSQL including trailing spaces. LEN doesn't include trailing spaces, so DATALENGTH is put to use, along with dividing by two since the T-SQL is nvarchar. I really like using CROSS APPLY to break up complicated calculations into understandable chunks that can be reused in subsequent calculations. When using CROSS APPLY for this purpose, I put the SELECT on the same line with the CROSS APPLY to indicate that this CROSS APPLY is solely doing calculations and that it does not access additional tables or affect the cardinality of the result.

CROSS APPLY ( SELECT
  ISNULL(DATALENGTH(Old.RemSQL), 0) / 2 AS RemLen
) AS RLen

Next, the query retrieves information from the TokTypes CTE for the previous record's TokType value:

JOIN TokTypes AS OldTokType
  ON Old.TokType = OldTokType.TokType

Next, the query identifies all possible matches in RemSQL given the matchers permitted for the previous record's TokType value. The first step is to select from Matchers AS MR where Old.TokType = MR.TokType. It does a CROSS APPLY to calculate the CHARINDEX for P.Pattern in Old.RemSQL. Note that the special case of EF (End of File) for MatchStr is considered to match past the end of RemSQL. I use the CROSS APPLY so I can use the MatchIndex in three places without repeating the logic. The first use is to verify that MatchIndex is greater than 0 (we don't care about the matcher if it doesn't match) in the WHERE clause. The second is to return MatchIndex in the SELECT. The third is to use both the MatchIndex and MatchLen to order the matches so that the longest earliest match has precedence. This whole clause uses an OUTER APPLY because when the code gets to the end of the string, there will likely be no match, but we want to ensure we can still carry out the final recursion. Note that it would be simpler to use SELECT TOP (1) instead of the ROW_NUMBER() approach to identify the preferred match, but TOP (1) isn't supported in a recursive CTE, so the query uses the ROW_NUMBER() approach. Finally, if the iteration limit is reached on this iteration (Old.Iter + 1 < Old.MaxIter is not TRUE), no match will be made and the rest of the query will handle this special case similarly to how it handles the last iteration when fully decommenting T-SQL.

OUTER APPLY (
  SELECT
    C1.MatchIndex,
    MR.MatchLen,
    MR.NewTokType AS MatTokType,
    MR.DepthAdj,
    ROW_NUMBER() OVER (ORDER BY C1.MatchIndex, MR.MatchLen DESC) AS RowN
  FROM Matchers AS MR
    CROSS APPLY ( SELECT
      CASE WHEN MR.MatchStr = N'EF' THEN RLen.RemLen + 1
         ELSE CHARINDEX(MR.MatchStr, Old.RemSQL)
      END AS MatchIndex
    ) AS C1
  WHERE Old.TokType = MR.OldTokType
    AND Old.Iter + 1 < Old.MaxIter
    AND C1.MatchIndex > 0
) AS M

Having generated zero, one, or more matches, the WHERE clause at the bottom then filters for the preferred match - either filter for RowN is 1 or, if we didn't find any matches, the NULL from the OUTER APPLY will work.

WHERE Old.IsFinished = 0
  AND (M.RowN = 1 OR M.RowN IS NULL)

From this point on, it's just a bunch of calculations in CROSS APPLY clauses. Multiple CROSS APPLY clauses are used because subsequent calculations frequently refer to earlier calculations and to break up the calculations in a somewhat logical manner.

The Meff calculation block calculates effective values for MatchIndex and MatchLen, handling the NULL values that result when there is no match in the OUTER APPLY:

CROSS APPLY ( SELECT
  ISNULL(M.MatchIndex, RLen.RemLen + 1) AS MatchIndex,
  ISNULL(M.MatchLen, 0) AS MatchLen
) AS MEff

When MatchIndex is NULL, it considers a match to have occurred just past the end of RemSQL.
When MatchLen is NULL, the match that is considered to have occurred just patch the end of RemSQL is considered to be a 0-character wide match.

The New calculation block determines the new Depth, TokType, and Iter:

CROSS APPLY ( SELECT
  Old.Depth + M.DepthAdj AS Depth,
  CASE WHEN M.DepthAdj = -1 AND Old.Depth = 1 THEN '_'
       ELSE ISNULL(M.MatTokType, '_')
  END AS TokType,
  Old.Iter + 1 AS Iter
) AS New

Calculating New.Depth is done by adding M.DepthAdj to the previous Depth.
There are two alternatives for New.TokType. When M.DepthAdj is -1 and and Old.Depth is 1, then this nestable token is terminating and New.TokType should be _. Otherwise, New.TokType is M.NewTokType, unless M.NewTokType is NULL, in which case New.TokType is _.
The code increments Old.Iter to generate New.Iter so the query can keep track of the number of iterations.

The IsF calculation block determines the new IsFinished value:

CROSS APPLY ( SELECT
  CASE WHEN New.TokType = '_'
    AND MEff.MatchIndex - 1 + MEff.MatchLen >= RLen.RemLen THEN 1
       ELSE 0 END AS IsFinished
) AS IsF

IsFinished is 1 only if New.TokType is _ and the combination of MatchIndex and MatchLen takes us to or past the end of the previous RemSQL. While it might be a bit more understandable to simply determine IsFinished using the RemSQL calculated in the next CROSS APPLY, that would involve a performance penalty. The SQL Server Query Optimizer doesn't appear to actually instantiate intermediate calculations resulting from CROSS APPLY, but instead recalculates the entire hierarchy for each use. Calculating IsFinished without actually generating RemSQL for that purpose helps minimize the amount of string manipulation done by the query engine, improving performance.

The Mat calculation block identifies useful chunks of Old.RemSQL:

CROSS APPLY ( SELECT
  LEFT(Old.RemSQL, MEff.MatchIndex - 1) AS LeftSQL,
  LEFT(Old.RemSQL, MEff.MatchIndex - 1 + MEff.MatchLen) AS LeftMatchSQL,
  SUBSTRING(Old.RemSQL, MEff.MatchIndex, MEff.MatchLen) AS MatchSQL,
  STUFF(Old.RemSQL, 1, MEff.MatchIndex - 1 + MEff.MatchLen, N'') AS RemSQL
) AS Mat

LeftSQL is everything before the match.
LeftMatchSQL is everything up through and including the match.
MatchSQL is the characters that actually matched.
While the query could simply concatenate LeftSQL and MatchSQL in the downstream uses instead of using LeftMatchSQL, it is more efficient to do one string manipulation than to do two and concatenate them.
RemSQL is everything after the match. The use of STUFF here to remove the first n characters from a string solves two problems. The first is that it removes the first n characters from a string without computing the length of the string. This is surprisingly useful, and shows up in the classic use of XML to concatenate strings. The second, and more important, is that the commonly used RIGHT(Input, LEN(Input) - n) approach doesn't work if the string ends with one or more spaces! As specified by the ANSI standard, SQL Server ignores trailing spaces when computing string length, but does count the trailing spaces when returning a string with RIGHT. An alternative approach is to use DATALENGTH(Old.RemSQL) / 2 to compute the length, but STUFF is easier.

The TokSQL calculation block determines the new ClosedSQL and ActiveSQL values:

CROSS APPLY ( SELECT
  CASE WHEN New.TokType = '_' THEN Old.ActiveSQL + Mat.LeftMatchSQL
       WHEN Old.TokType = '_' THEN Old.ActiveSQL + Mat.LeftSQL
  END AS ClosedSQL,
  CASE WHEN New.TokType = '_' THEN N''
       WHEN Old.TokType = '_' THEN Mat.MatchSQL
       ELSE Old.ActiveSQL + Mat.LeftMatchSQL
  END AS ActiveSQL
) AS TokSQL

There are three alternatives for ClosedSQL. If New.TokType is _, then a token is being terminated, so it concatenates Old.ActiveSQL and Mat.LeftMatchSQL. If Old.TokType is _, then it is starting a new token, so it only concatenates Old.ActiveSQL and Mat.LeftSQL, since Mat.MatchSQL belongs to the token that is starting. Otherwise, the query is continuing to parse an existing token, so ClosedSQL is NULL (the default for an unmatched CASE without an ELSE clause).
There are three alternatives for ActiveSQL. If New.TokType is _, then ActiveSQL is an empty string. If Old.TokType is _, then ActiveSQL is set to Mat.MatchSQL because it starts this new token. Otherwise, the query is continuing to parse an existing token, so everything up through and including the match in Old.RemSQL is appended to Old.ActiveSQL.

The NewSQL calculation block determines the new NewSQL:

CROSS APPLY ( SELECT
  Old.NewSQL + CASE
    WHEN New.TokType <> '_' AND Old.TokType <> '_' THEN N''
    WHEN M.MatTokType IS NULL THEN TokSQL.ClosedSQL
    WHEN OldTokType.Replacement IS NOT NULL THEN OldTokType.Replacement
    ELSE TokSQL.ClosedSQL
  END AS NewSQL
) AS NewSQL

There are four alternatives for what is appended to Old.NewSQL to create UpdSQL.NewSQL. If neither New.TokType nor Old.TokType are _, then an empty string is appropriate. If M.MatTokType IS NULL, then we are matching to the end since no match was observed (either because there was no match or because we have reached the iteration limit) and should append ClosedSQL. If OldTokType.Replacement specifies a value, then we should throw away the contents of TokSQL.ClosedSQL and substitute the replacement. If none of the previous are true, then append TokSQL.ClosedSQL.

Once the appropriate calculations have been performed, the SELECT clause following the UNION ALL in this recursive CTE is fairly straightforward - it just returns the calculated values in the appropriate positions to match up with the columns of the anchor set.

The actual SELECT clause for the Decommenter_Demo.sql query does the following:

It only selects the most important columns for understanding the behavior (although it is easy to add more if you want to observe more of the internal calculations).
It adds leading and trailing pipe characters to the *SQL fields to make leading and trailing whitespace characters easily identifiable in the output.
It aliases the columns to shorter names in order to make the output easier to view on a small screen.

In the next article, we will examine the comprehensive test suite used to develop and validate this T-SQL Decommenter, as well as documenting the transformations from the test suite query to the production query.

Resources

Rate

Share

Categories

Share

Rate

T-SQL Decommenter Part III

Resources

Rate

Share

Categories

Share

Rate

Related content

A Normalization Primer

Reaching the Outer Limits

Working with email addresses in SQL Server!

Querying Responses

Storing Checkbox Responses as Integers