SQLServerCentral Article

T-SQL Decommenter Part II

,

This is the second part of the series on removing comments from T-SQL code. You can see Part 1 here.

In order to eliminate comments from T-SQL, it is critical to understand the grammar rules of the T-SQL parser as they affect comments. I used SSMS and query execution to clarify my understanding of the rules. There are five main T-SQL elements that must be understood in order to safely remove T-SQL comments:

  • Line Comment: This starts with a double-hyphen (--) and is terminated by the first subsequent newline character (CHAR(10)). Note that the double-hyphen does not initiate a Line Comment if it is nested inside any of the other elements discussed here.
  • Block Comment: This starts with a /* and is terminated by the subsequent matching */. Note that Block Comments supports nesting. For example, in /* /* */ SELECT 1 Foo*/, the SELECT statement will not execute. Similar to Line Comments, a /* does not initiate a Block Comment if it is nested inside any of the other elements (including a Line Comment).
  • String Literal: This starts with and is terminated by a single-quote ('). The sequence '' within the String Literal indicates a literal single-quote within the string and does not terminate the String Literal. Thus, the string '' consisting of two consecutive single-quotes is specified as ''''''. A single-quote within any other element listed here has no special behavior.
  • T-SQL Delimited Identifier: This starts with a left square bracket ([) and is terminated by a right square bracket (]). Additional [ characters within the Identifier do not need to be escaped, but the sequence ]] indicates a literal ] within the Identifier and does not terminate the Identifier. As an example, the column [] (which is a horrible name to give a column) could be specified as [[]]] (or more simply as "[]" using the ANSI Delimited Identifier syntax). A left square bracket within any other element listed here has no special behavior.
  • ANSI Delimited Identifier: This starts with and is terminated by a double-quote ("). The sequence "" within the Identifier indicates a literal double-quote within the Identifier and does not terminate the Identifier. Thus, the table name "" (which is a horrible name to give a table) could be specified as """""" (or more simply as [""] using the T-SQL Delimited Identifier syntax). A double-quote within any other element listed here has no special behavior.

These five token types have the benefit that the query does not need to worry about nesting them within another token type. T-SQL batches can be treated as a sequence of characters with these 5 tokens interspersed. The T-SQL that isn't one of these five token types will be considered to be "Normal" T-SQL. The two Comment token types will need to be removed from the T-SQL, and the other three token types will be retained along with the "Normal" T-SQL. The T-SQL will alternate between Normal sequences, potentially of zero characters, and one of these five token types. The T-SQL always starts and ends with a Normal sequence (again, possibly of zero characters).

In the query, I use char(1) mnemonics to identify these token (element) types. For each token type, a Replacement string is specified, with NULL indicating that the token should be passed through unmodified into the decommented T-SQL. The two comment token types have replacements: Line Comment with a Carriage Return/Line Feed combination and Block Comment with a single space.

TokTypes AS (
  SELECT  TokType,
    Replacement
  FROM ( VALUES
      ('_' /*Normal*/,    NULL),
      ('L' /*Line comment*/,    NCHAR(13)+NCHAR(10)),
      ('B' /*Block comment*/,   N' '),
      ('S' /*String literal*/,  NULL),
      ('T' /*T-SQL identifier*/,  NULL),
      ('A' /*ANSI identifier*/, NULL)
    ) AS V(TokType, Replacement)
),

The query also defines a number of matchers. For each token type that the decommenter has started matching, there is a restricted set of strings that are of interest for matching. For each string that can match, the query needs to know the new token type as well as a depth adjustment value used for token types that support nesting (i.e. Block Comments). In addition, the length of the string matched is of use, so that will be calculated in this CTE. One of the matching strings is a special case - the string EF is used to signify "End of File". When matching a Line Comment (e.g. -- Foo), it is legal to terminate it with either a Line Feed character or with the end of the T-SQL. This stands in contrast to all of the other tokens that are required to be "properly" terminated.

Matchers AS (
  SELECT  OldTokType,
    MatchStr,
    NewTokType,
    DepthAdj,
    ISNULL(LEN(MatchStr), 0) AS MatchLen
  FROM ( VALUES
      ('_', N'--',    'L',  0),
      ('_', N'/*',    'B',  1),
      ('_', N'''',    'S',  0),
      ('_', N'[',     'T',  0),
      ('_', N'"',     'A',  0),
      ('L', NCHAR(10),'_',  0),
      ('L', N'EF',    '_',  0),
      ('B', N'/*',    'B',  1),
      ('B', N'*/',    'B',  -1),
      ('S', N'''',    '_',  0),
      ('S', N'''''',  'S',  0),
      ('T', N']',     '_',  0),
      ('T', N']]',    'T',  0),
      ('A', N'"',     '_',  0),
      ('A', N'""',    'A',  0)
    ) AS V(OldTokType, MatchStr, NewTokType, DepthAdj)
),

The above Table Value Constructors specify all of the token-specific rules used by the T-SQL Decommenter. With those rules in mind, let's walk through how the T-SQL Decommenter removes comments from a sample batch. Download the attached Decommenter_Demo.sql query and execute it. We will first walk through the recursive output one pair of lines at a time. Once the method the query uses to decomment is understood, I will explain the details of the query in a subsequent article.

I have shortened the column names in this query in order to keep the output narrow enough fit on a screen. The columns are:

  • Itr: Iter, the iteration number for the recursive CTE, where 0 is the anchor row
  • mIdx: MatchIndex, the index position for the match within the previous iteration's RemSQL
  • mLen: MatchLen for the identified match from the Matchers CTE
  • mTok: NewTokType for the identified match from the Matchers CTE
  • mDpA: DepthAdj for the identified match from the Matchers CTE
  • Dep: Depth, the depth of nesting for nested tokens (only Block Comments support nesting)
  • Tok: TokType, the new token type to be used in the subsequent iteration for matching
  • PipedActiveSQL: ActiveSQL, the current token as it is iteratively identified by the recursive CTE, surrounded by pipe characters
  • PipedClosedSQL: ClosedSQL, the current token once it has been fully identified by the recursive CTE and is ready to append to NewSQL, surrounded by pipe characters
  • PipedNewSQL: NewSQL, the decommented T-SQL, as it is iteratively created by the recursive CTE, surrounded by pipe (|) characters in order to make leading and trailing whitespace visible
  • PipedRemSQL: RemSQL, the remaining unprocessed T-SQL, surrounded by pipe characters
  • IsF: IsFinished, a boolean indicating whether the recursive decommenting of the T-SQL has completed

With those definitions in mind, let's take a look at how the query decomments a sample batch of T-SQL. Note that it can really help to use a modern SSMS for executing this code with "Retain CR/LF on copy or save" enabled for "Results to Grid" Query Results. This permits copying output columns that have CR/LF sequences and pasting those into another query window to view the contents with the proper line breaks.

The sample batch of T-SQL used in this demo is:

ALTER PROC [F[o]]o]
/*Hi -- <- ignore /*nested*/ more*/AS
SELECT *, 'Str w/ ''quotes''' AS Foo--, Bas
FROM "B""ar";

When decommented, we expect the T-SQL to look like so:

ALTER PROC [F[o]]o]
 
AS
SELECT *, 'Str w/ ''quotes''' AS Foo
FROM "B""ar";

Running Decommenter_Demo.sql will output 18 rows, but we will walk through them in pairs as I explain what is happening.

The first Itr, 0, is the anchor row. The query initializes the match values with NULL (since this anchor row involves no matching). Dep is initialized with 0 and Tok is initialized with _ (Normal). ActiveSQL is initialized with an empty string, ClosedSQL with NULL, NewSQL with an empty string, and RemSQL with the T-SQL to be decommented. Finally, IsF is 0.

  • In Itr 1, Since Tok from the previous iteration is _, the query starts matching Normal SQL and all 5 token types can match their start patterns.
  • The earliest match in the previous RemSQL is for the [ that starts a T-SQL Delimited Identifier at mIdx 12 with a mLen of 1, resulting in mTok of T and mDpA of 0.
  • Because mDpA is 0, Tok is set to mTok (T) and Dep remains 0.
  • Because Tok from the previous iteration is _, only the contents of the match ([) belong to this newly started T token, and so they are appended to ActiveSQL.
  • Because Tok from the previous iteration is _, all of the SQL preceding the match is Normal SQL and so it is placed in ClosedSQL.
  • Because ClosedSQL is not NULL and Tok from the previous iteration doesn't specify a Replacement, ClosedSQL is appended to NewSQL.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 2, since Tok from the previous iteration is T, the query continues matching the T-SQL Delimited Identifier, and there are two possible matches, ] and ]].
  • The earliest match in the previous RemSQL is the same for both ] and ]] at mIdx 4, so the longer match with mLen of 2 is selected, resulting in mTok of T and mDpA of 0.
  • Because mDpA is 0, Tok is set to mTok (T) and Dep remains 0.
  • Because Tok from the previous iteration and this iteration are both not _, everything up through and including the match belongs to the current T token and is thus appended to ActiveSQL.
  • Because Tok from the previous iteration and this iteration are both not _, ClosedSQL is set to NULL.
  • Because ClosedSQL is NULL, NewSQL is left unmodified.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 3, since Tok from the previous iteration is T, the query continues matching the T-SQL Delimited Identifier, and there are two possible matches, ] and ]].
  • The earliest match in the previous RemSQL is the single ] at mIdx 2 with a mLen of 1, resulting in mTok of _, indicating the termination of the previous token and a return to matching Normal SQL, and mDpA of 0.
  • Because Tok from this iteration is _, ActiveSQL is set to an empty string.
  • Because Tok from this iteration is _, the previous ActiveSQL is concatenated with everything up through and including the match and the result placed in ClosedSQL.
  • Because ClosedSQL is not NULL and Tok from the previous iteration doesn't specify a Replacement, ClosedSQL is appended to NewSQL.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 4, since Tok from the previous iteration is _, the query resumes matching normal SQL and all 5 token types can match their start patterns.
  • The earliest match in the previous RemSQL is the /* at mIdx 3 with a mLen of 2, resulting in mTok of B and mDpA of 1, indicating an increase in the nesting level for a Block Comment.
  • Because mDpA is 1, Dep is incremented to 1. Because Dep is greater than 0, Tok is set to mTok (B).
  • Because Tok from the previous iteration is _, only the contents of the match (/*) are appended to ActiveSQL.
  • Because Tok from the previous iteration is _, all of the SQL preceding the match is placed in ClosedSQL.
  • Because ClosedSQL is not NULL and Tok from the previous iteration doesn't specify a Replacement, ClosedSQL is appended to NewSQL.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 5, since Tok from the previous iteration is B, the query continues matching the Block Comment, and there are two possible matches, /* and */.
  • The earliest match in the previous RemSQL is the /* at mIdx 17 with a mLen of 2, resulting in mTok of B and mDpA of 1, indicating an increase in the nesting level for a Block Comment.
  • Because mDpA is 1, Dep is incremented to 2. Because Dep is greater than 0, Tok is set to mTok (B).
  • Because Tok from the previous iteration and this iteration are both not _, everything up through and including the match is appended to ActiveSQL.
  • Because Tok from the previous iteration and this iteration are both not _, ClosedSQL is set to NULL.
  • Because ClosedSQL is NULL, NewSQL is left unmodified.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 6, since Tok from the previous iteration is B, the query continues matching the Block Comment, and there are two possible matches, /* and */.
  • The earliest match in the previous RemSQL is the */ at mIdx 7 with a mLen of 2, resulting in mTok of B and mDpA of -1, indicating an decrease in the nesting level for a Block Comment.
  • Because mDpA is -1, Dep is decremented to 1. Because Dep is greater than 0, Tok is set to mTok (B).
  • Because Tok from the previous iteration and this iteration are both not _, everything up through and including the match is appended to ActiveSQL.
  • Because Tok from the previous iteration and this iteration are both not _, ClosedSQL is set to NULL.
  • Because ClosedSQL is NULL, NewSQL is left unmodified.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 7, since Tok from the previous iteration is B, the query continues matching the Block Comment, and there are two possible matches, /* and */.
  • The earliest match in the previous RemSQL is the */ at mIdx 6 with a mLen of 2, resulting in mTok of B and mDpA of -1, indicating an decrease in the nesting level for a Block Comment.
  • Because mDpA is -1, Dep is decremented to 0. Because Dep is now 0, Tok is set to _, indicating the termination of this Block Comment.
  • Because Tok from this iteration is _, ActiveSQL is set to an empty string.
  • Because Tok from this iteration is _, the previous ActiveSQL is concatenated with everything up through and including the match and the result placed in ClosedSQL.
  • Because ClosedSQL is not NULL and Tok from the previous iteration specifies a Replacement, the Replacement (a single space) is appended to NewSQL.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 8, Since Tok from the previous iteration is _, the query resumes matching normal SQL and all 5 token types can match their start patterns.
  • The earliest match in the previous RemSQL is the ' that starts a String Literal at mIdx 17 with a mLen of 1, resulting in mTok of S and mDpA of 0.
  • Because mDpA is 0, Tok is set to mTok (S) and Dep remains 0.
  • Because Tok from the previous iteration is _, only the contents of the match (') are appended to ActiveSQL.
  • Because Tok from the previous iteration is _, all of the SQL preceding the match is placed in ClosedSQL.
  • Because ClosedSQL is not NULL and Tok from the previous iteration doesn't specify a Replacement, ClosedSQL is appended to NewSQL.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 9, since Tok from the previous iteration is S, the query continues matching the String Literal, and there are two possible matches, ' and ''.
  • The earliest match in the previous RemSQL is the same for both ' and '' at mIdx 8, so the longer match with mLen of 2 is selected, resulting in mTok of S and mDpA of 0.
  • Because mDpA is 0, Tok is set to mTok (S) and Dep remains 0.
  • Because Tok from the previous iteration and this iteration are both not _, everything up through and including the match is appended to ActiveSQL.
  • Because Tok from the previous iteration and this iteration are both not _, ClosedSQL is set to NULL.
  • Because ClosedSQL is NULL, NewSQL is left unmodified.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 10, since Tok from the previous iteration is S, the query continues matching the String Literal, and there are two possible matches, ' and ''.
  • The earliest match in the previous RemSQL is the same for both of them at mIdx 7, so the longer match with mLen of 2 is selected, resulting in mTok of S and mDpA of 0.
  • Because mDpA is 0, Tok is set to mTok (S) and Dep remains 0.
  • Because Tok from the previous iteration and this iteration are both not _, everything up through and including the match is appended to ActiveSQL.
  • Because Tok from the previous iteration and this iteration are both not _, ClosedSQL is set to NULL.
  • Because ClosedSQL is NULL, NewSQL is left unmodified.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 11, since Tok from the previous iteration is S, the query continues matching the String Literal, and there are two possible matches, ' and ''.
  • The earliest match in the previous RemSQL is the single ' at mIdx 1 with a mLen of 1, resulting in mTok of _ and mDpA of 0.
  • Because mDpA is 0, Tok is set to mTok (_) and Dep remains 0.
  • Because Tok from this iteration is _, ActiveSQL is set to an empty string.
  • Because Tok from this iteration is _, the previous ActiveSQL is concatenated with everything up through and including the match and the result placed in ClosedSQL.
  • Because ClosedSQL is not NULL and Tok from the previous iteration doesn't specify a Replacement, ClosedSQL is appended to NewSQL.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 12, since Tok from the previous iteration is _, the query resumes matching normal SQL and all 5 token types can match their start patterns.
  • The earliest match in the previous RemSQL is the -- that starts a Line Comment at mIdx 8 with a mLen of 2, resulting in mTok of L and mDpA of 0.
  • Because mDpA is 0, Tok is set to mTok (L) and Dep remains 0.
  • Because Tok from the previous iteration is _, only the contents of the match (--) are appended to ActiveSQL.
  • Because Tok from the previous iteration is _, all of the SQL preceding the match is placed in ClosedSQL.
  • Because ClosedSQL is not NULL and Tok from the previous iteration doesn't specify a Replacement, ClosedSQL is appended to NewSQL.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 13, since Tok from the previous iteration is L, the query continues matching the Line Comment, and there are two possible matches, CHAR(10) and End-of-File (indicated by EF).
  • The earliest match in the previous RemSQL is the CHAR(10) that terminates the Line Comment at mIdx 7 with a mLen of 1, resulting in mTok of _ and mDpA of 0.
  • Because mDpA is 0, Tok is set to mTok (_) and Dep remains 0.
  • Because Tok from this iteration is _, ActiveSQL is set to an empty string.
  • Because Tok from this iteration is _, the previous ActiveSQL is concatenated with everything up through and including the match and the result placed in ClosedSQL.
  • Because ClosedSQL is not NULL and Tok from the previous iteration specifies a Replacement, the Replacement (a CR/LF sequence) is appended to NewSQL.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 14, since Tok from the previous iteration is _, the query resumes matching normal SQL and all 5 token types can match their start patterns.
  • The earliest match in the previous RemSQL is the " that starts an ANSI Delimited Identifier at mIdx 6 with a mLen of 1, resulting in mTok of A and mDpA of 0.
  • Because mDpA is 0, Tok is set to mTok (A) and Dep remains 0.
  • Because Tok from the previous iteration is _, only the contents of the match (") are appended to ActiveSQL.
  • Because Tok from the previous iteration is _, all of the SQL preceding the match is placed in ClosedSQL.
  • Because ClosedSQL is not NULL and Tok from the previous iteration doesn't specify a Replacement, ClosedSQL is appended to NewSQL.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 15, since Tok from the previous iteration is A, the query continues matching the ANSI Delimited Identifier, and there are two possible matches, " and "".
  • The earliest match in the previous RemSQL is the same for both " and "" at mIdx 2, so the longer match with mLen of 2 is selected, resulting in mTok of A and mDpA of 0.
  • Because mDpA is 0, Tok is set to mTok (A) and Dep remains 0.
  • Because Tok from the previous iteration and this iteration are both not _, everything up through and including the match is appended to ActiveSQL.
  • Because Tok from the previous iteration and this iteration are both not _, ClosedSQL is set to NULL.
  • Because ClosedSQL is NULL, NewSQL is left unmodified.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 16, since Tok from the previous iteration is A, the query continues matching the ANSI Delimited Identifier, and there are two possible matches, " and "".
  • The earliest match in the previous RemSQL is the single " at mIdx 3 with a mLen of 1, resulting in mTok of _ and mDpA of 0.
  • Because mDpA is 0, Tok is set to mTok (_) and Dep remains 0.
  • Because Tok from this iteration is _, ActiveSQL is set to an empty string.
  • Because Tok from this iteration is _, the previous ActiveSQL is concatenated with everything up through and including the match and the result placed in ClosedSQL.
  • Because ClosedSQL is not NULL and Tok from the previous iteration doesn't specify a Replacement, ClosedSQL is appended to NewSQL.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL still has contents, IsF remains 0.

  • In Itr 17, since Tok from the previous iteration is _, the query resumes matching Normal SQL and all 5 token types can match their start patterns.
  • None of the 5 token types match their start pattern in RemSQL, which indicates Normal SQL through the end of RemSQL.
  • Because mTok is NULL, Tok is set to _ and Dep remains 0.
  • Because Tok from this iteration is _, ActiveSQL is set to an empty string.
  • Because Tok from this iteration is _, the previous ActiveSQL is concatenated with everything up through and including the match and the result placed in ClosedSQL.
  • Because ClosedSQL is not NULL and Tok from the previous iteration doesn't specify a Replacement, ClosedSQL is appended to NewSQL.
  • Everything up through and including the match is removed from RemSQL.
  • Because RemSQL is now an empty string, IsF is set to 1, which ends the recursion.

At this point, because IsF is 1, the contents of NewSQL are the complete decommented T-SQL. This example shows how the recursive CTE walks through the T-SQL, identifying instances of the five token types and handling them appropriately. In the next article, I will cover the implementation details for the decommenter, going through the exact logic that implements the behavior illustrated above.

Resources

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating