T-SQL Decommenter Part IV

Building a complicated query like this and ensuring that it has the correct behavior may not be easy, but it is achievable when guided by a comprehensive test suite. When I set out to write this query, I started by developing a test suite and only then started writing code. As I explored the T-SQL syntax, I continued to add test cases to validate edge cases. In general, I prefer to err on the side of having too many tests rather than too few, especially for code where there are wide variety of permutations possible for the input. The full test suite along with the decommenter rigged to run against it is in the attached file Decommenter_Test.sql.

The tests in the suite are defined in the temp table #tests, which receives two populations of tests. The columns in #tests are:

ID, a unique identifier for each test
OrigSQL, the SQL to be decommented
ValidSQL, the correctly decommented SQL against which the query is validated
MaxIter, the iteration limit for the test.

The first INSERT into #tests contains hundreds of tests for a wide variety of scenarios. Here is a drastically redacted sample of this test population:

INSERT INTO #tests (ID, OrigSQL, ValidSQL, MaxIter)
SELECT  ID,
  CAST(OrigSQL AS nvarchar(max)) AS OrigSQL,
  CAST(ValidSQL AS nvarchar(max)) AS ValidSQL,
  100 AS MaxIter
FROM ( VALUES
    (0, NULL, NULL),
    (1, '', ''),
    (2, '  ', '  '),
    (3, 'S', 'S'),
    (108, ' /*Foo*/', '  '),
    (134, '/* Foo */'+CHAR(13)+CHAR(10)+
          'SELECT *'+CHAR(13)+CHAR(10)+
          '/* Bar */'+CHAR(13)+CHAR(10)+
          'FROM Bar',
        ' '+CHAR(13)+CHAR(10)+
          'SELECT *'+CHAR(13)+CHAR(10)+
          ' '+CHAR(13)+CHAR(10)+
          'FROM Bar' ),
    (203, '--'+CHAR(13)+CHAR(10)+'-Foo-', CHAR(13)+CHAR(10)+'-Foo-'),
    (310, '[]/**/', '[] '),
    (311, '[/**/]', '[/**/]'),
    (312, '/**/[]', ' []'),
    (416, '''/*Foo*/', '''/*Foo*/'),
    (530, '"/**/"""', '"/**/"""'),
    (950, REPLICATE('/* */ ', 50), REPLICATE('  ', 49) + '/* */ '),

    (NULL, NULL, NULL)
  ) AS V(ID, OrigSQL, ValidSQL)
WHERE ID IS NOT NULL;

This batch of tests focuses on correctly and completely decommenting the T-SQL, so it uses a MaxIter of 100 for all of the tests. The one exception to completely decommenting the T-SQL is the final test, 950, which is used to validate that MaxIter is indeed set to 100 and that MAXRECURSION at the end of the query is not set below 100. Also note that I'm lazy and don't like having to remember to remove the comma on the last test, so the final (NULL, NULL, NULL) is the record without a trailing comma and is excluded by the WHERE clause.

The second batch of tests focuses on how the decommenter degrades when the iteration limit is reached. All of the tests in this batch share the same OrigSQL, but as MaxIter drops the ValidSQL begins to include comments. At MaxIter 19 and above, all of the comments are removed. As MaxIter drops to 18 and lower, the decommenter begins to leave comments at the end of the output as it runs out of iterations to remove the comments.

The test harness consists of the decommenter logic hooked up with #tests as the source for the anchor rows. A fairly comprehensive list of columns is included in the Decomment recursive CTE to support debugging. At the end of the whole query, all output rows from Decomment are checked to see if the test passes.

FROM (  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY ID ORDER BY Iter DESC) AS RowN
  FROM Decomment
    ) AS D
  CROSS APPLY ( SELECT
    CASE
        WHEN D.RowN = 1
      AND D.IsFinished = 1
      AND D.Iter <= D.MaxIter
      AND ( CAST(D.NewSQL AS varbinary(max)) = CAST(D.ValidSQL AS varbinary(max))
        OR ( D.NewSQL IS NULL AND D.ValidSQL IS NULL ) 
          ) THEN 1
        WHEN D.RowN > 1
      AND D.IsFinished = 0
      AND D.Iter <= D.MaxIter THEN 1
        ELSE 0
    END AS TestPassed
  ) AS C1

The first thing is to determine whether a given row is the final iteration. This is handled by using ROW_NUMBER() on the Decomment output, partitioned by ID and ordered on Iter in descending order. When RowN is 1, it is the final iteration for that test.

The logic for determining TestPassed has two main forks, one for the final iteration for a given test (RowN = 1) and one for all of the prior iterations.

For the final iteration, all of the following must be true for the test to pass: IsFinished must be 1, Iter must not exceed MaxIter, and D.NewSQL and D.ValidSQL must be identical (including trailing spaces, case, and so forth, checked using a cast to varbinary(max)), or both must be NULL.

For the prior iterations, the test is easier: IsFinished must be 0 and Iter must not exceed MaxIter.

The WHERE clause isolates the rows where the test did not pass (C1.TestPassed = 0). There is a commented out WHERE clause illustrating debugging. If a test fails, substitute the failing test ID into the alternative WHERE clause and toggle which clause is commented out. All of the iterations for the failing test will be displayed, making it easier to understand the issue. This can also be used to investigate how the decommenter iteratively processes various tests.

The Decommenter_Test.sql file is the definitive version of the decommenter - it's what I work from whenever I'm working on the decommenter code, because I need to know immediately if a change breaks any tests. However, to be useful, I want a decommenter that runs against sys.sql_modules. That version is is SQL_Search_Decommenter.sql, which was developed from Decommenter_Test.sql. Whenever I update Decommenter_Test.sql, I compare the two files and then make similar changes in SQL_Search_Decommenter.sql to bring it into line. There are a number of differences between the two, both to adapt the code to sys.sql_modules and to improve execution performance.

First, SQL_Search_Decommenter.sql starts with a DECLARE block. I use the DECLARE block to keep the commonly adjusted query parameters in one place at the start of the query, making them easy to set.

DECLARE
  @NamePat nvarchar(1000) = '%',
  @SQLPat nvarchar(1000) = '%',
  @DecommentedOnly bit = 0,
  @MaxIter int = 32767;

@NamePat: Matches the 2-part name for the object (e.g. to match v_Foo*, either use '%v_Foo%' or use 'dbo.v_Foo%', or to be precise 'dbo.v[_]Foo%'). Leave as '%' to skip filtering by object name.
@SQLPat: Matches the SQL. For instance, to search for tbl_Foo, use '%tbl_Foo%' (or, to be precise, '%tbl[_]Foo%'). By default, this will return objects that match in the commented SQL, whether or not the decommented SQL matches, with the DecommentedHasMatch column indicating whether a match was found in the decommented SQL.
@DecommentedOnly: Switch from 0 to 1 to return only objects that match in the decommented SQL.
@MaxIter: Controls how many iterations to perform in the recursive query when decommenting. The Iter column in the output indicates how many iterations were required to decomment the code. Setting this to a lower value will improve performance for lengthy objects that have lots of comments, string literals, and delimited (both " and [) identifiers, but at the expense of leaving comments in the decommented code once the limit is reached.

Next is the SourceSQL Common Table Element (CTE), which handles retrieving module definitions and their two-part object names. It specifies name filtering using @NamePat and filtering of the commented SQL using @SQLPat. Note that there is a subtle bug here caused by matching the commented SQL prior to decommenting. Under some circumstances, @SQLPat can match a string that shows up only in the decommented SQL. For instance, the pattern '%Long_Column_Name AS ShtColNm%' wouldn't match the commented SQL 'SELECT Long_Column_Name/**/AS ShtColNm', but would match it after the decommenter replaces /**/ with a single space. The query could defer all @SQLPat filtering until after all of sys.sql_modules has been decommented, but that would have a substantial negative impact on performance. The query could be further complicated to detect patterns that might suffer from this issue (hint - they all have one or more of the Replacement strings in them) and only defer matching in that scenario, coupled of course with OPTION (RECOMPILE). I think I'll live with the subtle bug and just avoid triggering it!

SourceSQL AS (
  SELECT  O.[object_id] AS ID,
    C1.ObjName,
    SM.[definition] AS OrigSQL
  FROM sys.sql_modules AS SM
    JOIN sys.objects AS O
      ON SM.[object_id] = O.[object_id]
    JOIN sys.schemas AS S
      ON O.[schema_id] = S.[schema_id]
    CROSS APPLY ( SELECT
      S.[name] + N'.' + O.[name] AS ObjName
    ) AS C1
  WHERE C1.ObjName LIKE @NamePat
    AND SM.[definition] LIKE @SQLPat
),

This is straight forward, except for the use of CROSS APPLY to avoid redundant calculation of ObjName. It has negligible performance impact, but I dislike repeating myself in code.

The TokTypes and Matchers CTEs are identical to those in Decommenter_Test.sql. There are some columns left out of the SELECT clauses of the Decomment CTE. As explained in the previous article, only the columns up through and including RemSQL are required for the Decomment CTE to function. All of the subsequent columns are only included in Decomment_Test.sql to facilitate debugging, and so they are omitted in SQL_Search_Decommenter.sql in order to reduce the size of the Worktable and improve performance. Finally, the one reference in the Decomment CTE to Old.MaxIter is replaced with @MaxIter (since there is no need to control the iteration limit on a per-test basis).

An additional CTE, Decommented, is added near the end to select only the final iteration for each anchor row, to use ID to look up the ObjName and Definition columns from the SourceSQL CTE, and to provide a clean interface for accessing the output from Decomment.

Decommented AS (
  SELECT  D.ID,
    S.ObjName,
    CASE WHEN D.NewSQL LIKE @SQLPat THEN 1 ELSE 0 END AS DecommentedHasMatch,
    S.OrigSQL AS [Definition],
    NewSQL AS Decommented,
    Iter
  FROM Decomment AS D
    JOIN SourceSQL AS S
      ON D.ID = S.ID
  WHERE D.IsFinished = 1
)

Finally, there is the actual SELECT clause that generates the result set:

SELECT  ObjName,
  DecommentedHasMatch,
  CASE WHEN DATALENGTH([Definition]) <= 65535 * 2 THEN [Definition]
    ELSE NULL END AS [Definition],
  CASE WHEN DATALENGTH([Definition]) <= 65535 * 2 THEN Decommented
    ELSE NULL END AS Decommented,
  CASE WHEN DATALENGTH([Definition]) > 65535 * 2 THEN
      CAST(N'<?def --' + NCHAR(13)+NCHAR(10) + [Definition] +
        NCHAR(13)+NCHAR(10) + N'--?>' AS xml)
    ELSE NULL END AS Definition_in_XML,
  CASE WHEN DATALENGTH([Definition]) > 65535 * 2 THEN
      CAST(N'<?def --' + NCHAR(13)+NCHAR(10) + Decommented + 
        NCHAR(13)+NCHAR(10) + N'--?>' AS xml)
    ELSE NULL END AS Decommented_in_XML,
  Iter
FROM Decommented
WHERE (@DecommentedOnly = 0 OR DecommentedHasMatch = 1)
ORDER BY ObjName
OPTION (MAXRECURSION 32767);

Because SSMS won't retrieve more than 65,535 characters from non-XML text blocks, if the DATALENGTH (which counts trailing spaces) for [Definition] is greater than twice 65,535 (nvarchar values have two bytes per character), then [Definition] and Decommented are both output as NULL and the SQL is embedded in XML columns (since SSMS can retrieve XML of unlimited size).
The filter on DecommentedHasMatch is controlled by @DecommentedOnly.
MAXRECURSION is set to 32767 since @MaxIter will limit the recursion.

This completes the discussion of the T-SQL Decommenter. I hope that you find the code useful and that you learned something along the way!

Resources

Rate

Share

Categories

Share

Rate

T-SQL Decommenter Part IV

Resources

Rate

Share

Categories

Share

Rate

Related content

A Normalization Primer

Reaching the Outer Limits

Working with email addresses in SQL Server!

Querying Responses

Storing Checkbox Responses as Integers