27 June 2017

29218 views

19 0

The Basics of Good T-SQL Coding Style – Part 3: Querying and Manipulating Data

SQL was designed to be a third-generation language, expressed in syntax close to real language, because it was designed to be easy for untrained people to use. Even so, there are ways of expressing SQL Queries and data manipulation in ways that make it easier for the database engine to turn into efficient action. and easier for your colleagues to understand. Robert Sheldon homes in on data querying and manipulation and makes suggestions for team standards in SQL Coding.

The series so far:

If you’ve been around SQL Server for a while, you know that writing T-SQL code is nothing to treat lightly. You must take into account a wide range of considerations to ensure that the code is both accurate and consistent and that it does not break applications or compromise security.

To help with this process, many teams create a set of coding standards that outline acceptable styles and usage, while providing a set of guidelines for addressing specific issues. Such standards not only help to streamline the development process, but also to ease the burden of updating, reviewing, and troubleshooting code, processes that in their own right can represent significant effort.

Unfortunately, it is no small task to develop a comprehensive set of T-SQL standards. The more you dig into the various issues, the more you uncover, resulting in an effort that is often more unwieldy and time-consuming than you had bargained for.

This series seeks to help tame the coding beast by covering many of the issues you should take into account when developing your own standards. The first article focuses on general coding practices that can apply to any type of T-SQL code, and the second article centers on the code used to define database objects. In this article, we cover coding issues related to querying and modifying SQL Server data.

Throughout the article, I provide examples that demonstrate some of the pitfalls developers can run into when working with T-SQL code. The examples are based on the following schema and tables, which I created on a local instance of SQL Server 2016:

CREATE SCHEMA Inventory;

CREATE TABLE Inventory.ProductCategories(

ProductCategoryID INT NOT NULL PRIMARY KEY,

ProductCategoryName NVARCHAR(50) NOT NULL);

CREATE TABLE Inventory.Products(

ProductID INT NOT NULL PRIMARY KEY,

ProductName NVARCHAR(100) NOT NULL,

ProductCategory INT NULL

REFERENCES Inventory.ProductCategories(ProductCategoryID));

You can create the tables in any database that’s convenient for you. The operations we perform are fairly basic and require few resources (although it’s always best to stay away from production servers). With that in mind, let’s get started.

Referencing database objects

We’ll begin by populating the Products and ProductCategories tables, using the following two INSERT statements:

INSERT INTO Inventory.ProductCategories

VALUES(1, 'Category 1'), (2, 'Category 2'), (3, 'Category 3');

INSERT INTO Inventory.Products

VALUES(101, 'Product 101', 1), (102, 'Product 102', 2), (103, 'Product 103', 3);

The INSERT statements already point to our first issue. When using these statements, you should specify the column names, even if you’re inserting data into all columns. Not only does this make it easier to verify that the data is targeting the correct columns, but it also ensures that statements are more resilient to changes to the table definition. For instance, if a column were added to the Products table, the second INSERT statement would fail.

We need to be just as specific when querying data, making sure all object references are complete. As an example, consider the following SELECT statement:

1	SELECT * FROM Products;

Notice that the statement provides no schema when referencing the table name. Unless the table resides in the default schema, the database engine will return an error stating that Products is an invalid object. You should include the schema whether or not the table (or other object) resides in the default schema, and you should do so in any statement that that references the object, whether a SELECT, DELETE, UPDATE, or other type of statement.

In addition, if you’re referencing an object in a remote database, you must provide the fully qualified name, which includes the server and database, along with the schema and object. For example, the following fully qualified name references the Products table in the Inventory schema, which is in the StoreDB database running on the Server01 SQL Server instance:

1	Server01.StoreDB.Inventory.Products

Another issue with the above SELECT statement is the use of the asterisk wildcard in the select list. Because the wildcard represents all columns, it is tempting to use it when you want to return every column in a table or view. However, table and view definitions can change and you can suddenly find yourself with broken applications. Except for the occasional ad hoc query, you should avoid using a wildcard in this way and specify each column in the select list.

Now let’s look at a SELECT statement that raises several other coding issues:

SELECT ProductID, ProductName, ProductCategoryName

FROM Inventory.Products INNER JOIN Inventory.ProductCategories

ON ProductCategory = ProductCategoryID

ORDER BY 1 DESC;

This time we’re joining the Products and ProductCategories tables, returning values from three columns in those tables, and sorting the results by the ProductID column, which is represented by a column ID of 1.

When querying multiple tables in this way, you should assign an alias to each table to make it easier to reference that table in other parts of the code. You should then use those aliases to qualify any column references, something we fail to do here. We can get away with not qualifying the column names because no duplicate names exist between the two tables, but that is not always the case. And even if there are no duplicate columns, you should still qualify the names so it is clear where the data is coming from. This makes the code more readable and easier to troubleshoot.

Another issue with the SELECT statement is the use of a column number in the ORDER BY clause, rather than the column name. This can cause confusion and lead to errors. For example, someone might change the order of the columns in the select list without accounting for the numbers used in the ORDER BY clause.

Accessing SQL Server data

When developing or reviewing T-SQL code that manipulates data, you can encounter a variety of issues related to stylistic consistency, code accuracy, and performance. In the next article, we’ll cover performance-related issues. Our focus here is on style and accuracy, but before we get started, we’ll add more sample data to the Products table:

INSERT INTO Inventory.Products(ProductID, ProductName, ProductCategory)

VALUES(104, 'Product 104', 1), (105, 'Product 105', 2), (106, 'Product 106', NULL),

(107, 'Product 107', 1), (108, 'Product 108', NULL), (109, 'Product 109', 3);

After adding the data, let’s create a SELECT statement that joins the Products and ProductCategories tables:

SELECT TOP(6) p.ProductID, p.ProductName,

CASE pc.ProductCategoryName

WHEN 'Category 1' THEN 'Cat1'

WHEN 'Category 2' THEN 'Cat2'

WHEN 'Category 3' THEN 'Cat3'

END AS ProductCategory,

p.ProductName + '_' + p.ProductID AS ProductAltID

FROM Inventory.Products p WITH(NOLOCK)

INNER JOIN Inventory.ProductCategories pc WITH(NOLOCK)

ON p.ProductCategory = pc.ProductCategoryID;

Unfortunately, if we try to run this statement, we’ll receive the following conversion error:

1	Conversion failed when converting the varchar value 'Product 101_' to data type int.

The problem is that we’re trying to concatenate the underscore and ProductName column, defined with the NVARCHAR data type, with the ProductID column, defined with an INT data type. Because we start with the string values, you might assume that the database engine will implicitly convert the numeric value to a string. However, the INT data type takes precedence of the NVARCHAR data type, so the database engine instead tries to convert the string to a numeric value, which results in an error.

Data type precedence is only one of many issues you can run into when converting data, which is one reason it’s a good idea to explicitly convert data. You can find more information about data-conversion issues in my article How to Get SQL Server Data-Conversion Horribly Wrong.

The above SELECT statement also raises a number of other issues. For example, it includes the TOP operator in the select list, but does not include an ORDER BY clause, making the results more unpredictable.

The statement also contains a CASE expression that does not include an ELSE block. Although the ELSE block is not required, it is often a good idea to include one to handle unexpected values. At the very least, you should keep this issue in mind when reviewing your code.

Finally, the SELECT statement uses the NOLOCK table hint when referencing each table. The table hint is equivalent to the READ UNCOMMITTED isolation level, which allows a statement to read rows that have been modified by other transactions but not yet committed. Developers will sometimes use NOLOCK or READ UNCOMMITTED to improve query performance, but this can result in dirty reads, so you need to be cautious when using either one.

That said, if you are okay with risking dirty reads, READ UNCOMMITTED is generally considered the better option of the two because it provides more precise control over your isolation levels.

You also want to be careful about your use of subqueries. Although most issues with subqueries are related to performance (which we’ll cover in the next article), subqueries can also present other issues. For example, the following SELECT statement uses a subquery in the WHERE clause, along with the NOT IN operator:

SELECT ProductID, ProductName, ProductCategory

FROM Inventory.Products

WHERE ProductCategory NOT IN (SELECT ProductCategory

FROM Inventory.Products);

You need to be cautious when using the IN or NOT IN operator if the subquery’s source data contains NULL values. In this case, the subquery itself returns both numeric and NULL values, but the outer SELECT statement returns an empty data set. To get the data you need, you should consider using a NOT EXISTS operator instead of NOT IN or recast the statement as a left outer join.

You should also ensure your subquery is written correctly and returns the right data. For example, the subquery in the following SELECT statement can return more than one value and includes an ORDER BY clause:

SELECT ProductID, ProductName, ProductCategory

FROM Inventory.Products

WHERE ProductCategory = (SELECT ProductCategory

FROM Inventory.Products ORDER BY ProductCategory);

If a subquery is supposed to return a scalar value, you better be sure that’s what it will always do. In addition, be sure not to include an ORDER BY clause in a subquery unless it also includes the TOP operator in the select list.

There are, of course, numerous other concerns to be aware of when working with subqueries. For more information, see my article Subqueries in SQL Server.

Filtering data

In the previous two SELECT statements, we used the subqueries in the WHERE clause as part of filtering the data. Although these examples focused on the use of subqueries, there are a number of other issues you can run into when filtering data.

For example, you have to be careful when using logical operators to define multiple conditions in your WHERE clause, as in the following SELECT statement:

SELECT ProductID, ProductName, ProductCategory

FROM Inventory.Products

WHERE ProductID < 103 OR ProductID > 107 OR ProductID = 106

AND ProductCategory IS NOT NULL;

The SELECT statement uses both the OR and the AND logical operators, giving us the following results:

ProductID	ProductName	ProductCategory
101	Product 101	1
102	Product 102	2
108	Product 108	NULL
109	Product 109	3

In this case, we’ve simply specified the various operators and expressions without trying to control the logic, essentially taking any results the database engine wants to feed us. In this case, the database engine returns products that meet either of the first two conditions or products that have a ProductID value of 106 and a ProductCategory value that is not NULL.

When we mix logical operators, we must carefully control how the conditions are applied. What we’re really after here is to return the products that meet any of the first three conditions and eliminate any rows with a ProductCategory value of NULL. To do so, we can enclose the first three expressions in parentheses:

SELECT ProductID, ProductName, ProductCategory

FROM Inventory.Products

WHERE (ProductID < 103 OR ProductID > 107 OR ProductID = 106)

AND ProductCategory IS NOT NULL;

Now we get the results we want, without rows that contain a ProductCategory value of NULL:

ProductID	ProductName	ProductCategory
101	Product 101	1
102	Product 102	2
109	Product 109	3

While we’re on the topic of WHERE clauses, don’t forget to include them in your DELETE and UPDATE statements, unless you don’t mind incurring the wrath of everyone around you. For example, a statement such as the following will delete every row in the Products table:

1	DELETE Inventory.Products;

Also watch for statements that inadvertently create Cartesian products because they include no WHERE clause. For example, the following join is based on pre-ANSI SQL-92 standards, in which the join condition is normally defined in the WHERE clause:

1 2	SELECT p.ProductID, p.ProductName, pc.ProductCategoryName FROM Inventory.Products p, Inventory.ProductCategories pc;

Because there is no WHERE clause to limit the results, the statement will return 27 rows, with each row in the first table matched to every row in the second table. This number might not be a big deal here, but what if your tables contain millions of records?

The same thing happens when we do a cross join without specifying a WHERE clause:

1 2	SELECT p.ProductID, p.ProductName, pc.ProductCategoryName FROM Inventory.Products p CROSS JOIN Inventory.ProductCategories pc;

Again, we end up with 27 rows, but if you were querying massive tables in a production environment, you could bring your system to a standstill.

Working with table structures

When developing your coding standards, you should also address the use of temporary tables, table variables, and common table expressions (CTEs), providing guidelines that explain which type to use when. For example, in the following code, the first statement creates a temporary table, and the second statement uses that table to join the Products table to the temporary table:

SELECT p.ProductCategory AS CategoryID,

CASE

WHEN pc.ProductCategoryName IS NULL THEN 'No category'

ELSE pc.ProductCategoryName

END AS CategoryName,

COUNT(p.ProductID) AS ProductCount

INTO #ProductCounts

FROM Inventory.Products p

LEFT JOIN Inventory.ProductCategories pc

ON p.ProductCategory = pc.ProductCategoryID

GROUP BY p.ProductCategory, pc.ProductCategoryName;

SELECT pr.ProductID, pr.ProductName, ct.CategoryName

FROM Inventory.Products pr

LEFT JOIN #ProductCounts ct

ON pr.ProductCategory = ct.CategoryID

WHERE ct.ProductCount > 2;

When we run these statements, we get the following results:

ProductID	ProductName	CategoryName
101	Product 101	Category 1
104	Product 104	Category 1
107	Product 107	Category 1

Whenever you implement a temporary table structure in this way, you should evaluate whether the best solution is to use a temporary table, table variable, or CTE, any of which will work in this case. However, you need to clearly understand the differences between them so you can decide which is best in certain circumstances.

For example, temporary tables and table variables are both written to tempdb, so there’s additional overhead that comes with them. CTEs are not written to tempdb. In addition, you can create temporary tables at a global or local scope, and the table persists until it is explicitly dropped or the session is terminated. A table variable exists only within the scope of the current batch, stored procedure, or user-defined function. A CTE exists only within the scope of the statement it precedes. You can also create indexes on temporary tables, but not on table variables or CTEs.

There are, in fact, a number of differences between the three. To a certain degree, you can think of a temporary table more like a regular table, a CTE closer to a view, and a table variable more like other variables. But these are generalities. Know that there is a lot more to them than just that.

Working with transactions

A transaction is a sequence of T-SQL statements that are executed as a single logical unit. You can explicitly define a transaction by enclosing the T-SQL statements within a BEGIN TRANSACTION…COMMIT TRANSACTION code block, as shown in the following stored procedure definition:

CREATE PROCEDURE Inventory.AddProduct

@ProductID INT,

@ProductName NVARCHAR(100) = '',

@ProductCategory INT = NULL

SET NOCOUNT ON

BEGIN TRANSACTION

INSERT INTO Inventory.Products(ProductID, ProductName, ProductCategory)

VALUES(@ProductID, @ProductName, @ProductCategory)

COMMIT TRANSACTION;

Although the database engine will create and run the stored procedure with no problem, the definition itself is missing two important components: rollback logic and error handling.

When defining a transaction, you should take advantage of the T-SQL elements that let you effectively control that transaction, including the ability to roll back all or part of the transaction. You can follow a BEGIN TRANSACTION statement with one or more ROLLBACK TRANSACTION statements that let you return to the beginning of the transaction or to specific savepoints within the transaction. You can also nest transactions to better control execution logic.

You should also integrate error handling into this logic so you can take specific actions and log relevant information should an error occur. For this, you can use a TRY…CATCH block that controls the statement flow in the event of an error. For an overview of error handing in SQL Server, you might want to check out my article Handling Errors in SQL Server 2012. Most of the basics have remained unchanged through SQL Server 2016.

Running the EXECUTE statement

Now that we’ve introduced stored procedures, we should also touch upon the EXECUTE statement. In the following example, the EXECUTE statement calls the procedure from the previous example, passing in two parameter values:

1	EXECUTE Inventory.AddProduct 110, 'Product 110';

The fact that we’re passing in only two parameter values is important to note because the stored procedure is defined with three. The procedure will still run, but it will insert a NULL value into the ProductCategory column. Even if this is what we want, we should still explicitly include the value when calling the procedure so there is no doubt about the intent.

Related to this issue is the importance of including the parameter name when providing the value. This helps to ensure that the intended value is being mapped to the correct parameter, making the code easier to review and to avoid unnecessary errors. It will also help to ensure you’re not including extra values when calling the procedure or passing in a value that is not consistent with a parameter’s data type (although it doesn’t actual prevent either possibility).

While we’re on the topic of the EXECUTE statement, keep in mind that you should not use it to call dynamic SQL, whether within a stored procedure or directly, as in the following example:

DECLARE @sql VARCHAR(1000);

DECLARE @id INT;

SET @id = 101;

SET @sql = 'SELECT ProductName FROM Inventory.Products

WHERE ProductID = ' + CAST(@id AS VARCHAR(10)) + ';';

EXECUTE (@sql);

Although the database engine lets us execute dynamic SQL in this way, doing so can make your system susceptible to SQL injection attacks because user input can run directly against the database. For this reason, many database folks recommend that you avoid dynamic SQL altogether.

However, if you can’t get around implementing dynamic SQL, you should use the sp_executesql system stored procedure to execute the code, rather than an EXECUTE statement. The sp_executesql stored procedure also supports strongly typed variables and tends to be more efficient.

Getting the code right

Clearly, you must take into account a number of factors when developing T-SQL standards that address issues related to manipulating SQL Server data. As with any aspect of T-SQL coding, you want to create definitive guidelines that will help team members develop readable, consistent, and accurate code that does not break applications or introduce security risks.

As with the previous articles in this series, what we’ve covered here is meant only to provide a starting point to help you to understand the types of issues to address when creating your own standards. You should include whatever specifics you believe relevant to your organization’s development efforts, keeping in mind that your standards are a work-in-progress, evolving as the team changes and as SQL Server and T-SQL are modified.

The Basics of Good T-SQL Coding Style – Part 3: Querying and Manipulating Data

The series so far:

Referencing database objects

Accessing SQL Server data

Filtering data

Working with table structures

Working with transactions

Running the EXECUTE statement

Getting the code right

Subscribe for more articles

Rate this article

Robert Sheldon

Related articles

The Basics of Good T-SQL Coding Style – Part 4: Performance

Snake draft sorting in SQL Server, part 2

Tags