Performance Tips (Part 1)
There is no shortage of articles and books about improving the overall performance of a database, of SQL statements, and other types of performance (for example, http://www.mssqlcity.com/Tips/tipTSQL.htm). Most of the time these sources focus on general recommendations; I can also refer you to my previous article, “Overview of performance”. Here I would like to review some interesting aspects of performance with regard to Transact-SQL and batch process design. My material is based on real work I have done as a developer, data modeler, and process architect. Most, if not all, of the tips discussed here may sound obvious to some of you. However, judging by the number of widely used applications out there in which these guidelines were not followed, there are a lot of people who can benefit from reviewing these tips one more time.
This article will proceed with a number of handy examples, situations, and solutions that look at the performance of a query, stored procedure, or process. The described solutions are based on code changes or design changes. In all cases I am not discussing why the initial design or coding was done in a specific way, but simply trying to show some tips that may help you in your daily work. Some of the situations and solutions may look like simple or unnecessary changes, but I chose them because I have experienced problems with exactly those scenarios. You would be right to point out that some of the situations may never present any performance issues. To that I would answer: if you have no life (or car, or house) insurance, you may never experience a problem until the day an incident happens, and that does not mean you should not obtain insurance. So, let’s dive into the real scenarios.
Let’s start with a simple example of an infrequently used type of join: CROSS JOIN. Let’s assume there is a requirement to produce a Cartesian product of the tables involved in the join.
SELECT t1.col1, t1.col2, t2.col1, t2.col2
FROM Customer t1
CROSS JOIN GeneralInfo t2
WHERE t1.status > 0
While analyzing this statement I found out some additional details. First, the table GeneralInfo has only one row. My question to the developer was: “What if there is more than one row?” His answer: “Then only the first row should be in the cross join.” Second, the table GeneralInfo has no relationship to the table Customer, and the values in GeneralInfo are independent.
Obviously, the above SELECT statement would produce a wrong result if the table GeneralInfo did indeed have more than one row, and, in addition to that, the redundant join would just harm performance. A correct solution is to get the first row’s values into variables and use the variables in the SELECT statement. It can be done by using the TOP keyword or by specifying the PK value. For example:
declare @var1 varchar(50), @var2 varchar(50)

select @var1 = col1, @var2 = col2
from GeneralInfo
where PKvalue = FirstPKValue
-- or, another way:
-- select top 1 @var1 = col1, @var2 = col2 from GeneralInfo

select col1, col2, @var1, @var2
from Customer
where status > 0
I call the next tip “One statement vs. many statements in a batch or stored procedure”. In many cases when a stored procedure or a batch is developed, I observe a tendency to create one SQL statement no matter how big it is, how many joins it requires, or how many conditions there are in the WHERE clause. It very often leads to one of the following negative outcomes: the optimizer becomes “confused” and incapable of choosing the best execution plan, or the tables are kept locked longer than necessary. The solution is to split one query into a few smaller ones. There are many situations where this will reduce the running time of a procedure (process) significantly.
Let’s make an assumption that a query runs for 30 seconds, and that when the same query is split into 4 smaller statements, with partial outputs directed to a temporary table, each statement runs for 10 seconds. In general, this means that the process (stored procedure) runs longer (4*10 = 40 seconds vs. 30 seconds for the original query) but has less impact on each individual table. And it ultimately means that the procedure performs better, since performance is not only a function of the elapsed time a process (stored procedure) takes but, in a lot of cases, of the way the process interacts with other processes in a multi-user environment. In addition, I can assure you, based on my experience, that in many cases one complex SQL statement runs longer than the multiple statements with temporary tables (or table variables) it is split into.
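As a sketch of this splitting technique (the table and column names below are hypothetical, invented for illustration, and are not from any example in this article), a large multi-join query might be staged like this:

```sql
-- Hypothetical single-statement version:
--   select o.order_id, o.order_dt, c.cust_name, p.prod_name
--   from Orders o
--   join Customer c on c.cust_id = o.cust_id
--   join OrderLine l on l.order_id = o.order_id
--   join Product p on p.prod_id = l.prod_id
--   where o.order_dt >= '20240101' and c.region_cd = 'NE'

-- Split version: stage the small filtered set first, then join the
-- staged rows to the remaining tables. Each statement is simpler for
-- the optimizer, and locks are held on each table for a shorter time.
create table #ord (order_id int, order_dt datetime, cust_name varchar(50))

insert into #ord (order_id, order_dt, cust_name)
select o.order_id, o.order_dt, c.cust_name
from Orders o
join Customer c on c.cust_id = o.cust_id
where o.order_dt >= '20240101' and c.region_cd = 'NE'

select t.order_id, t.order_dt, t.cust_name, p.prod_name
from #ord t
join OrderLine l on l.order_id = t.order_id
join Product p on p.prod_id = l.prod_id

drop table #ord
```

Whether the split actually wins depends on the data volumes involved, so it is worth comparing the execution plans of both versions before committing to either.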
Let’s take a look at a rather common situation, when some historical data must be archived. In our databases, history tables don’t have any physical relationships to any other tables; they are just stand-alone tables. The following are the steps of a typical process many developers have faced at least once in their career:
1. Create an archiving table.
2. Insert records from the history table into the archiving table based on specific condition(s) (usually a date).
3. Delete records from the history table based on the same condition(s), or based on a join between the archiving and history tables.
Let’s say we need to back up the Call_Log_Event table, which has 3,000,000 rows. There is data for 120 days, and we should keep the rows inserted during the last 10 days. Based on the scenario above, suppose the Call_Log_Event table has the following structure:
create table Call_Log_Event (
  call_log_id int identity(1,1),
  event_desc varchar(255),
  event_cd char(6),
  eventdt datetime)

create table Call_Log_Event_Archive (
  archive_id int identity(1,1),
  call_log_id int,
  event_desc varchar(255),
  event_cd char(6),
  eventdt datetime,
  archivedt datetime)

insert into Call_Log_Event_Archive
  (call_log_id, event_desc, event_cd, eventdt, archivedt)
select call_log_id, event_desc, event_cd, eventdt, getdate()
from Call_Log_Event
where datediff(dd, eventdt, getdate()) > 10

delete from Call_Log_Event
where datediff(dd, eventdt, getdate()) > 10
Since the number of records to be archived is much larger than the number of records remaining in the history table(s) (otherwise it would be senseless to archive the data), the following logic is going to significantly reduce the processing time:
1. Create an archiving table.
2. Insert all records from the history table into the archiving table.
3. Truncate the history table
4. Insert the necessary number of records from the archiving table into the history table based on the specified conditions.
5. Delete the copied-back records from the archiving table based on the same condition, or based on a join between the archiving and history tables (if necessary).
create table Call_Log_Event_Archive (
  archive_id int identity(1,1),
  call_log_id int,
  event_desc varchar(255),
  event_cd char(6),
  eventdt datetime,
  archivedt datetime)

-- Step 2: copy all rows into the archiving table
insert into Call_Log_Event_Archive
  (call_log_id, event_desc, event_cd, eventdt, archivedt)
select call_log_id, event_desc, event_cd, eventdt, getdate()
from Call_Log_Event

-- Step 3: empty the history table (minimally logged, very fast)
truncate table Call_Log_Event

-- Step 4: copy the last 10 days back from the archive,
-- preserving the original identity values
set identity_insert Call_Log_Event on
insert into Call_Log_Event (call_log_id, event_desc, event_cd, eventdt)
select call_log_id, event_desc, event_cd, eventdt
from Call_Log_Event_Archive
where datediff(dd, eventdt, getdate()) <= 10
set identity_insert Call_Log_Event off

-- Step 5: remove the copied-back rows from the archive
delete from Call_Log_Event_Archive
where datediff(dd, eventdt, getdate()) <= 10
Steps 4 and 5 may be switched. I placed the insert as step 4 because step 5 may not always be required, but it is highly recommended in order to avoid future confusion.
Many sources suggest avoiding cursors whenever possible. The reasoning behind this is that cursors take a lot of SQL Server resources and can potentially impact performance negatively. In addition, you have to control cursor deallocation in the case of both successful and unsuccessful statement completion. Here are implementations of a solution with and without a cursor.
-- Solution with a cursor
begin
  declare @name varchar(50)

  declare cur1 cursor for
  select name from master..sysdatabases

  open cur1
  fetch next from cur1 into @name
  while @@fetch_status = 0
  begin
    -- process @name here
    fetch next from cur1 into @name
  end
  close cur1
  deallocate cur1
end

-- Solution without a cursor
begin
  declare @cur1 table (name varchar(50), tid int identity(1,1))
  declare @maxid int, @minid int, @name varchar(50)

  insert into @cur1 (name)
  select name from master..sysdatabases

  select @maxid = max(tid), @minid = min(tid) from @cur1
  while (@minid <= @maxid)
  begin
    select @name = name from @cur1 where tid = @minid
    -- process @name here
    set @minid = @minid + 1
  end
end
Very often I see procedures that start a transaction at the very top of the procedure and commit/rollback at the very end. I did talk about it in my previous article (http://www.sqlservercentral.com/columnists/lPeysakhovich/overviewofperformance.asp)
Now, let’s review a similar situation. In this case the process has a loop inside and some logic to control processing based on certain criteria. Let’s assume that the loop is a necessity of the process.
Create procedure proc1 as
Step1 – begin transaction
Step2 – loop
Step3 – make calculations based on special criteria
Step4 – insert/update row
Step5 – end loop
Step6 – commit/rollback
Is there a problem?
The procedure keeps the transaction active, and the tables locked, at run time much longer than it should. Even though this approach may be justified at times (for example, when one needs to preserve the data changes while the loop runs), in most situations this is not the case; for instance, when the verifications/calculations are made against values in tables that are not part of the modification statements. The second procedure has slightly modified logic. Even though some additional steps are required, it interferes less with other users because the locking time is shorter: the transaction itself is shorter.
Create procedure proc1 as
Step1 – create temporary table (or table variable) with the structure of the permanent table
Step2 – loop
Step3 – make calculations based on special criteria
Step4 – insert row into temporary storage
Step5 – end loop
Step6 – begin transaction
Step7 – insert/update rows in the permanent table in one statement
Step8 – commit/rollback
The result of the process will be the same (either all rows are inserted/updated or none of them), but the logic performs better. The only added complication is that the error handler must differentiate between errors that occur inside and outside the transaction context. This can be easily achieved by creating a transaction flag variable, for example.
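As a minimal sketch of such a transaction flag (the table, column, and variable names are illustrative, not from the original procedure, and the error handling uses the classic @@ERROR style of that era):

```sql
-- Hypothetical sketch: the loop does its work outside the transaction,
-- and a flag tells the error handler whether a rollback is needed.
create procedure proc1
as
begin
  declare @in_tran tinyint
  set @in_tran = 0

  declare @work table (col1 int, col2 varchar(50))

  -- Steps 2-5: loop / calculations filling the table variable.
  -- No transaction is open here, so no locks are held on the
  -- permanent table while the calculations run.
  insert into @work (col1, col2) select 1, 'a'

  -- Steps 6-8: short transaction around the single set-based statement
  begin transaction
  set @in_tran = 1

  insert into PermanentTable (col1, col2)
  select col1, col2 from @work

  if @@error <> 0
  begin
    -- the flag distinguishes an in-transaction failure,
    -- which requires a rollback, from an earlier one
    if @in_tran = 1 rollback transaction
    return 1
  end

  commit transaction
  set @in_tran = 0
  return 0
end
```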
Performance is not only a function of the elapsed time a process (stored procedure) takes; in a lot of cases it is also a function of the way the process interacts with other processes in a multi-user environment and of the load it produces on the server. This is especially true for batch processes, data load processes, and reports, which can run for a long time with multiple processes in parallel. Different techniques can be used to minimize time and improve the way processes affect users and/or other processes.