21 February 2012

28162 views

65 0

The Road to Professional Database Development: Set-Based Thinking

Under the pseudonym of 'SwePeso', Peter Larsson is famous on SQL forums for the amazing performance he can get from SQL. How does he do it? In the first of a series of articles, Peter explains his secrets.

The single most common barrier to writing efficient database code, for a developer versed in a typical client language (such as C#), is an inability to leave behind the habit of working on one row at a time, in favor of working with a set (a collection of rows).

What is the difference? Let’s examine a very simple example of summing up 5 rows in a table. The following code creates a table variable (@Sample) and populates it with the rows of data.

DECLARE @Sample TABLE ( Data INT )

INSERT @Sample

( Data )

VALUES ( 1 ),

( 2 ),

( - 4 ),

( 5 ),

( 8 )

SELECT Data

FROM @Sample

For a .NET developer this pseudo-code would be the way to perform this simple sum:

Dim intS AS Integer = 0, 'Initialise the sum to zero

objR AS Row

For Each objR In coll.Rows

intS += objR.Data

Debug.Print intS

This is the standard approach for a procedural language; we not only tell the compiler what we want (the sum of all objects) but also how we want to approach the objects when calculating the sum. In this example, the coll variable holds a collection with objects of type Row. The For-Next loop iterates through all the objects and, for each object, fetches the current value from the Data property and adds the value to the intS variable. In other words the code operates on one row at a time.

T-SQL is a declarative language which means that the programmer is supposed to tell the database server what is required but let SQL Server decide the best way to finalize the task. However, many programmers force the procedural approach on SQL Server, with code such as the following:

DECLARE curSum CURSOR FOR

SELECT Data

FROM @Sample

DECLARE @s INT ,

@Total INT = 0

OPEN curSum

FETCH NEXT

FROM curSum

INTO @s

WHILE @@FETCH_STATUS = 0

BEGIN

SET @Total + = @s

FETCH NEXT

FROM curSum

INTO @s

END

CLOSE curSum

DEALLOCATE curSum

SELECT @Total

Even though this code is written in T-SQL, it still operates on one object (row) at any given time. This approach is commonly referred to as Row-by-Agonizing Row (RBAR). If we increase the number of rows tenfold, the time to complete the task would increase by a factor of ten; for a very large table (100 million rows or more) this sum could easily take hours to complete.

So how do you do this in a set-based way? Luckily for us, T-SQL has an aggregate function available to do the job for us.

1 2	SELECT SUM ( Data ) FROM @Sample

The following table compares the performance of the cursor method, versus the set-based method, for an increasing numbers of rows.

Rows	CURSOR		SET BASED		Performance Factor
	DURATION (mS)	READS	DURATION	READS	DURATION	READS
1k	47	4009	0	7	–	953
10k	492	40004	3	42	164	953
100k	5189	400389	28	388	179	1039
1000k	51358	4003848	286	3847	180	1041

As you can see, both solutions scales linearly i.e. when we increase the number of rows by a factor of ten, we increase the query duration by roughly the same factor. However, there is one big difference, and that is that the set-based solution is about 180 times faster, and puts only about 1/1000th of the read pressure on the server!

So why is the performance difference so vast? We’ll discuss this in full detail in a later article in this series, but in essence, the set-based approach allows SQL Server to return the all the required rows very efficiently, fully exploiting the underlying data storage mechanisms. The internal structure where the data is stored in SQL Server is called a Page and it defines the minimum amount of data read or written to the database, regardless of whether you want all rows in that Page or only one row. So, with a set-based approach, we only need to fetch a page once to get all x rows stored within, whereas with the RBAR approach, we must re-fetch the page x times. We’ll also see in a later article how, with appropriate indexes in place, we can speed up this process still further.

This paradigm shift is all there is to it! SQL Server has some built-in aggregate functions to accomplish the task of summing a number of rows, and many others. The key difference is that you, in set-based thinking, instead of working with individual rows, we work with sets which can contain any number of rows.

To become a professional database developer you have to abandon the old habit of thinking what you want to do with each row, and start thinking instead about what you want to do with the set, by manipulating the columns. Think vertical instead of horizontal!

If we take this example a step further, considering the task of multiplying together all the rows, we will encounter the second biggest barrier to becoming a professional database developer, and that is math skills.

For a .NET developer, this pseudo code would be the way to go:

Dim intP AS Integer = 1, ' Initialise the product to 1

objR AS Row

For Each objR In coll.Rows

intP *= objR.Data

Debug.Print intP

However, we just established that we no longer want to work with single rows, right? Unfortunately, however, while SQL Server provides a handy SUM aggregate function, it does not provide a MULTIPLY aggregate function. So, how do we perform this task in a set-based fashion? This is where your math skills become handy.

1	A * B * C equals e^(ln(A)+ln(B)+ln(C))

Having transformed the multiplication into a sum, it’s easier to see that the final query should look something like this:

1 2	SELECT EXP ( SUM ( LOG ( Data ))) FROM @Sample

But we still have an issue here; logarithmic operations are only allowed on positive values, and not including zero (logarithmic operations on zero and negative values are undefined in natural numbers and will give an Invalid Floating Point operation in T-SQL), and multiplication with zero, under any circumstances, is equal to zero. Also, remember that a multiplication with an odd number of negative numbers results in a negative product and multiplication with an even number of negative number results in a positive product. Here is one example of how to solve this problem:

IF EXISTS( SELECT * FROM @Sample WHERE Data = 0 )

SELECT 0.0E -- If any value is zero, the total product is zero.

ELSE

SELECT CASE IsNegativeProduct

WHEN 1 THEN - EXP ( theSum )

ELSE EXP ( theSum )

END

FROM (

SELECT SUM ( LOG ( ABS ( Data ))), -- Sum all exponents

-- Keep track if product is negative or positive

SUM ( CASE WHEN Data < 0 THEN 1 ELSE 0 END ) % 2

FROM @Sample

) AS d ( theSum , IsNegativeProduct )

The main reason for the checks in the above code is the limitations of the LOG function. So, can we complete the task in another fashion and still produce the correct result? Yes, we can:

DECLARE @p INT = 1

UPDATE @Sample

SET @p * = Data

SELECT @p

This code, which looks very much like the C# pseudo code, produces the right answer, a value of -320, but how did it work? The difference is that we tell T-SQL to multiply the values but we do not force T-SQL to retrieve the rows in any order. Another reason we can use the code above is due to the nature of a product. If we want the product of A and B (a * b), the reverse is also valid (b * a); it doesn’t matter which order we take the values in the multiplication, as long they are of same data type.

This leads us to the third ‘light bulb’ moment on the road to becoming a professional database developer: understanding that rows in a database have no order. This is very different to the way .NET developers think. A collection of objects is normally a double-linked list, where you can traverse forward and backward. Rows in a table can be logically ordered by assigning a sequence number or a sequence denoted by date and time, but you can never rely on the physical order of rows in a set.

Summary

Think in sets and not in single rows
Brush up your math skills as much as you possibly can
Rows have no determined physical order

Coming Next….

Efficient storage and retrieval: Normalization rules.

The Road to Professional Database Development: Set-Based Thinking

Summary

Subscribe for more articles

Rate this article

Peter Larsson

Related articles

Backing Up MySQL Part 7: MySQL Enterprise Backup

Database Subsetting and Data Extraction

Why Developers Need to Understand Execution Plans

Tags