SQLServerCentral Article

The Joy of Numbers

,

Introduction

"You must unlearn what you have learned" - Yoda

What exactly is a numbers table and more importantly, what can it do for you? To put it simply, a numbers table is

cheap (read: "free") tool that should be part of every SQL developer's toolkit. In its simplest form, a

numbers table is a table with one INT column. That single column, which also serves as

the primary key and clustered index, contains the natural (or "counting") numbers starting with 1 and going

up to some specified upper limit. We'll begin by describing how to create a numbers table, and then move on to a

discussion of why they are so useful. As always, I'll throw in some sample code (for SQL Server 2000 and

SQL Server 2005) so you can play along at home.

Make Your Own Numbers Table!

Let me start by saying that every database should have its own numbers table. They're very useful (as we'll

see later), and extremely easy to create. We'll start with the creation of the numbers table. This will work in

SQL Server 2000 or SQL Server 2005:

-- Create the numbers table
CREATE TABLE dbo.Numbers (Num INT NOT NULL PRIMARY KEY CLUSTERED);
GO

Well that was simple enough. Now comes the fun part - populating it. There are lots of methods to populate the

numbers table, but we'll start with the most obvious solution first. This method is a simple

WHILE loop that will populate our numbers table one row at a time:

DECLARE @i INT;
SELECT @i = 1;
WHILE @i <= 10000
BEGIN
    INSERT INTO dbo.Numbers(Num) VALUES (@i);
    SELECT @i = @i + 1;
END;

This simple procedural solution works on both SQL 2000 and SQL 2005. There are numerous other clever ways to

create a numbers table. Here's one of my personal favorites (thanks to Jeff Moden and all others who posted it

for public consumption). This method uses the Transact-SQL SELECT INTO

statement and IDENTITY function to simultaneously create and populate

the table, so we have to drop the previously created numbers table. This example works on SQL 2000 or SQL 2005:

-- Be sure to drop the numbers table if it exists
DROP TABLE dbo.Numbers;
GO
-- Now re-create it and fill it with sequential numbers starting at 1
SELECT TOP 10000 IDENTITY(INT,1,1) AS Num
INTO dbo.Numbers
FROM master.INFORMATION_SCHEMA.COLUMNS i1
CROSS JOIN master.INFORMATION_SCHEMA.COLUMNS i2;
GO
-- Add a primary key/clustered index to the numbers table
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Num);
GO

Notice that this set-based method offers a significant speed improvement over the WHILE loop version. This simple

query relies on the fact that a self-CROSS JOIN creates an exponential result set. If your

master.INFORMATION_SCHEMA.COLUMNS table has over 400 rows in it (not unreasonable), the CROSS JOIN

will generate over 160,000 rows. If you need more just slap another CROSS JOIN in there. We're using the

TOP keyword on the SELECT statement to limit the results to 10,000 rows for our purposes though.

Feel free to populate your numbers table with as many numbers as you need up to the upper limit for an INT,

which happens to be 2,147,483,647.

Now that we have a numbers table it's time to put it to work solving some problems.

The Numbers Table in Action

The power of the numbers table is that it can be used to convert procedural solutions to set-based solutions.

Consider the ubiquitous comma-delimited string-split function. As before we'll start with the procedural version

which basically consists of a WHILE loop. This code works on

SQL 2000 and SQL 2005:

CREATE FUNCTION dbo.fnProceduralSplit(@string VARCHAR(8000))
RETURNS @ret TABLE (
    Num INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    String VARCHAR(255))
AS
BEGIN
    DECLARE @i INT, @j INT;
    SELECT @i = 1;
    SELECT @j = CHARINDEX(',', @string);
    WHILE (@j > 0)
    BEGIN
        INSERT INTO @ret (String)
        VALUES (SUBSTRING(@string, @i, @j - @i));
        SELECT @i = @j + 1;
        SELECT @j = CHARINDEX(',', @string, @i);
    END;
    INSERT INTO @ret (String)
    VALUES (SUBSTRING(@string, @i, LEN(@string) - @i + 1));
    RETURN;
END
GO

All this code does is loop through the string passed in as a parameter looking for commas. It breaks out the

strings between any commas it finds. It's probably about as efficient as you can get for procedural code. So

now let's see what the equivalent function looks like in a set-based version on SQL 2000 and SQL 2005.

In this version we'll let SQL Server do all the heavy lifting. We'll use an inline table-valued function

for performance, and we'll return the number representing the position of each comma in the list to keep the

relative ordering of strings in the list:

CREATE FUNCTION dbo.fnSetSplit (@String VARCHAR(8000))
RETURNS TABLE
AS
RETURN (
    SELECT Num,
        SUBSTRING(@String,
            CASE Num
                WHEN 1 THEN 1
                ELSE Num + 1
            END,
            CASE CHARINDEX(',', @String, Num + 1)
                WHEN 0 THEN LEN(@String) - Num + 1
                ELSE CHARINDEX(',', @String, Num + 1) - Num -
                    CASE 
                        WHEN Num > 1 THEN 1
                        ELSE 0
                    END
               END
            ) AS String
    FROM dbo.Numbers
    WHERE Num <= LEN(@String)
        AND (SUBSTRING(@String, Num, 1) = ','
            OR Num = 1)
)

You can test the functionality of your new UDF's with queries like the following:

SELECT * FROM dbo.fnSetSplit ('Jack,Jill,Hill,Water');
SELECT * FROM dbo.fnSetSplit ('Alexander The Great');
SELECT * FROM dbo.fnSetSplit ('Red,Green,Blue');
SELECT * FROM dbo.fnProceduralSplit ('Army,Navy,Air Force,Marines');
SELECT * FROM dbo.fnProceduralSplit ('Washington,Jefferson,Adams');
SELECT * FROM dbo.fnProceduralSplit ('1,2,3,4,5,6,7,8');

On an individual basis, for single queries like those above you might not notice a large difference in performance,

but for thousands of such queries in a short time period, the difference can be significant. I ran several rounds

of comparisons splitting 100,000 randomly generated comma-separated lists of "words"*. Here

are the averages:

 SQL 2000SQL 2005
Procedural Function130,783 ms86,033 ms
Set-based Function55,390 ms51,423 ms**

Notes:

*The "words" were acutally randomly generated strings from 1 to 20 characters in length.

**In SQL 2005, using the CROSS APPLY operator reduced the time of the

set-based test to 27,296 ms.

Of course performance may vary on your system, but the results I got show a definite improvement when using

the set-based version over the procedural version.

Here Comes the Second Example

OK, still not convinced? Then let's try another example. Here we'll eliminate duplicate side-by-side letters from

a string. This is a commonly performed operation in phonetic encoding and approximate search routines, and here

we'll do it in a simple set-based UDF that will work on SQL 2000 or 2005:

CREATE FUNCTION dbo.fnReduceDupes(@string VARCHAR(8000))
RETURNS VARCHAR(8000)
AS
BEGIN
    DECLARE @Result VARCHAR(8000);
    SELECT @Result = @string;
    SELECT @Result =
        STUFF(@Result, Num, 1,
            CASE SUBSTRING(@Result, Num, 1)
                WHEN SUBSTRING(@Result, Num + 1, 1) THEN '!'
                ELSE SUBSTRING(@Result, Num, 1)
            END)
    FROM dbo.Numbers
    WHERE Num <= LEN(@Result);
    SELECT @Result = REPLACE(@Result, '!', '');
    RETURN @Result;
END

This UDF reduces sequences of side-by-side duplicate characters to a single character. It does this by first

replacing all but the last of any sequence of side-by-side duplicate characters with an exclamation point

('!', although any character could be used.) It then uses the Transact-SQL REPLACE function to remove

all exclamation points from the input string. Again, this function would be a perfect candidate for

a WHILE loop, but the numbers table lets us tell SQL Server to do the hard work for us. Here are some samples

to try:

SELECT dbo.fnReduceDupes ('AAAABBBBBCCDDDDEEEEFFGGGGHIIIIJJKK');
SELECT dbo.fnReduceDupes('HHHHHIIIIII TTTTHHHEEEERRRRRRREEEE');

One Last Time

OK, so we can see the usefulness of a numbers table in creating set-based position-dependent string functions. But

all the world is not a string; and often well need to apply business logic that has nothing to do with string

manipulation.

Let's consider the SQL 2000 Northwind Database for a moment. This database has a very simple "inventory"

system of sorts -- the kind you might create for a very basic Webstore. Basically the dbo.Products

table has a single column named UnitsInStock. This is your inventory. I say it's very

basic because it lacks any lot/bin or tracking information. The items and quantities of those items ordered are

stored in the ProductID and Quantity columns of the

dbo.[Order Details] table.

The problem is this: we need to match up the items in inventory to specific customer orders. The only real issue

we might face is the total quantity of a particular product on order totals more than the total quantity of that

product in inventory. I.e., we've sold or promised more than we have on hand. In a more detailed inventory

tracking system (such as the SQL 2005 AdventureWorks shelf/bin inventory system), we are looking at more potential

issues (i.e., a customer orders more of a product than is contained in one bin so we have to pull from multiple

bins to fill one order, assigning items from bins to customer orders, etc.)

A row-by-row cursor might seem like the obvious solution for this problem. After all, we need to assign each item

we have in inventory to a customer order detail row. Also we need to keep track of the inventory quantity to make

sure we don't over-commit and try to ship more product than we have on hand. Or do we?

Here I'll present a SQL 2000 Northwind solution to this problem:

SELECT Order_Items.OrderID, Product_Inventory.ProductID,
    COUNT(*) AS ItemsFromInventory
FROM
(
    SELECT ProductID, Num
    FROM dbo.Numbers n
    INNER JOIN dbo.[Products] p
    ON n.Num BETWEEN 1 AND p.UnitsInStock
) Product_Inventory
INNER JOIN
(
    SELECT OrderID, ProductID,
    (
        SELECT COALESCE(SUM(o1.Quantity),0)
        FROM dbo.[Order Details] o1
        WHERE o1.ProductID = o.ProductID
            AND o1.OrderID < o.OrderID
        ) + n.Num AS Num
    FROM dbo.Numbers n
    INNER JOIN dbo.[Order Details] o
        ON n.Num BETWEEN 1 AND o.Quantity
) Order_Items
ON Product_Inventory.ProductID = Order_Items.ProductID
AND Product_Inventory.Num = Order_Items.Num
GROUP BY Order_Items.OrderID, Product_Inventory.ProductID
ORDER BY Order_Items.OrderID, Product_Inventory.ProductID;

OK, so what just happened here?

To begin with, we joined each product we have in inventory to the numbers table using

ON n.Num BETWEEN 1

AND p.UnitsInStock. What this does is individually number each product item

ordered. By way of example, if we have two of Product ID 998 in stock and three of Product ID 999 in stock, the

Product_Inventory subquery will generate the following result:

ProductIDNum
9981
9982
9991
9992
9993

Next we assigned a number to each item on each order. This is the Order_Items subquery.

It's more complex than the Product_Inventory query because we have to account for both

ProductID and OrderID. The complexity comes because the

items ordered and their quantities are spread out over several rows on several different orders. But we want the

count to go from 1...n where n is the total quantity of a single ordered item across all orders.

Functionally we're just replicating SQL 2005's ROW_NUMBER functionality, which will

do this exact same thing for us when the optional PARTITION BY clause is specified.

We INNER JOIN these two subqueries by their

ProductID and Num columns, which effectively assigns one

item from inventory to one item on each order. We don't have to worry about assigning more items than we have in

inventory since the INNER JOINs in the subqueries won't let us assign

more of an item than we have in inventory. After assigning each inventory item to an order detail item, we group

the results. The COUNT(*) of each group gives us the

total number of each item from inventory that was assigned to a given order.

The SQL 2005 version is similar, although the ROW_NUMBER function and

CTE's will probably make the job easier. Of course, as mentioned, the AdventureWorks sample database uses a

shelf/bin inventory system that's also slightly more complex. I'll leave the SQL 2005 version of this query as

an exercise for the reader.

Conclusions

Numbers tables are an extremely useful tool for writing efficient set-based code to replace cursors and other

procedural constructs.

Rate

5 (1)

You rated this post out of 5. Change rating

Share

Share

Rate

5 (1)

You rated this post out of 5. Change rating