## Introduction

As SQL developers, we try to implement the most concise and well-performing SQL code that conforms to the specification we are given. We also tend to think of performance tuning in terms of crafting the best table indices, avoiding scalar and table valued functions, and analyzing query plans (among other things). But sometimes reformulating the spec by applying some properties of elementary math can be the best way to begin to optimize the performance of SQL queries which implement mathematical formulas. This article is a case study of how I used this technique to optimize my SQL implementation of the Inverse Simpson Index.

## Inverse Simpson Index: The Investment Portfolio Effective Number Use Case

Recently, I wrote a query to compute the Inverse Simpson Index on a table of investment portfolios. The table is defined as:

CREATE TABLE [dbo].[investment_portfolios]( [portfolio_id] [smallint] NOT NULL, [investment_id] [smallint] NOT NULL, [reporting_date] [datetime] NOT NULL, [current_amount] [float] NOT NULL, CONSTRAINT [PK_investment_portfolios] PRIMARY KEY CLUSTERED ( [portfolio_id] ASC, [investment_id] ASC, [reporting_date] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY]

It is essentially a time series of investment portfolios. Each portfolio consists of a set of investments and each investment has a current amount for each reporting date. A portfolio’s contents can change over time. Each investment's current amount can change due to gains, losses, and share quantity adjustments. The set of investments in a portfolio can vary with purchases of new investments and sales of existing investments.

My spec was the formula for the Inverse Simpson Index, a measure of diversity which is mathematically defined as:

1/SUM([w_{i}]^{2})

where w_{i} in our case represents the weight of each investment in the portfolio. Each investment’s weight as of a given reporting date is equal to the investment amount divided by the total portfolio amount. So, in our case, the Inverse Simpson Index is a measure of the *effective number* of investments in a portfolio.

A simple example illustrates the difference between the *actual number* and the *effective number*. Suppose portfolio #11 and portfolio #22 each have a total of $100 spread across two investments as of a given reporting date. The difference between them is that portfolio #11 has $50 in each while portfolio #22 has $99 in one and $1 in the other. Each portfolio has two actual investments. However, the effective number of investments in portfolio #11 is two, while for portfolio #22 it is closer to one because one investment represents the lion’s share of that portfolio.

## Implementation

There are a few parts to the implementation, each described below. I walk through the ways that I tried to optimize the effective number.

### Actual Number

A trivial SELECT statement with an aggregate function calculates the actual number of investments in each portfolio:

SELECT portfolio_id, reporting_date, COUNT(*) AS actual_number FROM investment_portfolios GROUP BY portfolio_id, reporting_date ORDER BY portfolio_id, reporting_date DESC

### Effective Number: Straightforward Approach

Implementing the Inverse Simpson Index formula to calculate the effective number of investments in each portfolio would appear to be more complicated. The following code uses a CTE, but it could just as well be implemented with a subquery or a CROSS APPLY.

;WITH portfolio_sum AS ( SELECT portfolio_id, reporting_date, SUM(current_amount) AS portfolio_amount FROM investment_portfolios GROUP BY portfolio_id, reporting_date ) SELECT i.portfolio_id, i.reporting_date, COUNT(*) AS actual_number, 1/SUM(SQUARE(i.current_amount/NULLIF(p.portfolio_amount, 0))) AS effective_number FROM investment_portfolios i JOIN portfolio_sum p ON i.portfolio_id = p.portfolio_id AND i.reporting_date = p.reporting_date GROUP BY i.portfolio_id, i.reporting_date ORDER BY i.portfolio_id, i.reporting_date DESC

### Effective Number: First Optimization Attempt

Now the code for the straightforward approach is really not so complicated, but because my investment_porfolios table contains over a billion records I wanted to simplify it. Given that the portfolio_amount is the sum of the current_amount, my first attempt at eliminating the CTE and JOIN looked like this:

SELECT portfolio_id, reporting_date, COUNT(*) AS actual_number, 1/SUM(SQUARE(current_amount/NULLIF(SUM(current_amount), 0))) AS effective_number FROM investment_portfolios GROUP BY portfolio_id, reporting_date ORDER BY portfolio_id, reporting_date DESC

Unfortunately, all it produced was this error message:

Msg 130, Level 15, State 1, Line 4 Cannot perform an aggregate function on an expression containing an aggregate or a subquery.

Okay, yes, of course I knew that would happen! You can’t embed one aggregate function within another, in this case, a SUM within a SUM. But is there a way to disembed it?

### Effective Number: Optimization via Reformulation

In the summary of G. Polya’s classic book, *How to Solve It ^{[1]}*, he asks, “Could you restate the problem?”. Here the answer is, "Yes!"

Multiply the fraction by itself in order to square it.

[w_{i}]^{2} = (x_{i}/SUM(x_{j}))^{2} = [x_{i}]^{2}/(SUM(x_{j}))^{2} where x is the current investment amount

Now apply the distributive property of multiplication.

SUM([w_{i}]^{2}) = SUM([x_{i}]^{2})/(SUM(x_{j}))^{2}

Finally, just flip this fraction over to obtain its inverse, and disembed the SUM as follows:

SELECT portfolio_id, reporting_date, COUNT(*) AS actual_number, SQUARE(SUM(current_amount))/SUM(SQUARE(current_amount)) AS effective_number FROM investment_portfolios GROUP BY portfolio_id, reporting_date ORDER BY portfolio_id, reporting_date DESC

## Conclusion

Not every SQL implementation of a mathematical formula has an elegant solution, but sometimes reformulating the spec by applying some elementary math can be the best way to begin to optimize its performance.

- G. Polya, “How to Solve It”, 2
^{nd}ed., Princeton University Press, 1957, ISBN 0-691-08097-6. ?