As SQL developers, we try to implement the most concise and well-performing SQL code that conforms to the specification we are given. We also tend to think of performance tuning in terms of crafting the best table indices, avoiding scalar and table valued functions, and analyzing query plans (among other things). But sometimes reformulating the spec by applying some properties of elementary math can be the best way to begin to optimize the performance of SQL queries which implement mathematical formulas. This article is a case study of how I used this technique to optimize my SQL implementation of the Inverse Simpson Index.
Inverse Simpson Index: The Investment Portfolio Effective Number Use Case
Recently, I wrote a query to compute the Inverse Simpson Index on a table of investment portfolios. The table is defined as:
CREATE TABLE [dbo].[investment_portfolios]( [portfolio_id] [smallint] NOT NULL, [investment_id] [smallint] NOT NULL, [reporting_date] [datetime] NOT NULL, [current_amount] [float] NOT NULL, CONSTRAINT [PK_investment_portfolios] PRIMARY KEY CLUSTERED ( [portfolio_id] ASC, [investment_id] ASC, [reporting_date] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY]
It is essentially a time series of investment portfolios. Each portfolio consists of a set of investments and each investment has a current amount for each reporting date. A portfolio’s contents can change over time. Each investment's current amount can change due to gains, losses, and share quantity adjustments. The set of investments in a portfolio can vary with purchases of new investments and sales of existing investments.
My spec was the formula for the Inverse Simpson Index, a measure of diversity which is mathematically defined as:
where wi in our case represents the weight of each investment in the portfolio. Each investment’s weight as of a given reporting date is equal to the investment amount divided by the total portfolio amount. So, in our case, the Inverse Simpson Index is a measure of the effective number of investments in a portfolio.
A simple example illustrates the difference between the actual number and the effective number. Suppose portfolio #11 and portfolio #22 each have a total of $100 spread across two investments as of a given reporting date. The difference between them is that portfolio #11 has $50 in each while portfolio #22 has $99 in one and $1 in the other. Each portfolio has two actual investments. However, the effective number of investments in portfolio #11 is two, while for portfolio #22 it is closer to one because one investment represents the lion’s share of that portfolio.
There are a few parts to the implementation, each described below. I walk through the ways that I tried to optimize the effective number.
A trivial SELECT statement with an aggregate function calculates the actual number of investments in each portfolio:
SELECT portfolio_id, reporting_date, COUNT(*) AS actual_number FROM investment_portfolios GROUP BY portfolio_id, reporting_date ORDER BY portfolio_id, reporting_date DESC
Effective Number: Straightforward Approach
Implementing the Inverse Simpson Index formula to calculate the effective number of investments in each portfolio would appear to be more complicated. The following code uses a CTE, but it could just as well be implemented with a subquery or a CROSS APPLY.
;WITH portfolio_sum AS ( SELECT portfolio_id, reporting_date, SUM(current_amount) AS portfolio_amount FROM investment_portfolios GROUP BY portfolio_id, reporting_date ) SELECT i.portfolio_id, i.reporting_date, COUNT(*) AS actual_number, 1/SUM(SQUARE(i.current_amount/NULLIF(p.portfolio_amount, 0))) AS effective_number FROM investment_portfolios i JOIN portfolio_sum p ON i.portfolio_id = p.portfolio_id AND i.reporting_date = p.reporting_date GROUP BY i.portfolio_id, i.reporting_date ORDER BY i.portfolio_id, i.reporting_date DESC
Effective Number: First Optimization Attempt
Now the code for the straightforward approach is really not so complicated, but because my investment_porfolios table contains over a billion records I wanted to simplify it. Given that the portfolio_amount is the sum of the current_amount, my first attempt at eliminating the CTE and JOIN looked like this:
SELECT portfolio_id, reporting_date, COUNT(*) AS actual_number, 1/SUM(SQUARE(current_amount/NULLIF(SUM(current_amount), 0))) AS effective_number FROM investment_portfolios GROUP BY portfolio_id, reporting_date ORDER BY portfolio_id, reporting_date DESC
Unfortunately, all it produced was this error message:
Msg 130, Level 15, State 1, Line 4 Cannot perform an aggregate function on an expression containing an aggregate or a subquery.
Okay, yes, of course I knew that would happen! You can’t embed one aggregate function within another, in this case, a SUM within a SUM. But is there a way to disembed it?
Effective Number: Optimization via Reformulation
In the summary of G. Polya’s classic book, How to Solve It, he asks, “Could you restate the problem?”. Here the answer is, "Yes!"
Multiply the fraction by itself in order to square it.
[wi]2 = (xi/SUM(xj))2 = [xi]2/(SUM(xj))2 where x is the current investment amount
Now apply the distributive property of multiplication.
SUM([wi]2) = SUM([xi]2)/(SUM(xj))2
Finally, just flip this fraction over to obtain its inverse, and disembed the SUM as follows:
SELECT portfolio_id, reporting_date, COUNT(*) AS actual_number, SQUARE(SUM(current_amount))/SUM(SQUARE(current_amount)) AS effective_number FROM investment_portfolios GROUP BY portfolio_id, reporting_date ORDER BY portfolio_id, reporting_date DESC
Not every SQL implementation of a mathematical formula has an elegant solution, but sometimes reformulating the spec by applying some elementary math can be the best way to begin to optimize its performance.
- G. Polya, “How to Solve It”, 2nd ed., Princeton University Press, 1957, ISBN 0-691-08097-6. ?