Blog Post

Excel in T-SQL Part 1 – HARMEAN, GEOMEAN and FREQUENCY

,

Today’s blog will be the first in a multi-part series on replicating Excel functions in T-SQL, starting with HARMEAN, GEOMEAN and FREQUENCY.

We’ll focus our solutions on creating T-SQL in-line, Table Valued Functions (iTVFs), introduced in SQL Server 2005, that apply the solution technique to a specifically designed table of sample data.  If you plan on using any of these solutions, this means you’ll need to adapt them to your table of interest.  Due to the need to remain somewhat brief in my explanations, I probably won’t be doing a lot of performance testing of these iTVFs (I certainly will not in today’s blog).  But you can rest assured I’ve put my head to the grindstone in an attempt to do my best to incorporate all of the performance best practices I know of when writing this library of functions.

Some Sample Data

We’ll need to construct some sample data to validate our T-SQL code against, so let’s start with a table and some sample data.

-- A table to hold our sample data
CREATE TABLE dbo.ExcelExamples
(
    Grp         INT
    ,Value      INT
);
GO
INSERT INTO dbo.ExcelExamples (Grp, Value)
-- Grp identifiers 1, 2, 3 and 4 
SELECT 1, n
FROM (VALUES(15),(20),(35),(40),(50)) a (n)
UNION ALL 
SELECT 2, n
FROM (VALUES(3),(6),(7),(8),(8),(10),(13),(15),(16), (20)) a (n)
UNION ALL 
SELECT 3, n
FROM (VALUES(3),(6),(7),(8),(8),(9),(10),(13),(15),(16), (20)) a (n)
UNION ALL
SELECT 4, n
FROM (VALUES(1),(2),(3),(4)) a(n);

Let’s construct an Excel spreadsheet containing this sample data and show Excel’s results for the three functions we’ll attempt to replicate.  The Breaks column represents one of the arguments to the FREQUENCY function.

BLOG - Excel in T-SQL Part 1

In order to use the Excel FREQUENCY function, I needed to learn how to enter array formulas, which you’ll also need to do if you want to try some examples of your own that are like the above.

Harmonic Mean

The harmonic mean of the set of observations represented by our sample can be described as the total number of observations divided by the sum of the reciprocals of the data elements.  If you are into the mathematical representation of this, you can visit the linked Wiki page.

This is actually quite straightforward in T-SQL:

SELECT Grp, HarmonicMean=COUNT(*)/SUM(1./Value)
FROM dbo.ExcelExamples
GROUP BY Grp;
-- Results:
GrpHarmonicMean
126.28285356807711767
28.00674561002681970
38.08789038406971559
41.92000000000307200

Note the correspondence between returned values and those calculated by Excel above.

To keep things neat and tidy, we’ll encapsulate this in an iTVF like I said I would in the introduction.

CREATE FUNCTION dbo.Excel_HarmonicMean
(
    @Grp    INT
)
RETURNS TABLE WITH SCHEMABINDING
RETURN
SELECT Grp=@Grp, HarmonicMean=COUNT(*)/SUM(1./a.Value)
FROM dbo.ExcelExamples a
WHERE @Grp IS NULL OR @Grp = a.Grp;

Which we can call either for a single group or the entire set as follows.

-- Call the iTVF for each group (Grp) in the set
SELECT a.Grp, b.HarmonicMean
FROM
( 
    SELECT a.Grp
    FROM dbo.ExcelExamples a
    GROUP BY a.Grp
) a
CROSS APPLY dbo.Excel_HarmonicMean(a.Grp) b;
-- Call the iTVF with NULL for the Grp to apply to all values in the table
SELECT Grp, HarmonicMean
FROM dbo.Excel_HarmonicMean(NULL);
-- Results:
Grp  HarmonicMean
1    26.28285356807711767
2    8.00674561002681970
3    8.08789038406971559
4    1.92000000000307200
Grp  HarmonicMean
NULL 6.14429720701647651

You can also check that the harmonic mean for all of the sample data matches our Excel spreadsheet example above.  These results show slightly more precision than Excel does, but presumably you can adjust that if you are interested.

Geometric Mean

The geometric mean of the set of observations represented by our sample data can be described as the nth root of the product of the values in the sample.  An alternative description is to find the product of the values in the sample, and raise that to the power that is the reciprocal of n.  Once again, the Wiki page will show you the mathematical representation of this calculation, if that would be more clear to you.

T-SQL has a SQRT function but it does not have a function for calculating the nth root.  It does have a POWER function, which we can use as in the secondary description.

Of course, the real challenge is calculating the product of all items in the sample, which could turn out to be quite large and end up causing an arithmetic overflow quite quickly.  To deal with this issue, we’re going to use a Stupid T-SQL Trick for Logarithms that we learned from an earlier blog.

Let’s look at the results from the following query:

SELECT Grp, SUM(LOG10(Value)), COUNT(*)
FROM dbo.ExcelExamples
GROUP BY Grp
ORDER BY Grp;
-- Results:
Grp  (No column name)   (No column name)
1    7.32221929473392   5               
2    9.70173510878387   10              
3    10.6559776182232   11              
4    1.38021124171161   4 

By summing the LOG10 values, we’ve got a logarithmic number that can be converted back to the product of the numbers that were summed by using POWER(10.0, x), where x is SUM(LOG10(Value)).  That could be quite problematic due to the potential for arithmetic overflow, but is possible for the set of sample data we’ve provided.  If we cast the SUM to a FLOAT, we’ll have a much wider range of values available, albeit at a loss of precision.

Also note that LOG10(0.0) is undefined and will return an error, so if you’ve got zero data points anywhere, you’ll need to handle that case as well.

SELECT Grp, SUM(LOG10(Value)), COUNT(*)
    ,CAST(POWER(10., CAST(SUM(LOG10(Value)) AS FLOAT)) AS FLOAT) 
FROM dbo.ExcelExamples
GROUP BY Grp
ORDER BY Grp;
-- Results:
Grp    (No column name)   (No column name)  (No column name)
1      7.32221929473392   5                 21000000
2      9.70173510878387   10                5031936000
3      10.6559776182232   11                45287424000
4      1.38021124171161   4                 24

For a quick check, we can see that 1*2*3*4 = 24 (the last row or where Grp=4).

We’re not quite there yet, because we still need to take the nth root, but we can now calculate the geometric mean thusly:

SELECT Grp, GeometricMean=
    POWER(CAST(POWER(10., CAST(SUM(LOG10(Value)) AS FLOAT)) AS FLOAT)
        , 1./COUNT(*))              -- Reciprocal of count gives nth root
FROM dbo.ExcelExamples
GROUP BY Grp
ORDER BY Grp;
-- Results:
Grp   GeometricMean
1     29.1369345857619
2     9.33627232372056
3     9.30518984213118
4     2.21336383940064

Our results indicate that we seem to have nailed the geometric mean as calculated by the Excel GEOMEAN function.

Let’s now do what we always like to do and said we were going to do in the introduction, and that is to encapsulate this in an iTVF.

CREATE FUNCTION dbo.Excel_GeometricMean
(
    @Grp    INT
)
RETURNS TABLE WITH SCHEMABINDING
RETURN
SELECT Grp=@Grp, GeometricMean=
    POWER(CAST(POWER(10., CAST(SUM(LOG10(a.Value)) AS FLOAT)) AS FLOAT)
        , 1./COUNT(*))              -- Reciprocal of count gives nth root
FROM dbo.ExcelExamples a
WHERE @Grp IS NULL OR @Grp = a.Grp;

And we can now either call that function for a group or we can calculate the geometric mean across all of the sample data.

-- Call the iTVF for each group (Grp) in the set
SELECT a.Grp, b.GeometricMean
FROM
( 
    SELECT a.Grp
    FROM dbo.ExcelExamples a
    GROUP BY a.Grp
) a
CROSS APPLY dbo.Excel_GeometricMean(a.Grp) b;
-- Call the iTVF with NULL for the Grp to apply to all values in the table
SELECT Grp, GeometricMean
FROM dbo.Excel_GeometricMean(NULL);
-- Results:
Grp   GeometricMean
1     29.1369345857619
2     9.33627232372056
3     9.30518984213118
4     2.21336383940064
Grp   GeometricMean
NULL  9.3040372829131

Once again, note the correspondence between returned values and those calculated by Excel above.

Frequency

Building an iTVF for FREQUENCY is going to be just a little more challenging than the two cases we’ve studied above.  Let’s take a look at a specialized query targeting our first Grp to get started.

DECLARE @Grp  INT = 1;
WITH Ranges (r) AS
(
    -- The ranges of interest for Grp=1 (refer to the Excel example)
    SELECT 10 UNION ALL SELECT 25 UNION ALL SELECT 40
),
    RangesExt (r, rn) AS
(
    -- Add a row number (i.e., extend the ranges)
    SELECT r, rn=ROW_NUMBER() OVER (ORDER BY r)
    FROM Ranges
),
    JoinedRanges AS
(
    SELECT Grp, r1, r2 
    FROM
    (
        -- Full join to increase range rows from 3 to 4
        SELECT r1=a.r, r2=b.r
        FROM RangesExt a
        FULL JOIN RangesExt b ON a.rn = b.rn - 1
    ) a
    CROSS APPLY (SELECT Grp=@Grp) b
)
SELECT *
FROM JoinedRanges;
-- Results:
Grpr1r2
1NULL10
11025
12540
140NULL

If you’ve looked at how FREQUENCY works, it constructs four rows from a three value range (each value being the upper end of a range), in order to find anything less than the second value (first array value returned), greater than the first value (last array value returned), or otherwise between intermediate values.  More than three rows in the range of course always results in one additional row of results.

Our r1 and r2 columns can be used to perform the range checks described above.  In fact, they’re really the heart of the function we’ll show you next.

Note that Excel’s FREQUENCY function may only calculate counts between zero and the value in the first row, whereas what we’re about to do extends the lower end of the range into negative numbers.  The documentation was unclear and I didn’t personally check it out.  But we encourage our valued readers to do so and modify my code accordingly if it better suits your needs that way.

Before we create a function to do the heavy lifting here, we want to mention Table Valued Parameters (TVPs).  This is a way of creating and using tables based on a pattern on-the-fly (sort of) in T-SQL.  I’ve written before about how TVPs can be used.  Let’s create one now (this is done by the CREATE TYPE AS TABLE statement):

CREATE TYPE dbo.Excel_FrequencyRanges AS TABLE
(
    [Range] INT
);
GO
DECLARE @Sample_Range dbo.Excel_FrequencyRanges;
INSERT INTO @Sample_Range
SELECT 10 UNION ALL SELECT 25 UNION ALL SELECT 40;
SELECT *
FROM @Sample_Range;
-- Results:
Range
10
25
40

TVPs can be created as above and passed into a T-SQL function, assuming you are using at least SQL 2008.

So now here’s my take on a T-SQL iTVF to calculate frequency. Note how I pass in the TVP as the second parameter to the function.

CREATE FUNCTION dbo.Excel_Frequency
(
    @Grp        INT
    ,@Ranges    dbo.Excel_FrequencyRanges READONLY
)
RETURNS TABLE WITH SCHEMABINDING
RETURN
WITH RangesExt (r, rn) AS
(
    -- Add a row number
    SELECT r=[Range], rn=ROW_NUMBER() OVER (ORDER BY [Range])
    FROM @Ranges
),
    JoinedRanges AS
(
    SELECT Grp, r1, r2 
        -- Create a string to represent the range
        ,[Range]=CASE 
                WHEN a.r1 IS NULL
                THEN 'x less than or equal to ' + CAST(a.r1 AS VARCHAR(20))
                WHEN a.r2 IS NULL
                THEN 'x greater than ' + CAST(a.r1 AS VARCHAR(20))
                ELSE CAST(a.r1 AS VARCHAR(20)) + ' less than x less than or equal to +
                    CAST(a.r2 AS VARCHAR(20))
                END
    FROM
    (
        -- Full join to increase range rows from 3 to 4
        SELECT r1=a.r, r2=b.r
        FROM RangesExt a
        FULL JOIN RangesExt b ON a.rn = b.rn - 1
    ) a
    CROSS APPLY (SELECT Grp=@Grp) b
)
SELECT a.Grp, a.[Range], Frequency=COUNT(Value)
    ,[Order]=ROW_NUMBER() OVER (ORDER BY r1)
FROM JoinedRanges a
OUTER APPLY
(
    SELECT Grp, Value
    FROM dbo.ExcelExamples b
    WHERE (@Grp IS NULL OR a.Grp = b.Grp) AND
        (r1 IS NULL OR b.Value > r1) AND
        (r2 IS NULL OR b.Value <= r2)
) b
GROUP BY a.Grp, a.[Range], a.r1;

I’ve done a couple of things in our function here that perhaps require explanation:

  • I’ve formatted the ranges we’re interested in into character strings that list out the values compared to an “x” – the column of interest, which in the case of our table is Value.
  • I’ve returned an [Order] column to help in ordering the results after their return should we need it for display purposes.

Let’s run this bad boy through its paces.

DECLARE @Sample_Range dbo.Excel_FrequencyRanges;
-- Create the range for Grp=1
INSERT INTO @Sample_Range
SELECT 10 UNION ALL SELECT 25 UNION ALL SELECT 40;
-- Display the results for Grp=1
SELECT *
FROM dbo.Excel_Frequency(1, @Sample_Range);
-- Clear the table so we can create the range for Grp IN (2, 3)
DELETE FROM @Sample_Range;
-- Create the range for Grp IN (2, 3)
INSERT INTO @Sample_Range
SELECT 5 UNION ALL SELECT 10 
UNION ALL SELECT 15 UNION ALL SELECT 20;
-- Display the results for Grp IN (2, 3)
SELECT a.Grp, b.[Range], b.Frequency
FROM
(
    SELECT Grp
    FROM dbo.ExcelExamples
    WHERE Grp IN (2, 3)
    GROUP BY Grp
) a
CROSS APPLY dbo.Excel_Frequency(a.Grp, @Sample_Range) b
ORDER BY b.Grp, b.[Order];
-- Clear the table so we can create the range for Grp=4
DELETE FROM @Sample_Range;
-- Create the range for Grp=4
INSERT INTO @Sample_Range
SELECT 1 UNION ALL SELECT 3;
-- Display the results for Grp=1
SELECT *
FROM dbo.Excel_Frequency(4, @Sample_Range);
-- Clear the table so we can create the range across all data points
DELETE FROM @Sample_Range;
-- Create the range to use across all data points
INSERT INTO @Sample_Range
SELECT 5 UNION ALL SELECT 10 
UNION ALL SELECT 15 UNION ALL SELECT 20
UNION ALL SELECT 30;
-- Call the iTVF with NULL for the Grp to apply to all values in the table
SELECT *
FROM dbo.Excel_Frequency(NULL, @Sample_Range);

The script above produces the four results sets shown below.

Grp  Range                                    Frequency  Order
1    x less than or equal to 10               0          1
1    10 less than x less than or equal to 25  2          2
1    25 less than x less than or equal to 40  2          3
1    x greater than 40                        1          4
Grp  Range                                    Frequency
2    x less than or equal to 5                1
2    5 less than x less than or equal to 10   5
2    10 less than x less than or equal to 15  2
2    15 less than x less than or equal to 20  2
2    x greater than 20                        0
3    x less than or equal to 5                1
3    5 less than x less than or equal to 10   6
3    10 less than x less than or equal to 15  2
3    15 less than x less than or equal to 20  2
3    x greater than 20                        0
Grp  Range                                    Frequency  Order
4    x less than or equal to 1                1          1
4    1 less than x less than or equal to 3    2          2
4    x greater than 3                         1          3
Grp  Range                                    Frequency  Order
NULL x less than or equal to 5                6          1
NULL 5 less than x less than or equal to 10   11         2
NULL 10 less than x less than or equal to 15  5          3
NULL 15 less than x less than or equal to 20  5          4
NULL 20 less than x less than or equal to 30  0          5
NULL x greater than 30                        3          6

You might want to go back and check those frequencies against the Excel example just to be sure I’m not trying to pull a fast one on you.

So there you have it!  Our first attempt at replicating a few Excel functions in T-SQL.

And here you may have been thinking that I was going to teach you how to excel at writing SQL when you first read that title!  Maybe that will come later, once I’ve figured it out myself.

Follow me on Twitter: @DwainCSQL

© Copyright Dwain Camps 29 Apr 2015.  All rights reserved.

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating