Pure SQL Curve Fitting X/Y Points using T-SQL 2008

  • I've just finished implemented a simple univariate curve-fitting algorithm in T-SQL. This allows you to take N simple X/Y points and then extrapolate a curve that matches all of these points. This uses the principle that for any such list of N points, there is exactly one curve of maximum power N-1 that satisfies these points, for example, 3 points will have a curve of the form:

    y = a + bx + cx^2 ... (and so on up through to zx^(n-1))

    To use this code, you'll need SQL Server 2008, and the rights to create two table types and two functions. Let's start by declaring some useful table types:

    -- Point type

    CREATE TYPE Point AS TABLE (XCoordinate FLOAT, YCoordinate FLOAT);

    GO

    -- Curve fitting coefficient type

    CREATE TYPE Coefficient AS TABLE (Multiplier FLOAT, PowerOfX INT);

    GO

    Next up, the actual Gauss-Jordan curve fitting function, with appropriate commentry:

    /**

    * Function: fn_CurveFitPoints

    * Author: Steven James Gray [ steve@cobaltsoftware.net ]

    * Date: 24th February, 2010

    * Version: 1.0

    * Description:

    * Takes a series of points in a table variant in POINT (X float, Y float) format and

    * computes the n-1 power series curve fit using the Gauss-Jordan elimination process.

    * Return value is a series of Multiplier, Power elements that can be used for producing

    * an estimated Y value for any X value.

    *

    * Please refer to fn_ExtrapolateYValue(@xvalue, @coefficients) for a simple implementation

    * of how to use the output of this function.

    **/

    ALTER FUNCTION dbo.fn_CurveFitPoints

    (

    @PointsToCurveFit POINT readonly

    )

    RETURNS @Result TABLE (Multiplier FLOAT, PowerOfX INT)

    AS

    BEGIN

    -- ==========================================================================================

    -- Stage 1 - Convert @PointsToFit into a matrix

    -- ==========================================================================================

    DECLARE @Matrix TABLE (MatrixRow INT, MatrixColumn INT, MatrixValue FLOAT);

    DECLARE @TotalPoints INT = (SELECT COUNT(1) FROM @PointsToCurveFit);

    WITH NumberProjectionCTE(CurrentNumber)

    AS

    (

    SELECT 1

    UNION ALL

    SELECT 1+CurrentNumber FROM NumberProjectionCTE WHERE CurrentNumber < @TotalPoints

    ) INSERT INTO @Matrix

    SELECT

    Sequence-1, -- Each point gets it's own row

    PWR.CurrentNumber-1, -- Column per power of X

    CASE

    WHEN PWR.CurrentNumber = 1 -- 1st column is X^0 = 1 Always

    THEN 1

    ELSE POWER(XCoordinate,PWR.CurrentNumber-1) -- Raise nth column to power n-1.

    END

    FROM

    NumberProjectionCTE PWR, -- Cross join numeric point data and column indexes

    (SELECT

    ROW_NUMBER() OVER (ORDER BY XCoordinate, YCoordinate) AS Sequence,

    XCoordinate,

    YCoordinate

    FROM

    @PointsToCurveFit

    ) ValueData;

    /* Append Y values as nth column */

    INSERT INTO @Matrix

    SELECT

    ROW_NUMBER() OVER (ORDER BY XCoordinate, YCoordinate) - 1 AS Sequence,

    @TotalPoints,

    YCoordinate

    FROM

    @PointsToCurveFit;

    -- ==========================================================================================

    -- Stage 2 - Compute row echelon form of matrix

    -- ==========================================================================================

    DECLARE @lead INT = 0, @index INT = 0, @current FLOAT;

    DECLARE @Rows INT = (SELECT MAX(MatrixRow) FROM @Matrix);

    DECLARE @Columns INT = (SELECT MAX(MatrixColumn) FROM @Matrix);

    DECLARE @Solved INT -- 0=Unsolvable, 1 = Solved

    DECLARE @r INT = 0

    WHILE @r <= @Rows

    BEGIN

    IF @Columns <= @lead

    BEGIN

    -- Cannot solve this one

    SET @Solved = 0;

    BREAK;

    END;

    SET @index = @r;

    -- Determine if any row swaps are needed.

    WHILE (SELECT MatrixValue FROM @Matrix WHERE MatrixRow = @index AND MatrixColumn = @lead) = 0

    BEGIN

    SET @index = @index + 1;

    IF @Rows = @index

    BEGIN

    SET @index = @r;

    SET @lead = @lead + 1;

    IF @Columns = @lead

    BEGIN

    -- Cannot solve

    SET @Solved = 0;

    BREAK;

    END;

    END;

    END;

    -- Move this row to the correct position if needed.

    IF @index <> @r

    BEGIN

    -- Swap rows

    UPDATE @Matrix

    SET MatrixRow = CASE MatrixRow

    WHEN @r THEN @index

    WHEN @index THEN @r

    END

    WHERE MatrixRow IN (@index, @r);

    END;

    -- Divide this row by it's lead column value, so that this row's lead is 1 (this will actually multiply/increase the value if lead <0)

    DECLARE @Divisor FLOAT = (SELECT MatrixValue FROM @Matrix WHERE MatrixRow = @r AND MatrixColumn = @lead);

    If @Divisor <> 1

    BEGIN

    UPDATE @Matrix SET MatrixValue = MatrixValue / @Divisor WHERE MatrixRow = @r;

    END;

    -- Update other rows and divide them by the appropriate multiple of this row in order to zero the current lead column.

    UPDATE I

    SET

    MatrixValue = I.MatrixValue - (M.MatrixValue * R.MatrixValue)

    FROM

    @Matrix I

    INNER JOIN @Matrix M ON M.MatrixRow = I.MatrixRow AND M.MatrixColumn = @lead

    INNER JOIN @Matrix R ON R.MatrixColumn = I.MatrixColumn AND R.MatrixRow = @r AND R.MatrixRow <> I.MatrixRow

    SET @lead = @lead + 1;

    -- Move to next

    SET @r = @r + 1;

    END;

    -- If we didn't bomb out, we're solved.

    IF @Solved IS NULL

    BEGIN

    SET @Solved = 1

    END;

    -- ==========================================================================================

    -- Stage 3 - Produce coefficients list (The final colum when in REF)

    -- ==========================================================================================

    IF @Solved = 1

    BEGIN

    INSERT INTO @Result (Multiplier, PowerOfX)

    SELECT

    MatrixValue,

    MatrixRow

    FROM @Matrix

    WHERE MatrixColumn = @Columns;

    END;

    RETURN;

    END;

    GO

    To make use of this functions output, you'll need

    CREATE FUNCTION dbo.fn_ExtrapolateYValue

    (

    @XValue FLOAT,

    @Coefficients Coefficient readonly

    )

    RETURNS FLOAT

    AS

    BEGIN

    RETURN (SELECT SUM(Multiplier * POWER(@XValue, PowerOfX)) FROM @Coefficients);

    END

    A simple example, deducing Y=11 + 6x + x^2:

    DECLARE @PointsToCurveFit Point

    -- A few simple X/Y values

    INSERT INTO @PointsToCurveFit SELECT 1,6

    INSERT INTO @PointsToCurveFit SELECT 2,3

    INSERT INTO @PointsToCurveFit SELECT 3,2

    -- Calculate the curve fitting coefficients

    DECLARE @Coefficients Coefficient

    INSERT INTO @Coefficients SELECT * FROM dbo.fn_CurveFitPoints(@PointsToCurveFit);

    -- Shows that y= 11x^0 + 6x + x^2

    SELECT * FROM @Coefficients;

    -- Show the values for X=-5 to 5

    WITH NumberCTE(Number)

    AS

    (

    SELECT -5

    UNION ALL

    SELECT 1 + Number FROM NumberCTE WHERE Number < 5

    ) SELECT

    Number AS XValue,

    dbo.fn_ExtrapolateYValue(Number, @Coefficients) AS YValue

    FROM NumberCTE;

    The results are:

    XY

    -566

    -451

    -338

    -227

    -118

    011

    16

    23

    32

    43

    56

    Share and enjoy.

  • Isn't T-SQL the wrong tool for this sort of job?

    For the curve-fitting, pass a number of built-in Point spatial data types (with geometry and geography variants) to a user-defined aggregate to return the equation.

    For the extrapolation, pass the equation and some other parameters to a streaming CLR table-valued function to return points on that curve.

    Paul

  • I'd originally written this as a pair of CLR functions and used Geometry/Geography, but found little/no benefit in terms of performance/readability - it actually ended up being a lot more code, if anything. Once you take away formatting/spacing there's ~10 lines of actual elimination code left in the SQL. The T-SQL approach also allows for the use of set-operations fairly efficiently on large volumes of data (so for example, the last UPDATE statement would actually translate into dozens and dozens of lines of C#/VB .NET code to deal with the matrix).

    SQL functions also don't allow for dynamic column-sets, so a 3x3 matrix would need to come out in the same X, Y, Value sequence as I've used here. This means any CLR implementation would need a matrix class, a block of code to map the X,Y,V triplets into matrix cells, plus code to map it back out, not counting any code to actually do the actually.

    GJ elimination is effectively an O(N^3) algorithm, and one that doesn't lend itself naturally to parallelism, meaning you cant use SQL's inbuilt SMP and custom aggregates to speed it up either.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply