A genuine use for a SQL CLR Aggregate

  • odeddror (2/5/2014)


    Hi there,

    I'm using SQL Server 2012 x64 and VS 2013

    I followed your code and

    the T-SQL returns -0.99988192786007

    and the CLR returns 0.999766931939805 (a positive number instead of negative).

    Why are they not the same?

    Thanks,

    Oded Dror

    I'd need to see your input data to debug that.

  • T-SQL code - I created the stored proc and function:

    /* http://www.sqlservercentral.com/articles/SQLCLR/71942/ */
    --linear_regression
    USE Development
    GO

    CREATE TABLE dbo.test_table(
        col1 FLOAT NOT NULL,
        col2 FLOAT NOT NULL
    )
    GO

    INSERT INTO dbo.test_table
    SELECT TOP 30000 ROW_NUMBER() OVER(ORDER BY s1.id DESC), ROW_NUMBER() OVER(ORDER BY s1.id ASC)
    FROM sysobjects, sysobjects s1
    GO

    SELECT * FROM dbo.test_table
    GO

    --Create function that returns a table
    ALTER FUNCTION udf_linear_regression_X ()
    RETURNS TABLE
    AS
    RETURN
    (
        SELECT (COUNT(*) * SUM(col1*col2) - (SUM(col1) * SUM(col2))) / (COUNT(*) * SUM(col1*col1) - (SUM(col1) * SUM(col1))) AS LinearX
        FROM dbo.test_table
        --select * from dbo.udf_linear_regression_X()
    )
    GO

    --Create proc that returns one value
    ALTER PROC usp_linear_regression_X
    AS
    BEGIN
        DECLARE @x FLOAT
        DECLARE @y FLOAT
        DECLARE @xy FLOAT
        DECLARE @x2 FLOAT
        DECLARE @n FLOAT

        SELECT
            @n  = COUNT(*),
            @x  = SUM(col1),
            @y  = SUM(col2),
            @xy = SUM(col1 * col2),
            @x2 = SUM(col1 * col1)
        FROM dbo.test_table

        SELECT ((@n * @xy) - (@x * @y)) / (@n * @x2 - (@x * @x))
        -- execute usp_linear_regression_X
    END
    GO

    TRUNCATE TABLE dbo.test_table
    GO

    CREATE ASSEMBLY LinearRegression FROM 'C:\Regression.dll'
    GO

    CREATE AGGREGATE Slope(@x FLOAT, @y FLOAT)
    RETURNS FLOAT
    EXTERNAL NAME LinearRegression.Slope;
    GO

    SELECT dbo.Slope(col1, col2)
    FROM dbo.test_table
    -- 0.999766931939805   --CLR
    -- -0.99988192786007   --SQL sp
    -- -0.99988192786007   --SQL Func
    GO

    CREATE VIEW v_Linear_regression_X
    WITH SCHEMABINDING
    AS
    SELECT --( COUNT_BIG(*) * SUM(col1*col2) - ( SUM(col1) * SUM(col2)))/( COUNT_BIG(*)* SUM(col1*col1)-( SUM(col1)* SUM(col1))) AGG
        COUNT_BIG(*) cnt
        ,SUM(col1*col2) col1col2
        ,SUM(col1) col1
        ,SUM(col1*col1) col1sq
        ,SUM(col2) col2
    FROM dbo.test_table
    GO

    Code for Regression.CS:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    [Serializable]
    [SqlUserDefinedAggregate(
        Format.Native,                   //use CLR serialization to serialize the intermediate result
        IsInvariantToNulls = false,      //optimizer property
        IsInvariantToDuplicates = false, //optimizer property
        IsInvariantToOrder = true        //optimizer property
    )]
    public struct Slope
    {
        private long count;
        private SqlDouble SumY;
        private SqlDouble SumXY;
        private SqlDouble SumX2;
        private SqlDouble SumY2;
        private SqlDouble SumX;

        static void Main(string[] args)
        {
        }

        public void Init()
        {
            count = 0;
            SumX = SumY = SumXY = SumX2 = SumY2 = 0;
        }

        public void Accumulate(SqlDouble x, SqlDouble y)
        {
            if (!x.IsNull && !y.IsNull)
            {
                count++;
                SumX += (SqlDouble)x;
                SumY += (SqlDouble)y;
                SumXY += (SqlDouble)x * (SqlDouble)y;
                SumX2 += (SqlDouble)x * (SqlDouble)x;
                SumY2 += (SqlDouble)y * (SqlDouble)y;
            }
        }

        public void Merge(Slope other)
        {
        }

        public SqlDouble Terminate()
        {
            if (count > 0)
            {
                SqlDouble value = (count * SumXY - (SumX * SumY)) / Math.Sqrt((double)((count * SumX2) - (SumX * SumX)) * (double)((count * SumY2) - (SumY * SumY)));
                return value * value;
            }
            else
            {
                return 0;
            }
        }
    }

  • @odeddror I think the answer is in the Terminate function

    public SqlDouble Terminate()
    {
        if (count > 0)
        {
            SqlDouble value = (count * SumXY - (SumX * SumY)) / Math.Sqrt((double)((count * SumX2) - (SumX * SumX)) * (double)((count * SumY2) - (SumY * SumY)));
            return value * value;
        }
        else
        {
            return 0;
        }
    }

    value * value is the equivalent of squaring the result of the preceding formula, and the square of any number, whether positive or negative, is always positive.

    I've been hunting around for my old backup CDs, as I replaced my hard drive 18 months ago and don't have my original VS2008 solution.

    It takes a sharp eye to peer review some of the stats functions and the code that drives them, so I suspect I may have mixed up the slope and r-squared formulas.

    The slope should be negative. It looks like you are seeing the squared slope value.
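
    For anyone who wants to verify this from the T-SQL side, here is a minimal sketch (assuming the dbo.test_table loaded earlier in the thread) that computes both the slope and r-squared from the same set of sums; the CLR aggregate's output should line up with the RSquared column rather than the Slope column:

    -- Verification sketch only; assumes dbo.test_table from earlier in this thread.
    -- Slope = (n*Sxy - Sx*Sy) / (n*Sxx - Sx*Sx)
    -- r     = (n*Sxy - Sx*Sy) / SQRT((n*Sxx - Sx*Sx) * (n*Syy - Sy*Sy)),  r-squared = r * r
    SELECT
        Slope    = (n * Sxy - Sx * Sy) / (n * Sxx - Sx * Sx),
        RSquared = SQUARE((n * Sxy - Sx * Sy) / SQRT((n * Sxx - Sx * Sx) * (n * Syy - Sy * Sy)))
    FROM
    (
        SELECT
            n   = COUNT(*) * 1.0,
            Sx  = SUM(col1),
            Sy  = SUM(col2),
            Sxy = SUM(col1 * col2),
            Sxx = SUM(col1 * col1),
            Syy = SUM(col2 * col2)
        FROM dbo.test_table
    ) s;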

  • Hi there,

    It's still a positive number.

    Thanks,

    Ed Dror

  • Hi,

    I'm a bit late to this discussion, but we had a requirement to work out median values in SQL this week, so I was looking for a method to achieve this without having to use windowed functions, as I'm always trying to keep things simple for our analysts.  Using the examples in this post I was able to put together the following SQL-CLR aggregate.  I tested it with a dataset of 1,000,000+ rows and it worked really well.

    Steve.

    It wouldn't let me upload the file, so here is the code:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;
    using System.Collections;
    using System.Collections.Generic;
    using System.IO;

    [Serializable]
    [Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
        Format.UserDefined,              //Binary Serialization because of List<SqlDouble>
        IsInvariantToNulls = false,      //optimizer property
        IsInvariantToDuplicates = false, //optimizer property
        IsInvariantToOrder = true,       //optimizer property
        MaxByteSize = -1
    )]
    public struct MEDIAN : IBinarySerialize
    {
        private List<SqlDouble> lstValues;

        public void Init()
        {
            lstValues = new List<SqlDouble>();
        }

        public void Accumulate(SqlDouble x) //this is the input parameter from SQL
        {
            if (!x.IsNull)
            {
                lstValues.Add(x);
            }
        }

        public void Merge(MEDIAN Group)
        {
            lstValues.AddRange(Group.lstValues); //this merges the accumulated values if SQL decides to break this into threads
        }

        public SqlDouble Terminate() //this works out the result to return to SQL
        {
            if (lstValues.Count > 0)
            {
                lstValues.Sort(); //sort the list
                int iMiddleValue = (int)Math.Ceiling(lstValues.Count / 2.0) - 1; //will round up if odd valued list and take off 1 as lists are zero based
                SqlDouble dblMedian;
                if (lstValues.Count % 2 > 0) //odd valued list
                {
                    dblMedian = lstValues[iMiddleValue]; //take the middle value
                }
                else //even valued list
                {
                    dblMedian = (lstValues[iMiddleValue] + lstValues[iMiddleValue + 1]) / 2; //take the mean of the two middle values
                }
                return dblMedian;
            }
            else
            {
                return SqlDouble.Null;
            }
        }

        public void Read(BinaryReader r)
        {
            if (r == null) throw new ArgumentNullException("r");

            //SqlDouble value = r.ReadDouble();

            lstValues = new List<SqlDouble>();
            int pos = 0;
            int length = r.ReadInt32();
            while (pos < length)
            {
                lstValues.Add(r.ReadDouble());
                pos++;
            }
        }

        public void Write(BinaryWriter w)
        {
            if (w == null) throw new ArgumentNullException("w");
            w.Write(lstValues.Count);
            foreach (SqlDouble dbl in lstValues)
            {
                w.Write(dbl.Value);
            }
        }
    }
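
    If it helps anyone wire this up, here is a deployment and usage sketch; the assembly name, DLL path and target database are assumptions, not part of the post above:

    -- Deployment sketch (assembly name and path are assumed)
    CREATE ASSEMBLY MedianAggregate
    FROM 'C:\CLR\MedianAggregate.dll'
    WITH PERMISSION_SET = SAFE;
    GO

    CREATE AGGREGATE dbo.MEDIAN (@value FLOAT)
    RETURNS FLOAT
    EXTERNAL NAME MedianAggregate.MEDIAN;
    GO

    -- Usage, e.g. against the test table from earlier in the thread
    SELECT MedianCol2 = dbo.MEDIAN(col2)
    FROM dbo.test_table;
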
  • stephen.lightfoot wrote:

    Hi,

    I'm a bit late to this discussion but we had a requirement to work out median values in SQL this week so I was looking for a method to achieve this without having to use windowed functions as I'm always looking for ways to keep things simple for our analysts.  Using the examples in this post I was able to put together the following SQL-CLR function.  I tested it with a dataset with 1,000,000 + rows and it worked really well.

    Steve.

    That's cool and thank you for posting the code but what do you mean by "it worked really well"?  I'm sure that includes that it worked correctly but I'm curious as to how long it took.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

    I took a four-column data set containing waiting time measurements with 1,006,934 rows and then used the median function to calculate the median wait.  The query execution took 48 seconds and returned 47 rows.  I haven't had time to compare with SQL-only methods, but I think this is a pretty reasonable response time and the code looks really simple (please see below).

    DECLARE
        @StartDate AS DATE = '01/11/2012'
        ,@EndDate AS DATE = '31/10/2019';

    SELECT
        dcu.CommunityUnitLocalDescription
        ,tsrta.Division
        ,tsrta.Unit
        ,MedianWaitTime = Utilities.dbo.MEDIAN(DATEDIFF(DAY, dd2.ActualDate, dd.ActualDate))
    FROM
        dbo.Fact_Community_Activity fca
    LEFT JOIN dbo.Dim_Community_Unit dcu
        ON dcu.pkCommunityUnitKey = fca.CommunityUnitKey
    LEFT JOIN dbo.Dim_Date dd
        ON dd.pkDateKey = fca.CareContactDateKey
    LEFT JOIN dbo.Fact_Referral fr
        ON fr.pkReferralKey = fca.ReferralKey
    LEFT JOIN dbo.Dim_Date dd2
        ON dd2.pkDateKey = fr.ReferralReceivedDateKey
    OUTER APPLY
    (
        SELECT TOP(1)
            Division
            ,Unit
        FROM
            Reporting.dbo.tblSRRevisedTeamActivity
        WHERE
            Unit = dcu.CommunityUnitLocalDescription
    ) tsrta
    WHERE
        dd.ActualDate BETWEEN @StartDate AND @EndDate
        AND fca.IsInitialConsultation = 1
    GROUP BY
        dcu.CommunityUnitLocalDescription
        ,tsrta.Division
        ,tsrta.Unit;

    Overall I'm quite impressed with SQL-CLR performance and wanted to experiment with the aggregate functions, which I'd not had time to look into before.  By the way, I ran this on a SQL Server 2016 Developer Edition virtual machine with 64 GB of RAM and four cores.
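
    For a SQL-only baseline to compare against, a sketch using the built-in PERCENTILE_CONT (available from SQL Server 2012 onwards) might look like the following; the table and column names are taken from the query above, and DISTINCT is needed because PERCENTILE_CONT is a window function rather than a true aggregate:

    -- Comparison sketch only; assumes the same tables and filters as the query above
    SELECT DISTINCT
        dcu.CommunityUnitLocalDescription
        ,MedianWaitTime = PERCENTILE_CONT(0.5)
            WITHIN GROUP (ORDER BY DATEDIFF(DAY, dd2.ActualDate, dd.ActualDate))
            OVER (PARTITION BY dcu.CommunityUnitLocalDescription)
    FROM
        dbo.Fact_Community_Activity fca
    LEFT JOIN dbo.Dim_Community_Unit dcu
        ON dcu.pkCommunityUnitKey = fca.CommunityUnitKey
    LEFT JOIN dbo.Dim_Date dd
        ON dd.pkDateKey = fca.CareContactDateKey
    LEFT JOIN dbo.Fact_Referral fr
        ON fr.pkReferralKey = fca.ReferralKey
    LEFT JOIN dbo.Dim_Date dd2
        ON dd2.pkDateKey = fr.ReferralReceivedDateKey
    WHERE
        dd.ActualDate BETWEEN @StartDate AND @EndDate
        AND fca.IsInitialConsultation = 1;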

    Steve

  • Awesome.  Thanks for the feedback, Steve.  I really appreciate it.

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • >>I should like to see Microsoft consider adding more in-built statistical functions given the way that the BI world is heading. I should also like to see some focusing on integration with external stats packages such as SPSS and SAS. <<

    I'm inclined to disagree with this. I would love to see SPSS, SAS and other stat packages be able to have embedded SQL, however. Statistical packages have to be able to handle some really weird floating-point math and other corrections under the covers. The whole idea of SQL was to manipulate data, not to do fancy calculations. You either have to put up with bad statistics from a product designed for data or incredibly poor performance from a computational product that's constantly adjusting data. There is a reason that we have a saw and a screwdriver in carpentry and we don't try to saw wood with a screwdriver or drive screws with a saw 🙂

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • jcelko212 32090 wrote:

    >>I should like to see Microsoft consider adding more in-built statistical functions given the way that the BI world is heading. I should also like to see some focusing on integration with external stats packages such as SPSS and SAS. <<

    I'm inclined to disagree with this. I would love to see SPSS, SAS and other stat packages be able to have embedded SQL, however. Statistical packages have to be able to handle some really weird floating-point math and other corrections under the covers. The whole idea of SQL was to manipulate data, not to do fancy calculations. You either have to put up with bad statistics from a product designed for data or incredibly poor performance from a computational product that's constantly adjusting data. There is a reason that we have a saw and a screwdriver in carpentry and we don't try to saw wood with a screwdriver or drive screws with a saw 🙂

    Heh... I dunno Joe... I have a power screwdriver attachment on my power saw and a Dozuki blade attachment on my replaceable head screwdriver. 😀

    Shifting gears a bit, I mostly agree with not fettering T-SQL with the types of things you speak of.  There is some merit, though, in being able to do some of those things FROM T-SQL.   That's why they've made it possible to interface "R" and Python from T-SQL.

    Now, if we could just get them to create a Decimal/Numeric datatype that works as well as Granny's 4 function calculator without it occasionally automatically rounding to 6 decimal places. 😀
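
    For reference, a quick illustration of the rounding being described here; the literal values are just an example, but the behaviour is the documented DECIMAL division rule, where a result that would overflow 38 digits of precision has its scale trimmed down to a minimum of 6:

    -- The declared scales push the result's precision past 38,
    -- so SQL Server trims the result scale down to its minimum of 6.
    DECLARE @a DECIMAL(38, 18) = 1
           ,@b DECIMAL(38, 18) = 3;

    SELECT OneThird = @a / @b;   -- 0.333333 rather than 0.333333333333333333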

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • stephen.lightfoot wrote:

    Overall I'm quite impressed with SQLCLR performance and wanted to experiment with the aggregate functions which I'd not had time to look into before.

    Hi Steve. If you want more examples of SQLCLR functionality to test / play with, there are plenty to choose from in the Free version of SQL#, a SQLCLR library that I wrote, including a few User-Defined Aggregates (one of which is Median 😺):

     

    https://SQLsharp.com/

     

    Take care,

    Solomon...

     

    SQL# - https://SQLsharp.com/ (SQLCLR library of over 340 Functions and Procedures)
    Sql Quantum Lift - https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap - https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations • Module Signing • SQLCLR

  • stephen.lightfoot wrote:

    Hi Jeff,

    Hi took a four column data set containing waiting times measurements with 1,006,934 rows and then used the median function to calculate the median wait.  The query execution took 48 seconds and returned 47 rows.  I haven't had time to compare with SQL only methods but think this is a pretty reasonable response time and the code looks really simple (please see below).

    Overall I'm quite impressed with SQL-CLR performance and wanted to experiment with the aggregate functions which I'd not had time to look into before.  By the way, I ran this on a SQL Server 2016 Developer edition virtual machine with 64Gb RAM and four cores.

    Steve

    Hi Steve

    Looking at your code, you have a LEFT JOIN whose right-hand table is then filtered in the WHERE clause.  That effectively turns it into an INNER JOIN.

    It is my understanding that logical query processing can give better performance with an INNER JOIN and the filter moved into the join condition.  I would be curious to see if the following makes a difference.

    Replace

    FROM dbo.Fact_Community_Activity  fca
    LEFT JOIN dbo.Dim_Date dd
    ON dd.pkDateKey = fca.CareContactDateKey
    WHERE dd.ActualDate BETWEEN @StartDate AND @EndDate
    AND fca.IsInitialConsultation = 1

    with

    FROM dbo.Fact_Community_Activity  fca
    INNER JOIN dbo.Dim_Date dd -- Add the INNER JOIN immediately after the FROM to filter out sooner
    ON dd.pkDateKey = fca.CareContactDateKey
    AND dd.ActualDate BETWEEN @StartDate AND @EndDate
    AND fca.IsInitialConsultation = 1

     

    Also, the following left join appears to serve no purpose other than to give SQL some extra work, so I would remove it:

    LEFT JOIN dbo.Fact_Referral fr
    ON fr.pkReferralKey = fca.ReferralKey

     

     

  • Hi Des,

    Sorry, I got the left join to Fact_Referral in the wrong place in the code I posted.  I've changed the order in my post.

    I tend to use left joins a lot as I find they work faster for me.  I'm no DBA, so I don't fully understand the reasons behind this.  In the case of this query I was curious, so I tried your INNER JOIN and got exactly the same results and the same query plan as the original code.  Not sure if this is always the case though.  I would be interested to do more research into this topic, as quite often I have to try to speed up code written by others.

    Solomon - thanks for the link.  I'll definitely download this and have a look.  Could save me a lot of time in future!

    Steve

  • SAS makes much of its ability to push models down to Teradata.  From what I have seen of SAS you are best off doing as much as possible in the DB to get the performance you will need.

    In terms of "should a DB platform have heavyweight statistical functions", Vertica is a column store that has a number of stats and ML functions.  I think we are in the age where we take the code to the data and not the data to the code.  It is probably easier to make a DB engine support stats than to make a stats package play nicely with multi-concurrent data access.
