Automating linear regressions

  • Good morning,

    I hope I'm on the right place to post my question ...

    I work on a project for research where I need to perform a linear regression on some big data and then test the results on a database according to the factors given by the linear regression. This would not be cumbersome if I had to do it 2-3 times only, but as I need to do it a lot of times, I'm trying to figure out how I could automate this.

    Currently, I determine the data as input to the LR and the output; then I copy it to Excel and launch the linear regression; finally I modify the resulting factors in a stored procedure that will check the database and give me a final result that I can compare with previous ones. I hope this makes sense.

    What I'd like to do is automatically:
    1. select the input columns and the output one
    2. call "something" to perform the regression tests
    3. retrieve the resulting factors in a SP
    4. run this latter and insert the result in a DB

    Whilst the steps 1 and 4 are straightforward, I very much look forward to reading your suggestions for step 2 and 3.
    (I currently use SQL Server 2012-Express, but could move to SQL Server 2016 standard edition if needed)

    Thanks a lot!

  • It has been a while when I created the script below together with my co-worker. We used it to forecast growth of disk usage of which the measured values were stored in a table. Together with a tally table and a table to hold calculated values, we created a resultset with calculated forecasted values for the upcoming six months (with measured values of the last 12 months).

    Try to modify the code to fit to your situation. Hope it helps...
    -- create supporting tables
    declare @datum datetime
    declare @servername sysname
    set @servername = 'hostserver'
    set @datum = convert(nvarchar(8), getdate(), 120) + '01'    -- first day of current month

    IF OBJECT_ID('Tempdb..#Calculate') IS NOT NULL
        DROP TABLE #Calculate;
    -- Create table to hold calculated values
    CREATE TABLE #Calculate
        (Servername sysname
        , Instance sysname
        , drive char(1)
        , mountpoint varchar(50)
        , Capacity_Gb decimal (18,2)
        , [-12 mnd] decimal (18,2)
        , [-11 mnd] decimal (18,2)
        , [-10 mnd] decimal (18,2)
        , [-9 mnd] decimal (18,2)
        , [-8 mnd] decimal (18,2)
        , [-7 mnd] decimal (18,2)
        , [-6 mnd] decimal (18,2)
        , [-5 mnd] decimal (18,2)
        , [-4 mnd] decimal (18,2)
        , [-3 mnd] decimal (18,2)
        , [-2 mnd] decimal (18,2)
        , [-1 mnd] decimal (18,2)
        , [0 mnd] decimal (18,2)
        , avg_y decimal (18,2)
        , a decimal (18,2)
        , b decimal (18,2)
        , [+1 mnd] decimal (18,2)
        , [+2 mnd] decimal (18,2)
        , [+3 mnd] decimal (18,2)
        , [+4 mnd] decimal (18,2)
        , [+5 mnd] decimal (18,2)
        , [+6 mnd] decimal (18,2)
        )

    IF OBJECT_ID('Tempdb..#Tally_Date') IS NOT NULL
        DROP TABLE #Tally_Date;
    -- Create tally table
    CREATE TABLE #Tally_Date (Datum Datetime)
    -- fill it with required dates
    INSERT INTO #Tally_Date
        SELECT dateadd(month, -12, @datum)
        UNION ALL
        SELECT dateadd(month, -11, @datum)
        UNION ALL
        SELECT dateadd(month, -10, @datum)
        UNION ALL
        SELECT dateadd(month, -9, @datum)
        UNION ALL
        SELECT dateadd(month, -8, @datum)
        UNION ALL
        SELECT dateadd(month, -7, @datum)
        UNION ALL
        SELECT dateadd(month, -6, @datum)
        UNION ALL
        SELECT dateadd(month, -5, @datum)
        UNION ALL
        SELECT dateadd(month, -4, @datum)
        UNION ALL
        SELECT dateadd(month, -3, @datum)
        UNION ALL
        SELECT dateadd(month, -2, @datum)
        UNION ALL
        SELECT dateadd(month, -1, @datum)
        UNION ALL
        SELECT @datum
        UNION ALL
        SELECT dateadd(month, 1, @datum)
        UNION ALL
        SELECT dateadd(month, 2, @datum)
        UNION ALL
        SELECT dateadd(month, 3, @datum)
        UNION ALL
        SELECT dateadd(month, 4, @datum)
        UNION ALL
        SELECT dateadd(month, 5, @datum)
        UNION ALL
        SELECT dateadd(month, 6, @datum)

    insert into #Calculate
        (Servername
        , Instance
        , Drive
        , Mountpoint
        , Capacity_Gb
        , [-12 mnd]
        , [-11 mnd]
        , [-10 mnd]
        , [-9 mnd]
        , [-8 mnd]
        , [-7 mnd]
        , [-6 mnd]
        , [-5 mnd]
        , [-4 mnd]
        , [-3 mnd]
        , [-2 mnd]
        , [-1 mnd]
        , [0 mnd]
        )
    select
        Servername
        , Instance
        , Drive
        , Mountpoint
        , Capacity_Gb
        , COALESCE([-12], 0) as '-12 mnd'
        , COALESCE([-11], 0) as '-11 mnd'
        , COALESCE([-10], 0) as '-10 mnd'
        , COALESCE([-9], 0) as '-9 mnd'
        , COALESCE([-8], 0) as '-8 mnd'
        , COALESCE([-7], 0) as '-7 mnd'
        , COALESCE([-6], 0) as '-6 mnd'
        , COALESCE([-5], 0) as '-5 mnd'
        , COALESCE([-4], 0) as '-4 mnd'
        , COALESCE([-3], 0) as '-3 mnd'
        , COALESCE([-2], 0) as '-2 mnd'
        , COALESCE([-1], 0) as '-1 mnd'
        , COALESCE([0], 0) as '0 mnd'
    from
        (select
            Servername
            , Instance
            , Drive
            , MountPoint
            , datediff(mm, @datum, #Tally_Date.Datum) as Maand
            , MAX(Capacity) AS Capacity_Gb
            , MAX(capacity-free) AS MaximumUsedSpace
        from #Tally_Date
            left outer join dbo.mon_server_disks
                -- join on time range
                on #Tally_Date.Datum >= startdatum and #Tally_Date.Datum < COALESCE(einddatum, GETDATE())
        where
            drive <> 'q'
            and servername like @servername
            and instance IS NOT NULL
        group by
            servername
            , instance
            , Drive
            , MountPoint
            , #Tally_Date.Datum
            , startdatum
        )sel_alias
    PIVOT
        (MAX(MaximumUsedSpace)
        FOR Maand
            IN ([-12]
                , [-11]
                , [-10]
                , [-9]
                , [-8]
                , [-7]
                , [-6]
                , [-5]
                , [-4]
                , [-3]
                , [-2]
                , [-1]
                , [0]
                )
        )
        as piv_alias
    order by
            servername
            , Instance
            , Drive
            , MountPoint

    --x = 1,2,3,4,5,6,7,8,9,10,11,12,13
    --avg_x = (1+2+3+4+5+6+7+8+9+10+11+12+13) / 13 = 7
    --divider for a = sum( (x - avg_x)2 ) = 182

    update #Calculate
        set avg_y = ([-12 mnd]+[-11 mnd]+[-10 mnd]+[-9 mnd]+[-8 mnd]+[-7 mnd]+[-6 mnd]+[-5 mnd]+[-4 mnd]+[-3 mnd]+[-2 mnd]+[-1 mnd]+[0 mnd])/13
    update #Calculate
        set a=((-6*([-12 mnd]-avg_y))+(-5*([-11 mnd]-avg_y))+(-4*([-10 mnd]-avg_y))+(-3*([-9 mnd]-avg_y))+(-2*([-8 mnd]-avg_y))+(-1*([-7 mnd]-avg_y))+(0*([-6 mnd]-avg_y))+(1*([-5 mnd]-avg_y))+(2*([-4 mnd]-avg_y))+(3*([-3 mnd]-avg_y))+(4*([-2 mnd]-avg_y))+(5*([-1 mnd]-avg_y))+(6*([0 mnd]-avg_y)))/182
    update #Calculate
        set b=avg_y-a*7
    update #Calculate
        set [+1 mnd]=a*14+b
        , [+2 mnd]=a*15+b
        , [+3 mnd]=a*16+b
        , [+4 mnd]=a*17+b
        , [+5 mnd]=a*18+b
        , [+6 mnd]=a*19+b

    select
        Servername
        , Instance
        , drive
        , mountpoint
        , Capacity_Gb
        , [-12 mnd]
        , [-11 mnd]
        , [-10 mnd]
        , [-9 mnd]
        , [-8 mnd]
        , [-7 mnd]
        , [-6 mnd]
        , [-5 mnd]
        , [-4 mnd]
        , [-3 mnd]
        , [-2 mnd]
        , [-1 mnd]
        , [0 mnd]
        , [+1 mnd]
        , [+2 mnd]
        , [+3 mnd]
        , [+4 mnd]
        , [+5 mnd]
        , [+6 mnd]
        , case
            when [+1 mnd]>Capacity_Gb then 'extreme'
            when [+2 mnd]>Capacity_Gb then 'critical'
            when [+3 mnd]>Capacity_Gb then 'warning'
            when [+4 mnd]>Capacity_Gb then 'order'
            when [+5 mnd]>Capacity_Gb then 'monitor'
            when [+6 mnd]>Capacity_Gb then 'none'
            else ''
        end as Alert
    from #Calculate

    -- clean-up
    IF OBJECT_ID('Tempdb..#Tally_Date') IS NOT NULL
        DROP TABLE #Tally_Date
    IF OBJECT_ID('Tempdb..#Calculate') IS NOT NULL
        DROP TABLE #Calculate;

    ** Don't mistake the ‘stupidity of the crowd’ for the ‘wisdom of the group’! **
  • Vielen Dank HanShi for sharing your script.
    At first sight I can't see any matrix inversion in it (that's in fact the function I need most in matrix calculations 😉 ), but I'll look into it ...

    If anyone else has a got a clue to perform (efficient) matrix calculations in SQL server, I'm still buyer.

    Have a nice day ...
    🙂

  • We use R to do similar processing. As we don't have SQL 2016, we use the standalone version of R called by a SSIS package which reads and writes to the SQL Server database.

    In SQL 2016 R services are available straight from the database. 
    This seems to be a good starting place
    http://www.sqlservercentral.com/articles/RevolutionR/134077/

  • thanks alastair.beveridge ... I'll have a look at this ... it will not be as easy to implement as I initially thought of, but the power behind can be very helpful and it might be worth spending some extra time for training me into R ... 🙂

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply