Speeding up updates that use STDEV (standard deviation)

  • I am adding a column to a table in my database that tracks the rolling standard deviation over a specified period of time.

    So far, I have been able to implement this, but it is really slow! I'm hoping someone might have a hint as to how I might be able to speed this procedure up a bit...

    Or maybe I'm heading in the complete wrong direction and SQL is not the right tool to be using for this. I know it's not meant to be a statistics workhorse, but I didn't think STDEV was really that intense of a statistical computation...

    Here's information on the tables and the code I'm using so far - this is for a rolling stdev calculation over the 30 days prior.

    At the start of the procedure, myTable contains rows like:

    IDNumber INT

    Date DATETIME

    OHdiff FLOAT

    OLdiff FLOAT

    StDev_OH_30Day FLOAT

    StDev_OL_30Day FLOAT

    The StDev columns are NULL at the beginning, these are what I will be filling in with the procedure below:

    NOTE: @lastUpdate, @updateThrough, @minIDNumber and @maxIDNumber are set by another part of my query, and are just making sure that I'm not duplicating work that's already been done...

    UPDATE a

    SET StDev_OH_30Day=(SELECT STDEV(b.OHdiff)

    FROM myTable b

    WHERE b.IDNumber= a.IDNumber

    AND b.MarketDate < a.MarketDate

    AND b.MarketDate > a.MarketDate-31 )

    , StDev_OL_30Day=(SELECT STDEV(b.OLdiff)

    FROM myTable b

    WHERE b.IDNumber = a.IDNumber

    AND b.MarketDate < a.MarketDate

    AND b.MarketDate > a.MarketDate-31 )

    FROM myTable a

    WHERE

    a.MarketDate BETWEEN @lastUpdate+.1 AND @updateThrough

    AND a.IDNumber BETWEEN @minIDNumber AND @maxIDNumber

    It takes about 10 minutes to run this query on a little less than 90K rows. Increasing the time period for all the stdev calcs from 30 days to 90 days increases the processing time to 15 minutes.

    This is meant to be part of a sequence of other updates, so the snappier I can make it, the better =)

    Any ideas are always appreciated - Thanks!

  • Might be a cool time to try something built in the SQL CLR to see if you can get any perf gains.

  • Sorry, but I'm not sure I really understand what you mean. I'm still really new to SQL.

    Do you mean defining a user-defined function with C or some other language like in the examples on this page?http://msdn.microsoft.com/en-us/library/w2kae45k(VS.80).aspx

  • can you show us the execution plan when you run the update? that will tell us exactly what the update is doing, and allow us to offer suggestions like indexes on the date column, etc that might speed up the update

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Exactly. And it doesn't need to be just limited to one clr function...just an idea.

  • I would love suggestions on how to improve my indexing!

    Right now I have 3 indexes on my table:

    1) A clustered PK index on a RowID column

    2) A non-clustered index on Date, then IDNumber which includes the StDev cols (and a few other cols not involved in this update)

    3) A non-clustered index on IDNumber then Date which includes the StDev cols (and a few other cols not involved in this update)

    Was that what you were looking for?

    Thanks!

  • I'm not sure why you have two separate subqueries to calculate the two standard deviations. It looks to me like it could be one.

    I would also change the whole thing to a CTE. It should be much faster that way.

  • Sorry, but I'm not sure I follow... I'm learning a lot just from trying to understand what everyone here is suggesting.

    Would you mind explaining a bit further how I might be able to put the two separate subqueries into one as part of an update statement?

    Thanks!

  • In trying to combine the two subqueries into one, I got this far, but have encountered the error messages:

    The multi-part identifier "a.IDNumber" could not be bound.

    The multi-part identifier "a.MarketDate" could not be bound.

    UPDATE a

    SET StDev_OH_30Day=T.StDev_OHdiff

    , StDev_OL_30Day=T.StDev_OLdiff

    FROM myTable,

    (SELECT STDEV(b.OHdiff) AS StDev_OHdiff

    , STDEV(b.OLdiff) AS StDev_OLdiff

    FROM myTable b

    WHERE b.IDNumber = a.IDNumber

    AND b.MarketDate < a.MarketDate

    AND b.MarketDate > a.MarketDate-31 ) AS T

    WHERE

    a.MarketDate BETWEEN @lastUpdate+.1 AND @updateThrough

    AND a.IDNumber BETWEEN @minIDNumber AND @maxIDNumber

    If anyone's got any ideas or suggestions, I would really appreciate it.

    Thanks!

  • Here's a sample of it as a CTE - note the ; before it. You will need that if this is part of a procedure that does other things.

    ;

    WITH get_stddev AS

    ( SELECT a.idnumber ,

    a.marketdate ,

    STDEV(b.OHdiff) stddev_oh ,

    STDEV(b.OLdiff) stddev_ol

    FROM myTable b

    JOIN myTable a

    ON

    (

    b.IDNumber = a.IDNumber

    AND b.MarketDate < a.MarketDate

    AND b.MarketDate > a.MarketDate-31

    )

    WHERE a.MarketDate BETWEEN @lastUpdate + .1 AND @updateThrough

    AND a.IDNumber BETWEEN @minIDNumber AND @maxIDNumber

    GROUP BY a.idnumber,

    a.marketdate

    ) ----- end cte

    UPDATE c

    SET StDev_OH_30Day = stddev_oh ,

    StDev_OL_30Day = stddev_ol

    FROM mytable c

    JOIN get_stddev gs

    ON

    (

    gs.IDNumber = c.IDNumber

    AND c.marketdate = gs.marketdate )

    I urge you to test this thoroughly. I wasn't completely sure of your data set-up. It's possible this needs some changes for your tables.

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply