User Defined Function - Passing and Processing a Table

  • Jeff Moden (4/28/2008): It really would help if the OP would post exactly what output is expected from the input in the original post, though.

    I agree with all your points, I was only trying to explain why people were still answering a question that appeared to have been answered.

    The question asked was definitely answered.

    This kind of goes with Steve's editorial today. I can't even guess how many times I've been asked how to do something, only to find out after the fact that my "solution" caused more problems: the person asking didn't understand their own problem, but succeeded in phrasing the question so it would get the answer "they" wanted, because other attempts at asking the right question had gotten them an answer they didn't want. Easy software just makes making mistakes easier.

    Obviously as an old Oracle DBA, I am used to things not being easy 🙂

  • Ok - in short, here's the problem.

    I have various financial factors (P/E, P/B) - performance by country and by sector - in a SQL table, on a daily basis. I need to produce the mean, stdev, median, 95th percentile, and 5th percentile for each. I have managed the mean and stdev in the pivot table (using built-in functions). I was hoping for an elegant solution where I pass a column to a function which gives me, for example, the median. The structure is simple.

    I pass a column (containing n values) to a function. The function sorts the values, then picks the middle ((n+1)/2)th value if n is odd, or the average of the (n/2)th and (n/2+1)th values if n is even, and returns that one value. Easy enough! The trouble is in passing the array.

    Given the structure of the pivot table query, I can only ask it (e.g. if DATA is the name of the column) to SUM(DATA) or MEDIAN(DATA). It fails even on MEDIAN(DATA/5) (I know I can do DATA/5 while pulling the data out - this is just an example).

    The reason I am using the pivot table is that the data is in a list format, i.e. stock 1, date 1, data 1; stock 1, date 2, data 2; and so on - with about a million-plus rows. The pivot table takes a lot of the hassle out of the presentation...
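    What the paragraph above asks for - sort a column's n values and return the middle one (or the mean of the two middle ones) - is easy to state outside SQL. A minimal Python sketch of that intent (the function name is made up for illustration):

    ```python
    def median_of_column(values):
        """Sort n values; return the middle one if n is odd,
        or the mean of the two middle ones if n is even."""
        s = sorted(values)
        n = len(s)
        mid = n // 2
        return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

    print(median_of_column([7, 1, 5]))     # 5
    print(median_of_column([7, 1, 5, 3]))  # 4.0
    ```

    The hard part, as the posts below show, is not the arithmetic but expressing "pass me the whole column" in set-based T-SQL.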

  • Can't do a whole lot for you if you don't post the "simple structure" of the data you're talking about... please click on, read, and provide the information from the URL in my signature.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Median is mathematically not an aggregate function per se - it's more of a windowing function (meaning it pulls out, unchanged, a specific value in a sequence). You REALLY don't want to compute that as a Transact-SQL function; it should be done as a derived table and then incorporated.

    Median looks something like this in SQL 2005:

    select groupCol1, groupCol2, orderamount
    from (select groupCol1,
                 groupCol2,
                 orderamount,
                 row_number() over (partition by groupCol1, groupCol2 order by orderamount) RN,
                 count(*) over (partition by groupCol1, groupCol2) groupcount
          from MyTable) r
    where RN = ceiling((groupcount + 1) * 0.5)

    You can still pivot if need be - but performance is going to start suffering pretty badly. Perhaps dump this to a temp table, index that, and pivot on the temp table.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Ok... just sharing... a million-row median calculation can be fast... very fast... here's a chunk of code I posted about 3 years ago... it's demo code, so you'll need to change things to suit your own tables...

    --===== If the temp table already exists, drop it
    IF OBJECT_ID('TempDB..#Steps') IS NOT NULL
       DROP TABLE #Steps

    --===== Simulate loading your real temp table with...
         -- a MILLION rows, 17576 Reps (3-character "names"), a DeltaT from 0 to 1999,
         -- and 4 "Steps" (Step 1, Step 2, etc.)
    SELECT TOP 1000000
           IDENTITY(INT,1,1) AS RowNum,
           CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
         + CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
         + CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)) AS Rep,
           CAST(RAND(CAST(NEWID() AS VARBINARY))*2000 AS INT) AS DeltaT,
           'Step '
         + CAST(CAST(RAND(CAST(NEWID() AS VARBINARY))*4 AS INT)+1 AS VARCHAR(25)) AS MileCode
      INTO #Steps
      FROM Master.dbo.SYSCOLUMNS sc1,
           Master.dbo.SYSCOLUMNS sc2

    --===== Every table, even temporary ones, should have a Primary Key...
         -- makes all the difference in the world, performance-wise, on this script.
    ALTER TABLE #Steps
      ADD PRIMARY KEY CLUSTERED (RowNum)

    --===== And, trust me, you want this index. It's a "covering" index for the
         -- ORDER BYs, the JOINs, and the data that will appear in the code that follows.
    CREATE INDEX IX_Steps ON #Steps (Rep, MileCode, DeltaT)

    --===== Return the Median for ALL Reps by name and by step.
         -- It returns all the "analysis" rows in just 32 seconds.
         -- 1 MILLION rows, 17576 Reps, and 4 steps are simulated in the table
         -- (note that because this is all random data, some Reps may be missing a step, but most have 4)
         -- for a total of about 70,200 Medians in just 32 seconds
         -- (or 1 Median analysis in only 0.0004558 seconds).
    SELECT DISTINCT
           m.Rep,
           m.MileCode,
           (
            (SELECT TOP 1 DeltaT
               FROM (SELECT TOP 50 PERCENT DeltaT
                       FROM #Steps t
                      WHERE t.Rep = m.Rep AND t.MileCode = m.MileCode
                      ORDER BY DeltaT ASC) lo
              ORDER BY DeltaT DESC)
          + (SELECT TOP 1 DeltaT
               FROM (SELECT TOP 50 PERCENT DeltaT
                       FROM #Steps t
                      WHERE t.Rep = m.Rep AND t.MileCode = m.MileCode
                      ORDER BY DeltaT DESC) hi
              ORDER BY DeltaT ASC)
           ) / 2 AS MEDIAN
      FROM #Steps m
     ORDER BY m.Rep, m.MileCode

    --Jeff Moden




  • By the way - it's probably worth pointing out that there are multiple definitions of median.

    The solution Jeff is showing is what is called the statistical median (the average of the two middle elements of an ordered set if the set's size is even, and the middle element if it is odd), and mine was the "pure math" definition from ordered sets (the middle element if the size is odd, and the (n/2 + 1)th element if it is even).

    Depending on who you talk to, they might expect one definition or the other.

    And in terms of efficiency - if you do it as a derived table, it should be quite fast either way. A scalar function would NOT be so lucky, however.
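    To make the difference between the two definitions concrete, here is a small Python sketch (hypothetical helper names; the "math" version deliberately mirrors the ceiling((n+1)*0.5) rank predicate used in the SQL above):

    ```python
    import math

    def stat_median(values):
        """Statistical median: average the two middle items when the count is even."""
        s = sorted(values)
        n = len(s)
        mid = n // 2
        return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

    def math_median(values):
        """'Pure math' median: the single (n/2 + 1)th item when the count is even,
        i.e. the 1-based rank ceiling((n + 1) * 0.5)."""
        s = sorted(values)
        n = len(s)
        return s[math.ceil((n + 1) * 0.5) - 1]

    print(stat_median([10, 20, 30, 40]))  # 25.0 - the two definitions disagree...
    print(math_median([10, 20, 30, 40]))  # 30
    print(stat_median([10, 20, 30]))      # 20   - ...but agree on odd counts
    print(math_median([10, 20, 30]))      # 20
    ```

    So for even-sized groups the two queries can legitimately return different numbers from the same data.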


  • (Jeff - psst)....

    (1000000 row(s) affected)

    =====================
    Jeff's Statistical Median Calc
    =====================
    (70266 row(s) affected)
    SQL Server Execution Times:
       CPU time = 70031 ms, elapsed time = 42039 ms.

    ========================
    Matt's Mathematical Median Calc
    ========================
    (70266 row(s) affected)
    SQL Server Execution Times:
       CPU time = 16938 ms, elapsed time = 18221 ms.

    So in case someone doesn't specify which one they want - you might get away with the "faster" one.... :)


  • One more - this is the "windowed function" version of the statistical median:

    select r.Rep, r.MileCode, (r.DeltaT + r1.DeltaT) * .5 dt
    from (select m.Rep, m.MileCode,
                 DeltaT,
                 row_number() over (partition by m.Rep, m.MileCode order by DeltaT) RN,
                 count(*) over (partition by m.Rep, m.MileCode) groupcount
          from #Steps m) r
    inner join
         (select m.Rep, m.MileCode,
                 DeltaT,
                 row_number() over (partition by m.Rep, m.MileCode order by DeltaT) RN,
                 count(*) over (partition by m.Rep, m.MileCode) groupcount
          from #Steps m) r1
       on r.Rep = r1.Rep and r.MileCode = r1.MileCode
    where r.RN = ceiling((r.groupcount + 1) * 0.5)
      and r1.RN = floor((r1.groupcount + 1) * 0.5)

    And perf:

    SQL Server Execution Times:
       CPU time = 30969 ms, elapsed time = 32954 ms.

    So - half as fast as the math Median, but twice as fast as the 2000 version of the statistical median....


  • Matt Miller (4/29/2008): So in case someone doesn't specify which one they want - you might get away with the "faster" one.... :)

    Outstanding! Heh... that's what I get for not testing both sides first.

    --Jeff Moden



  • Say... do you happen to have the full test slab for that 16 second version?

    --Jeff Moden



  • Jeff Moden (4/29/2008): Say... do you happen to have the full test slab for that 16 second version?

    Sure - I stole your test setup....

    --===== If the temp table already exists, drop it
    IF OBJECT_ID('TempDB..#Steps') IS NOT NULL
       DROP TABLE #Steps

    --===== Simulate loading your real temp table with...
         -- a MILLION rows, 17576 Reps (3-character "names"), a DeltaT from 0 to 1999,
         -- and 4 "Steps" (Step 1, Step 2, etc.)
    SELECT TOP 1000000
           IDENTITY(INT,1,1) AS RowNum,
           CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
         + CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65))
         + CHAR(STR(RAND(CAST(NEWID() AS VARBINARY))*25+65)) AS Rep,
           CAST(RAND(CAST(NEWID() AS VARBINARY))*2000 AS INT) AS DeltaT,
           'Step '
         + CAST(CAST(RAND(CAST(NEWID() AS VARBINARY))*4 AS INT)+1 AS VARCHAR(25)) AS MileCode
      INTO #Steps
      FROM Master.dbo.SYSCOLUMNS sc1,
           Master.dbo.SYSCOLUMNS sc2

    --===== Every table, even temporary ones, should have a Primary Key...
         -- makes all the difference in the world, performance-wise, on this script.
    ALTER TABLE #Steps
      ADD PRIMARY KEY CLUSTERED (RowNum)

    --===== And, trust me, you want this index. It's a "covering" index for the
         -- ORDER BYs, the JOINs, and the data that will appear in the code that follows.
    CREATE INDEX IX_Steps ON #Steps (Rep, MileCode, DeltaT)
    go

    --===== Return the Median for ALL Reps by name and by step.
         -- It returns all the "analysis" rows in just 32 seconds.
         -- 1 MILLION rows, 17576 Reps, and 4 steps are simulated in the table
         -- (note that because this is all random data, some Reps may be missing a step, but most have 4)
         -- for a total of about 70,200 Medians in just 32 seconds
         -- (or 1 Median analysis in only 0.0004558 seconds).
    Print replicate('=',50)
    print 'Jeffs Statistical Median'
    Print replicate('=',50)
    Set statistics time on
    SELECT DISTINCT
           m.Rep,
           m.MileCode,
           (
            (SELECT TOP 1 DeltaT
               FROM (SELECT TOP 50 PERCENT DeltaT
                       FROM #Steps t
                      WHERE t.Rep = m.Rep AND t.MileCode = m.MileCode
                      ORDER BY DeltaT ASC) lo
              ORDER BY DeltaT DESC)
          + (SELECT TOP 1 DeltaT
               FROM (SELECT TOP 50 PERCENT DeltaT
                       FROM #Steps t
                      WHERE t.Rep = m.Rep AND t.MileCode = m.MileCode
                      ORDER BY DeltaT DESC) hi
              ORDER BY DeltaT ASC)
           ) / 2 AS MEDIAN
      FROM #Steps m
     ORDER BY m.Rep, m.MileCode
    Set statistics time off

    Print replicate('=',50)
    print 'Windowed Statistical Median'
    Print replicate('=',50)
    Set statistics time on
    select r.Rep, r.MileCode, (r.DeltaT + r1.DeltaT) * .5 dt
    from (select m.Rep, m.MileCode,
                 DeltaT,
                 row_number() over (partition by m.Rep, m.MileCode order by DeltaT) RN,
                 count(*) over (partition by m.Rep, m.MileCode) groupcount
          from #Steps m) r
    inner join
         (select m.Rep, m.MileCode,
                 DeltaT,
                 row_number() over (partition by m.Rep, m.MileCode order by DeltaT) RN,
                 count(*) over (partition by m.Rep, m.MileCode) groupcount
          from #Steps m) r1
       on r.Rep = r1.Rep and r.MileCode = r1.MileCode
    where r.RN = ceiling((r.groupcount + 1) * 0.5)
      and r1.RN = floor((r1.groupcount + 1) * 0.5)
    Set statistics time off

    Print replicate('=',50)
    print 'Windowed Math Median'
    Print replicate('=',50)
    Set statistics time on
    select r.Rep, r.MileCode, r.DeltaT
    from (select m.Rep, m.MileCode,
                 DeltaT,
                 row_number() over (partition by m.Rep, m.MileCode order by DeltaT) RN,
                 count(*) over (partition by m.Rep, m.MileCode) groupcount
          from #Steps m) r
    where r.RN = ceiling((r.groupcount + 1) * 0.5)
    Set statistics time off

    I was being funny last night - just remember the "math" version and the "stats" version actually give you different results (based on their different definitions)....


  • I forgot that my test machine was running a full backup and a virus scan last night...Without the impediments:

    (1000000 row(s) affected)

    ==================================================
    Jeffs Statistical Median
    ==================================================
    (70271 row(s) affected)
    SQL Server Execution Times:
       CPU time = 22266 ms, elapsed time = 12282 ms.

    ==================================================
    Windowed Statistical Median
    ==================================================
    (70271 row(s) affected)
    SQL Server Execution Times:
       CPU time = 11765 ms, elapsed time = 11950 ms.

    ==================================================
    Windowed Math Median
    ==================================================
    (70271 row(s) affected)
    SQL Server Execution Times:
       CPU time = 5719 ms, elapsed time = 5727 ms.


  • Matt Miller (4/30/2008): I was being funny last night - just remember the "math" version and the "stats" version actually give you different results (based on their different definitions)....

    Not a problem... I'm just interested in how you're doing this and wanted to play a bit. Thanks, Matt.

    --Jeff Moden



  • Jeff Moden (4/30/2008): Not a problem... I'm just interested in how you're doing this and wanted to play a bit. Thanks, Matt.

    By the way - once the caffeine infusion kicked in this morning, I realized I was taking two passes to do what I can do in just one in the "windowed stat" version.

    Set statistics time off
    Print replicate('=',50)
    print 'Windowed Stat Median - method #2'
    Print replicate('=',50)
    Set statistics time on
    select r.Rep, r.MileCode, Avg(r.DeltaT * 1.0) dt -- DeltaT is INT; * 1.0 avoids integer-average truncation
    from (select m.Rep, m.MileCode,
                 DeltaT,
                 row_number() over (partition by m.Rep, m.MileCode order by DeltaT) RN,
                 count(*) over (partition by m.Rep, m.MileCode) groupcount
          from #Steps m) r
    where r.RN = ceiling((r.groupcount + 1) * 0.5)
       or r.RN = floor((r.groupcount + 1) * 0.5)
    group by r.Rep, r.MileCode
    order by r.Rep, r.MileCode
    Set statistics time off

    Damn near as fast as the math median at this point....

    SQL Server Execution Times:

    CPU time = 6266 ms, elapsed time = 6421 ms.
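    The one-pass trick above - keep only the rows whose rank is the floor or the ceiling of (n+1)/2, then average them per group - can be sketched in a few lines of Python (plain stdlib; the function name is made up for illustration):

    ```python
    import math

    def one_pass_stat_median(values):
        """Average the items at 1-based ranks floor((n+1)/2) and ceiling((n+1)/2).
        For odd n the two ranks coincide, so the middle item is averaged with itself."""
        s = sorted(values)
        n = len(s)
        lo = math.floor((n + 1) * 0.5)
        hi = math.ceil((n + 1) * 0.5)
        return (s[lo - 1] + s[hi - 1]) / 2

    print(one_pass_stat_median([3, 1, 4, 1, 5]))     # 3.0 (odd count: the middle value)
    print(one_pass_stat_median([3, 1, 4, 1, 5, 9]))  # 3.5 (even count: mean of 3 and 4)
    ```

    This is exactly why the WHERE clause needs both the ceiling and the floor predicates: on odd-sized groups they select the same row, on even-sized groups they select the two middle rows.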

