Calculating metric distributions

  • Hi there,

    Does anyone have any useful code that I can use to show distributions of data ? Here's some test rows for 1 metric.

    1

    1

    2

    2

    2

    4

    5

    6

    7

    8

    8

    9

    10

    What I want to do is to break the rows into deciles (so, 10 segments) and count the number of rows in each decile. The range of the deciles needs to be calculated by finding the min and max values and then dividing the difference by 10 (almost, I think).

    So, in the example, Decile 2 would equal 3, as there are 3 rows with a value of 2, and the decile range is 10 segments of 1.

    I'll be doing some customer profiling work, and thought this would make it easier !

    Regards, Greg

  • Greg

    Here is a solution based on your criteria, which is to simply use the minimum and maximum values of existing data as the endpoints of your distribution, then divide by 10. In essence, you are creating an interval width. Your situation might require this, but I think you might be well served by setting up some static frequency parameters. For example, with your data, the max(10) - min(1) / 10 = .9. For your data, it does not make sense to have the interval for the first decile to be 1 to 1.9, since there are no decimal values in your data. It might make sense to use the min and max possible values instead, then divide that value by 10. And maybe you are already doing this. Anyhow, see below.

    IF OBJECT_ID('TempDB..#data','u') IS NOT NULL

    DROP TABLE #data

    GO

    CREATE TABLE #data

    (

    Datum INT

    )

    GO

    INSERT INTO #data

    SELECT 1 UNION ALL

    SELECT 1 UNION ALL

    SELECT 2 UNION ALL

    SELECT 2 UNION ALL

    SELECT 2 UNION ALL

    SELECT 4 UNION ALL

    SELECT 5 UNION ALL

    SELECT 6 UNION ALL

    SELECT 7 UNION ALL

    SELECT 8 UNION ALL

    SELECT 8 UNION ALL

    SELECT 9 UNION ALL

    SELECT 10

    GO

    DECLARE @min-2 INT, @max-2 INT, @width FLOAT

    SELECT @min-2 = (SELECT MIN(datum) FROM #data)

    SELECT @max-2 = (SELECT MAX(datum) FROM #data)

    SELECT @width = (@max-@min)/(10*1.0)

    SELECT

    Datum,

    Frequency = COUNT(datum),

    Decile = CASE WHEN datum <= (@min + (@width * 1)) THEN '1st'

    WHEN datum > (@min + (@width * 1)) AND datum <= (@min + (@width * 2)) THEN '2nd'

    WHEN datum > (@min + (@width * 2)) AND datum <= (@min + (@width * 3)) THEN '3rd'

    WHEN datum > (@min + (@width * 3)) AND datum <= (@min + (@width * 4)) THEN '4th'

    WHEN datum > (@min + (@width * 4)) AND datum <= (@min + (@width * 5)) THEN '5th'

    WHEN datum > (@min + (@width * 5)) AND datum <= (@min + (@width * 6)) THEN '6th'

    WHEN datum > (@min + (@width * 6)) AND datum <= (@min + (@width * 7)) THEN '7th'

    WHEN datum > (@min + (@width * 7)) AND datum <= (@min + (@width * 8)) THEN '8th'

    WHEN datum > (@min + (@width * 8)) AND datum <= (@min + (@width * 9)) THEN '9th'

    WHEN datum > (@min + (@width * 9)) AND datum <= @max-2 THEN '10th'

    ELSE NULL

    END

    FROM #data

    GROUP BY datum

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Greg

    On second glance, I think I got it wrong the first time. I think this is what you want.

    DECLARE @min-2 INT, @max-2 INT, @width FLOAT

    SELECT @min-2 = (SELECT MIN(datum) FROM #data)

    SELECT @max-2 = (SELECT MAX(datum) FROM #data)

    SELECT @width = (@max-@min)/(10*1.0)

    SELECT

    Decile,

    Frequency = COUNT(decile)

    FROM

    (

    SELECT

    Datum,

    Decile = CASE WHEN datum <= (@min + (@width * 1)) THEN 1

    WHEN datum > (@min + (@width * 1)) AND datum <= (@min + (@width * 2)) THEN 2

    WHEN datum > (@min + (@width * 2)) AND datum <= (@min + (@width * 3)) THEN 3

    WHEN datum > (@min + (@width * 3)) AND datum <= (@min + (@width * 4)) THEN 4

    WHEN datum > (@min + (@width * 4)) AND datum <= (@min + (@width * 5)) THEN 5

    WHEN datum > (@min + (@width * 5)) AND datum <= (@min + (@width * 6)) THEN 6

    WHEN datum > (@min + (@width * 6)) AND datum <= (@min + (@width * 7)) THEN 7

    WHEN datum > (@min + (@width * 7)) AND datum <= (@min + (@width * 8)) THEN 8

    WHEN datum > (@min + (@width * 8)) AND datum <= (@min + (@width * 9)) THEN 9

    WHEN datum > (@min + (@width * 9)) AND datum <= @max-2 THEN 10

    ELSE NULL

    END

    FROM #data

    ) t1

    GROUP BY decile

    Greg
    _________________________________________________________________________________________________
    The glass is at one half capacity: nothing more, nothing less.

  • Thanks Greg,

    That works really, well and I'll just add in the decile range valeus to the output now - great !

    Greg

  • Do you care to see that decile 3 has 0 items?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Not really, it wouldn't be a problem when plotting as a visualisation.

  • greg.bull (7/6/2011)


    Not really, it wouldn't be a problem when plotting as a visualisation.

    I only asked because if you are not interested in pulling a complete set of deciles from SQL Server then I wonder why you're doing this in T-SQL in the first place. I don't know the whole story though. If you are interested in presenting a 0 for the 3rd decile because you are simply dropping the results onto a web page in tabular format I could see using T-SQL for this, in which case you would want that 3 to appear in your results table with a 0 count. However, since you're plotting this (using some type of visualization tool?), is there not a built-in function for calculating deciles that could do this for you outside the database? Just wondering. What are you using to plot/display the results?

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply