Histogram chart in sql

  • Hi,

    Im hoping someone is able to check what Ive created for a histogram chart and can provide feedback, suggestions, improvements.

    (Im pretty sure there will be a few).

    I decided to write my own for a business requirement and also after reading this great article from Dwain.

    http://www.sqlservercentral.com/articles/Excel/91179/

    As everyone keeps saying if you don't understand the code then don't use it (I didn't understand enough of Dwains code to be comfortable in using it in prod).

    Recursive cte's are something Ive never required and really don't have the know how in them just yet.

    Ive borrowed (ok stolen - thanks Dwain!) Dwains initial setup code and removed all the items except those relating to float

    DECLARE @NumberOfRows INT

    ,@FStartValue FLOAT

    ,@FEndValue FLOAT

    ,@FRange FLOAT

    SELECT @NumberOfRows = 10000

    -- -- Random FLOATs on the closed interval {275, 850}

    ,@FStartValue = 275

    ,@FEndValue = 850

    ,@FRange = @FEndValue - @FStartValue

    DECLARE @MyTestTable TABLE (

    RandomFloat FLOAT

    )

    INSERT INTO @MyTestTable

    SELECT TOP (@NumberOfRows)

    RandomFloat = RAND(CHECKSUM(NEWID())) * @FRange + @FStartValue

    FROM sys.all_columns a1

    CROSS JOIN sys.all_columns a2

    So here is my code - I needed the outer join to the Tally table so I could get a guaranteed unbroken (no gaps) list of bins.

    Im hoping the code is self explanatory but if needed (asked for) I will add in some comments.

    declare @bins as int = 40

    declare @interval as float

    declare @min-2 as float

    declare @max-2 as float

    select @interval = (max(RandomFloat) - min(RandomFloat))/@bins, @min-2 = min(RandomFloat), @max-2 = max(RandomFloat)

    from @MyTestTable

    ;with cteHistogram (N, StartBin, EndBin, TotalCount)

    as

    (

    select

    N, @min-2+(@interval*(N-1)) StartBin, @min-2+(@interval*N) EndBin, coalesce(count(RandomFloat), 0) TotalCount

    from

    @MyTestTable cross join Tally

    where

    RandomFloat >= @min-2+(@interval*(N-1)) and RandomFloat < @min-2+(@interval*N) and n <= @bins-1

    group by

    N

    union all

    select

    N, @min-2+(@interval*(N-1)) StartBin, @min-2+(@interval*N) EndBin, coalesce(count(RandomFloat), 0) TotalCount

    from

    @MyTestTable cross join Tally

    where

    RandomFloat >= @min-2+(@interval*(N-1)) and RandomFloat <= @max-2 and n = @bins

    group by

    N

    )

    select

    Tally.N, @min-2+(@interval*(Tally.N-1)) StartBin, case when Tally.N = @bins then @max-2 else @min-2+(@interval*Tally.N) end EndBin, coalesce(TotalCount, 0) TotalCount

    from

    Tally left outer join cteHistogram on Tally.N = cteHistogram.N

    where

    Tally.N <= @bins

    order by

    Tally.N

    ps Im a beginner in Tally tables but have recently started using them quite a bit and absolutely love them.

    Im also assuming everyone has a Tally table called "Tally" or can make one easily enough.

  • It looks ok to me, apart from UNION picking up the 40th bin. When this is done, the query is quite efficient:

    ;with cteHistogram (N, StartBin, EndBin, TotalCount)

    as

    (

    select

    N,

    @min-2+(@interval*(N-1)) StartBin,

    @min-2+(@interval*N) EndBin, coalesce(count(RandomFloat), 0) TotalCount

    from @MyTestTable

    cross join Tally

    where RandomFloat >= @min-2+(@interval*(N-1))

    and ((RandomFloat < @min-2+(@interval*N) and n <= @bins-1)

    OR (RandomFloat <= @max-2 and n = @bins))

    group by N

    )

    select

    Tally.N, @min-2+(@interval*(Tally.N-1)) StartBin, case when Tally.N = @bins then @max-2 else @min-2+(@interval*Tally.N) end EndBin, coalesce(TotalCount, 0) TotalCount

    from

    Tally left outer join cteHistogram on Tally.N = cteHistogram.N

    where

    Tally.N <= @bins

    order by

    Tally.N

    Note that min and max are the calculated range min and max values, rather than RandomFloat values. If you want RandomFloat values, try this:

    -------------------------------------------------------------------------

    -- calculate the range start and end for each bin

    ;WITH Bins AS (

    SELECT

    t.n,

    RangeStart = minval + (interval * (t.n-1)),

    RangeEnd = CASE WHEN t.n = @bins THEN maxval+1 ELSE minval + (interval * t.n) END

    FROM (

    SELECT

    interval = (MAX(RandomFloat) - MIN(RandomFloat)) / @bins,

    minval = MIN(RandomFloat),

    maxval = MAX(RandomFloat)

    FROM @MyTestTable

    ) d

    CROSS JOIN Tally t

    WHERE t.n <= @bins

    )

    -- Summarise the bin contents. Min and max are actual bins.

    SELECT

    b.N,

    StartBin = MIN(RandomFloat),

    EndBin = MAX(RandomFloat),

    RangeStart = MIN(b.RangeStart),

    RangeEnd = MIN(b.RangeEnd),

    TotalCount = COUNT(*)

    FROM Bins b

    INNER JOIN @MyTestTable t

    ON t.RandomFloat >= b.RangeStart AND t.RandomFloat < b.RangeEnd

    GROUP BY N

    Now, if your requirement was to split the population into 40 equal groups and summarise those groups, you could use NTILE, like this:

    -- split the set into (@bins) roughly equal-sized populations

    SELECT

    N,

    StartBin = MIN(RandomFloat),

    EndBin = MAX(RandomFloat),

    TotalCount = COUNT(*)

    FROM (

    SELECT

    RandomFloat,

    N = NTILE(@bins) OVER(ORDER BY (RandomFloat))

    FROM @MyTestTable

    ) d

    GROUP BY N

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • matak (1/17/2013)


    I decided to write my own for a business requirement and also after reading this great article from Dwain.

    http://www.sqlservercentral.com/articles/Excel/91179/

    As everyone keeps saying if you don't understand the code then don't use it (I didn't understand enough of Dwains code to be comfortable in using it in prod).

    Recursive cte's are something Ive never required and really don't have the know how in them just yet.

    Ive borrowed (ok stolen - thanks Dwain!) Dwains initial setup code and removed all the items except those relating to float

    You're welcome! Looks like ChrisM has got ya covered.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Thanks Chris,

    Appreciate the feedback.

    Looks like i was close so thats always a bonus.

    Ive never used ntile before and it certainly looks interesting (although its not suitable for this requirement).

    Initially i think i will use your first suggestion and see how that goes.

    Thanks (and to dwain for the setup code - will come in handy)

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply