• Gail, your post about the importance of exact measuring changes one by one. You are really good in this!

    Another aspect of optimization is using what I call "the task insider info". Maybe in other articles you point it - I don't read it, so my apologize 😉

    1. The first thing I see in the original query is that we have two entities (tasks-posts), that linked to each other not only by formal ids, but by time-relation too. I mean that task should be always older than post. Am I right?

    If I, than this should be always true: "t.CreatedOn <= p.PostDate". And we can add extra constraint, that helps us a lot:

    ... and p.PostDate >= @StartDate

    if PostDate >= CreatedOn and CreatedOn >= @StartDate than PostDate >= @StartDate also.

    Obviously, we can't add expression, that links PostDate and @EndDate, because @EndDate about tread, not posts in it. tread can contain new posts.

    2. We can add index (query optimizer advice one), and this can helps us a lot - we can scan only one half of index in average.

    3. This is good, but we can achieve best results, if we can do some denormalization - add TaskCreatedOn field to posts table. So we can add index on it and fully utilize it in query. If quering db in this way important - denormalization is ok. It is only 3 bytes more on every row (if we use date data type). We add only one more condition to query:

    ... and p.CreatedOn between @StartDate and @EndDate

    I suppose only this one denormalization trik and extra condition will perform better, than all, that was done in the orginal article (can you test it on your data?).

    4. After that we can test another indexes on other tables (listed in the article).

    5. If we have a lot of time to do some exercises, we can try to avoid sorting and stream aggregating in the query. We can shift work to the proper index in combination with proper quering. First of all, there is group by operation in the query and one aggregation - "count(*)":

    SELECT

    ...

    ,COUNT(*) AS ThreadPostCount

    FROM

    dbo.Forums AS f

    INNER JOIN dbo.Threads AS t ON

    f.ForumID = t.ForumID

    INNER JOIN dbo.Posts AS p ON

    t.ThreadID = p.ThreadID

    INNER JOIN dbo.Users AS u ON

    p.Poster = u.UserID

    ...

    GROUP BY

    f.ForumName

    ,DATEADD(MONTH, DATEDIFF(MONTH, 0, p.PostDate), 0)

    ,t.Title

    ,u.UserName

    If look at this operations, we can see, that for the purpose of counting (and group by) we actually can use only posts table (with some restrictions - query result can be different, but only a bit):

    f.ForumName doesn't matter (original query doesn't contain order by clause, so, we can think so 😉 )

    DATEADD(MONTH, DATEDIFF(MONTH, 0, p.PostDate), 0) should be converted in materialized calculation on the posts

    t.Title can be replaced with p.ThreadID (if we have no treads with duplicated title, the results be identical)

    u.UserName changed to p.Poster (the same as above)

    so, the query should look something like this:

    with postPreFiltered as

    (

    select

    --cast(dateadd(month, datediff(month, 0, PostDate), 0) as date) AS Month

    p.Month

    ,p.ThreadID

    ,p.Poster

    from

    dbo.Posts p

    where

    --p.PostDate >= @StartDate

    p.TaskCreatedOn between @StartDate AND @EndDate

    )

    ,

    postAggregated as

    (

    select

    *

    from

    (

    select

    *

    ,count(*) over

    (

    partition by

    p.Month

    ,p.ThreadID

    ,p.Poster

    ) as ThreadPostCount

    ,row_number() over

    (

    partition by

    p.Month

    ,p.ThreadID

    ,p.Poster

    order by

    (select 1)

    ) as SingleRow

    from

    postPreFiltered p

    ) t

    where

    SingleRow = 1

    )

    select

    f.ForumName

    ,p.Month

    ,t.Title

    ,u.UserName

    ,p.ThreadPostCount

    from

    dbo.Forums AS f

    inner join dbo.Threads AS t ON

    f.ForumID = t.ForumID

    inner join postAggregated /*dbo.Posts*/ AS p ON

    t.ThreadID = p.ThreadID

    inner join dbo.Users AS u ON

    p.Poster = u.UserID

    where

    t.CreatedOn between @StartDate AND @EndDate

    and

    f.IsPublic = 1

    ;

    so we use window's functions on one table, now we add proper index, that helps avoid sort and stream aggregate. This optimization is very data distribution dependent, and I don't really think, that Gail (the author of the original article) really want a non-sorted query, because the user will see unsorted data - primarily index build on the ids, not strings ;).

    I downloaded archive with db from the previous article and played it a bit (it's a pity, that it's not 25 million rows), anyway, I have a lot of fun trying to optimize the original query! Thank you, Gail!