Best performance with group by or join?

  • Hi,

    I was wondering which one would give the best performance: a group by on a denormalized table, or a join on two normalized tables.

    Let's say I have a weblog table with 100 M+ rows. If I normalize it, I would have a foreign key for the requested web page.

    create table logs (
      id int primary key,
      date datetime,
      fk_pages_id int
    )

    create table pages (
      id int primary key,
      pageUrl varchar(250)
    )

    If I don't normalize it, it would go like this:

    create table logs (
      id int primary key,
      date datetime,
      pageUrl varchar(250)
    )

    So if I want to know the number of times each page has been viewed, to get the fastest select, should I go with a join (with an index on logs.fk_pages_id)?

    select count(1), pageUrl from logs join pages on logs.fk_pages_id = pages.id where date between '2007-12-01' and '2008-01-01' group by fk_pages_id

    or should I go with a group by on pageUrl (with an index on pageUrl)?

    select count(1), pageUrl from logs where date between '2007-12-01' and '2008-01-01' group by pageUrl
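
    To be clear, the indexes I mention in parentheses would be something along these lines (the names are just placeholders):

    -- normalized version: index the foreign key used in the join / group by
    create index ix_logs_fk_pages_id on logs (fk_pages_id)

    -- denormalized version: index the pageUrl column used in the group by
    create index ix_logs_pageUrl on logs (pageUrl)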

    I made some tests, but I'm still not sure. What are the best practices to get the fastest select on a huge table? Could a group by on an indexed varchar column be faster than a join?

    Any ideas are welcome.

    Thanks,

    Stephane

    First off, the first sample query won't work as written because pageUrl isn't included in the group by.
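
    To run, it would need pageUrl in the group by, something like:

    select count(1), pageUrl
    from logs
    join pages on logs.fk_pages_id = pages.id
    where date between '2007-12-01' and '2008-01-01'
    group by logs.fk_pages_id, pages.pageUrl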

    For the denormalized table, I found that setting the clustered index on the date field gave me the best performance. For the normalized pair, putting the cluster on the date column in the logs table and on the id in the pages table was fastest. Overall, the denormalized version was faster because it simply didn't have to do the join.
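
    For clarity, the change I'm describing is roughly this (it assumes the primary key on logs is created as nonclustered; the index name is just a placeholder):

    -- cluster both versions of the logs table on the date column
    create clustered index cix_logs_date on logs (date)
    -- the pages table keeps its clustered primary key on id (the default)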

    Without the clustered index changes (using the defaults on the key values), both sets of tables produced parallelism in the plans and performed quite slowly. With the clustered index changes outlined above I got the following (1 million rows, not 100 million):

    Denorm

    11ms elapsed time, 1 scan, 15 logical reads

    Norm

    20ms elapsed time, 1 scan, 8 reads on logs & 72000 reads on pages

    (there's probably something to be done there, but with only two columns to work with, I tried every combination and it all came out the same)

    The question isn't only which is faster; the question is whether you pay a cost or derive a benefit from having the data normalized. Basically, are there other access paths through the data that are more likely or more heavily used than others? If so, you need to design based on those access paths. But the short answer is that the denormalized approach was marginally faster and considerably less costly.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

    As a matter of fact, all accesses will be similar to that query. So if I understand correctly, I should use the denormalized table.

    Ok, I'll do more tests along those lines.

    Thanks a lot!

    Stephane
