January 22, 2008 at 4:58 pm
Hi,
I was wondering which one would give the best performance: a GROUP BY on a denormalized table, or a join on two normalized tables.
Let's say I have a weblog table with 100 M+ rows. If I normalize it, I would have a foreign key for the requested web page.
create table logs (
id int primary key,
date datetime,
fk_pages_id int
)
create table pages (
id int primary key,
pageUrl varchar(250)
)
If I don't normalize it, it would go like this:
create table logs (
id int primary key,
date datetime,
pageUrl varchar(250)
)
So if I want to know the number of times each page has been viewed, to get the fastest SELECT, should I go with a join (with an index on logs.fk_pages_id)?
select count(1), pageUrl
from logs
join pages on logs.fk_pages_id = pages.id
where date between '2007-12-01' and '2008-01-01'
group by fk_pages_id
or should I go with a GROUP BY on pageUrl (with an index on pageUrl)?
select count(1), pageUrl
from logs
where date between '2007-12-01' and '2008-01-01'
group by pageUrl
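For reference, the indexes I mention would look roughly like this (the index names are just placeholders; the first is for the normalized logs table, the second for the denormalized one):
-- normalized design: index the foreign key used in the join
create index IX_logs_fk_pages_id on logs (fk_pages_id)
-- denormalized design: index the pageUrl column used in the GROUP BY
create index IX_logs_pageUrl on logs (pageUrl)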
I made some tests, but I'm still not sure. What are the best practices to get the fastest SELECT on a huge table? Could a GROUP BY on an indexed varchar column be faster than a join?
Any ideas are welcome.
Thanks,
Stephane
January 23, 2008 at 8:31 am
First off, the join query won't work as written because pageUrl isn't included in the GROUP BY.
For the denormalized table, I found that setting the clustered index on the date column gave me the best performance. For the normalized pair, putting the clustered index on the date column in the logs table and on id in the pages table was fastest. Overall, the denormalized version was faster because it just didn't have to do the join.
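To spell that out, the change I mean is roughly this (the index name is a placeholder, and it assumes the primary key on logs.id is declared nonclustered so the clustered index can go on date; pages can keep its default clustered primary key on id):
-- logs (either version): move the clustered index to the date column
create clustered index CIX_logs_date on logs (date)
Clustering on date means the date range in the WHERE clause becomes a single range scan over contiguous pages instead of a wider scan.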
Without those clustered indexes (using the defaults on the key values), both sets of tables got parallelism in the plans and performed quite slowly. With the clustered index changes outlined above I got the following (1 million rows, not 100 million):
Denorm
11ms elapsed time, 1 scan, 15 logical reads
Norm
20ms elapsed time, 1 scan, 8 reads on logs & 72000 reads on pages
(there's probably something to be done there, but with only two columns to work with, I tried every combination and it all came out the same)
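If you want to reproduce numbers like these, they're the kind of figures the session statistics give you, along these lines (shown with the denormalized query; the join version works the same way):
set statistics io on
set statistics time on

select count(1), pageUrl
from logs
where date between '2007-12-01' and '2008-01-01'
group by pageUrl

set statistics io off
set statistics time off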
The question isn't only which is faster. The question is: do you pay a cost or derive a benefit from having the data normalized? Basically, are there other access paths through the data that are more likely or more heavily used than others? If so, you need to design based on those access paths. But the short answer is that the denormalized approach was marginally faster, but considerably less costly.
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
January 23, 2008 at 8:52 am
As a matter of fact, all accesses will be similar to that query. So if I understand correctly, I should use the denormalized table.
OK, I'll do more tests along those lines.
Thanks a lot!
Stephane