Optimizing an index for select count(distinct)

Question

Post reply

Optimizing an index for select count(distinct)

stephane3d

Ten Centuries

Points: 1129
More actions
June 12, 2008 at 5:27 pm

#187880

Hi,
I have a table logging our website activity with about 60 M of rows. One user has for itself more than 1 M rows. When I do a count(distinct), it takes two seconds. I was wondering if there is a way to speed that up?
Here the table
CREATE TABLE [dbo].[tbl_logs](
[id] [int] NOT NULL,
[dateAdded] [datetime],
[fk_tbl_visits_id] [int],
[fk_tbl_users_id] [int]
) ON [PRIMARY]
I tried different combinaison of indexes, but here the index that performed the best.
CREATE NONCLUSTERED INDEX [idx_visits] ON [dbo].[tbl_logs]
(
[fk_tbl_users_id] ASC,
[dateAdded] ASC,
[fk_tbl_visits_id] ASC
)WITH (PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
Curiously, that clustered index was never taken by the query optimizer when the one above was created. I thought clustered indexed would be better for that query...
CREATE CLUSTERED INDEX [idx_primary] ON [dbo].[tbl_logs]
(
[fk_tbl_users_id] ASC,
[dateAdded] ASC,
[fk_tbl_visits_id] ASC
)WITH (PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]
And here's the query that return ~1 M rows.
select count(distinct fk_tbl_visits_id) as nb from tbl_logs
where fk_tbl_users_id = 10203 and dateadded between '2008-01-01 00:00:00' and '2008-05-31 23:59:59:999'
The execution plan shows an index seek of 40% and next, a hash match taking 55%.
Any idea of how to reduce the execution time?
Thanks
Stephane

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply

Gail Shaw SSC Guru Points: 1004514 More actions · Answer 1

Can you post the execution plan in its xml format please? (saved as a .sqlplan file zipped and attached to a post)

What indexes do you currently have on the table?

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

stephane3d Ten Centuries Points: 1129 More actions · Answer 2

Here it is. Some objects name might be different (tbl_tmp vs tbl_logs) but it's the same result.

There are only two indexes, those I gave the definition.

Thanks

Stephane

Grant Fritchey SSC Guru Points: 399333 More actions · Answer 3

Looking at the execution plan, I'd try creating a second index, just on fk_tbl_visits_id. There's no guarantee it's going to work. 1m rows out 60m has a .0167 ratio, which is probably not selective enough by itself for the index to work well...

Also, you might try, no guarantees, putting the two date columns into a cluster and this other column into an index of its own.

You've got an index seek, not a scan, so you're ahead of the game. It's just finding a way to get a more efficient aggregation than the Hash Match you're currently seeing. You might not be able to.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

stephane3d Ten Centuries Points: 1129 More actions · Answer 4

Hi,

Finally, after a lots of tests, this is the fastest index (note the include column). It executes the query in 2 seconds.

CREATE NONCLUSTERED INDEX [idx_date] ON [dbo].[tbl_logs]

(

[fk_tbl_users_id] ASC,

[dateAdded] ASC

)

INCLUDE ( [fk_tbl_visits_id]) WITH (PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]

I tried many combination of clustered (id, date which is not unique, date + id) index but it always took 6 seconds or more.

I also tried separate indexes (one for user id, one for the date and one for the visit id) but it was also very long (15 seconds)

I'm now short of ideas to find out how to reduce execution time.

As a comparaison when I check Google Analytics, it process reports in less than a second. I know we don't have Google servers, but we don't have as much data too. Does anyone know how they query their data?

Thanks

Stephane

noeld SSC Guru Points: 96590 More actions · Answer 5

If you need to increase the search speed even further you could create an INDEXED VIEW and run your query on it.

the view definition would be like:

select fk_tbl_users_id, fk_tbl_visits_id, date, count_big(*) cnt

from tbl_log

group by fk_tbl_users_id, fk_tbl_visits_id, date

"date" would have to be defined as a calculated column

then you would run your query on top of this "preagregated" data.

* Noel

Grant Fritchey SSC Guru Points: 399333 More actions · Answer 6

Part of the problem is returning 1 million rows. Does the user need a million rows? Are they really using a million rows? Are they going to look at 1 million rows? Does the average user look at 1 thousand rows? Not in my experience. You'd do better with a query that returns 200 rows but says that 1 million are available.

You might need some paging process then though.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

stephane3d Ten Centuries Points: 1129 More actions · Answer 7

Hum, of course, the user doesn't need a million of rows. The query is about getting the count! 😉

select count(distinct fk_tbl_visits_id) as nb from tbl_logs

where fk_tbl_users_id = 10203 and dateadded between '2008-01-01 00:00:00' and '2008-05-31 23:59:59:999'

Stephane

Grant Fritchey SSC Guru Points: 399333 More actions · Answer 8

Duly chastised and I've proven my lack of reading ability, yet again.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

bcronce SSCrazy Points: 2597 More actions · Answer 9

what is a 'visits_id'? It seems that the 'distinct' is what's hosing the computation. checking for duplicates accress 1mil rows is a lot of checking. Is that what the hash is?

How is 'visits_id' used that it must be distinct?

Not sure if any of these questions even help. Thought it was atleast worth an ask.

Jonathan Mallia SSCertifiable Points: 5192 More actions · Answer 10

Have you tried analysing the Query in Database Engine Tuning Advisor?

stephane3d Ten Centuries Points: 1129 More actions · Answer 11

Good idea.

I tried and it says « Estimated Improvement 0% »... That's a dead end I think! 😀

Jonathan Mallia SSCertifiable Points: 5192 More actions · Answer 12

Jonathan Mallia

SSCertifiable

Points: 5192

June 18, 2008 at 12:41 am

#830220

I bet that answers your question 😎

noeld SSC Guru Points: 96590 More actions · Answer 13

stephane3d (6/17/2008)
Good idea.
I tried and it says « Estimated Improvement 0% »... That's a dead end I think! 😀

Did you try the "indexed view" idea ?

* Noel

stephane3d Ten Centuries Points: 1129 More actions · Answer 14

No, I didn't consider that actually.

Can an indexed view be faster than an index on the table itself?

Thanks

Stephane