Optimizing an index for select count(distinct)

  • Hi,

    I have a table logging our website activity with about 60 M of rows. One user has for itself more than 1 M rows. When I do a count(distinct), it takes two seconds. I was wondering if there is a way to speed that up?

    Here the table

    CREATE TABLE [dbo].[tbl_logs](

    [id] [int] NOT NULL,

    [dateAdded] [datetime],

    [fk_tbl_visits_id] [int],

    [fk_tbl_users_id] [int]

    ) ON [PRIMARY]

    I tried different combinaison of indexes, but here the index that performed the best.

    CREATE NONCLUSTERED INDEX [idx_visits] ON [dbo].[tbl_logs]

    (

    [fk_tbl_users_id] ASC,

    [dateAdded] ASC,

    [fk_tbl_visits_id] ASC

    )WITH (PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]

    Curiously, that clustered index was never taken by the query optimizer when the one above was created. I thought clustered indexed would be better for that query...

    CREATE CLUSTERED INDEX [idx_primary] ON [dbo].[tbl_logs]

    (

    [fk_tbl_users_id] ASC,

    [dateAdded] ASC,

    [fk_tbl_visits_id] ASC

    )WITH (PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]

    And here's the query that return ~1 M rows.

    select count(distinct fk_tbl_visits_id) as nb from tbl_logs

    where fk_tbl_users_id = 10203 and dateadded between '2008-01-01 00:00:00' and '2008-05-31 23:59:59:999'

    The execution plan shows an index seek of 40% and next, a hash match taking 55%.

    Any idea of how to reduce the execution time?

    Thanks

    Stephane

  • Can you post the execution plan in its xml format please? (saved as a .sqlplan file zipped and attached to a post)

    What indexes do you currently have on the table?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Here it is. Some objects name might be different (tbl_tmp vs tbl_logs) but it's the same result.

    There are only two indexes, those I gave the definition.

    Thanks

    Stephane

  • Looking at the execution plan, I'd try creating a second index, just on fk_tbl_visits_id. There's no guarantee it's going to work. 1m rows out 60m has a .0167 ratio, which is probably not selective enough by itself for the index to work well...

    Also, you might try, no guarantees, putting the two date columns into a cluster and this other column into an index of its own.

    You've got an index seek, not a scan, so you're ahead of the game. It's just finding a way to get a more efficient aggregation than the Hash Match you're currently seeing. You might not be able to.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hi,

    Finally, after a lots of tests, this is the fastest index (note the include column). It executes the query in 2 seconds.

    CREATE NONCLUSTERED INDEX [idx_date] ON [dbo].[tbl_logs]

    (

    [fk_tbl_users_id] ASC,

    [dateAdded] ASC

    )

    INCLUDE ( [fk_tbl_visits_id]) WITH (PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF) ON [PRIMARY]

    I tried many combination of clustered (id, date which is not unique, date + id) index but it always took 6 seconds or more.

    I also tried separate indexes (one for user id, one for the date and one for the visit id) but it was also very long (15 seconds)

    I'm now short of ideas to find out how to reduce execution time.

    As a comparaison when I check Google Analytics, it process reports in less than a second. I know we don't have Google servers, but we don't have as much data too. Does anyone know how they query their data?

    Thanks

    Stephane

  • If you need to increase the search speed even further you could create an INDEXED VIEW and run your query on it.

    the view definition would be like:

    select fk_tbl_users_id, fk_tbl_visits_id, date, count_big(*) cnt

    from tbl_log

    group by fk_tbl_users_id, fk_tbl_visits_id, date

    "date" would have to be defined as a calculated column

    then you would run your query on top of this "preagregated" data.


    * Noel

  • Part of the problem is returning 1 million rows. Does the user need a million rows? Are they really using a million rows? Are they going to look at 1 million rows? Does the average user look at 1 thousand rows? Not in my experience. You'd do better with a query that returns 200 rows but says that 1 million are available.

    You might need some paging process then though.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hum, of course, the user doesn't need a million of rows. The query is about getting the count! ๐Ÿ˜‰

    select count(distinct fk_tbl_visits_id) as nb from tbl_logs

    where fk_tbl_users_id = 10203 and dateadded between '2008-01-01 00:00:00' and '2008-05-31 23:59:59:999'

    Stephane

  • Duly chastised and I've proven my lack of reading ability, yet again.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • what is a 'visits_id'? It seems that the 'distinct' is what's hosing the computation. checking for duplicates accress 1mil rows is a lot of checking. Is that what the hash is?

    How is 'visits_id' used that it must be distinct?

    Not sure if any of these questions even help. Thought it was atleast worth an ask.

  • Have you tried analysing the Query in Database Engine Tuning Advisor?

  • Good idea.

    I tried and it says ยซ Estimated Improvement 0% ยป... That's a dead end I think! ๐Ÿ˜€

  • I bet that answers your question ๐Ÿ˜Ž

  • stephane3d (6/17/2008)


    Good idea.

    I tried and it says ยซ Estimated Improvement 0% ยป... That's a dead end I think! ๐Ÿ˜€

    Did you try the "indexed view" idea ?


    * Noel

  • No, I didn't consider that actually.

    Can an indexed view be faster than an index on the table itself?

    Thanks

    Stephane

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply