What's the best way to create a clustered index?

  • Hi,

    I keep wondering what's the best way to create a clustered index. I've tried many approaches. Some make the query fast, even though I think it could be faster; others are pretty slow, even though the execution plan shows a "clustered index seek".

    So here's the thing. I have a logs table for a custom web log analyzer. It goes like this:

    id bigint, logDate datetime, fk_websites_id int, logFields N...

    I have an id because two logs can have the same date. Rarely, but it happens. The fk_websites_id refers to a websites table because we monitor around 1000 websites. So far, we have logged around 100 000 000 rows with dates ranging from 3 years ago to present.

    Now, the queries are always something like

    select SomeFields from tbl_logs where fk_websites_id = 100000 and logDate between dateFrom and dateTo

    The where clause might be more complicated, with other fields added (field_1 = someValue and field_2 like 'blabla%'), but basically the fk_websites_id and the logDate are always required.

    I tried several combinations of the clustered index:

    logDate, fk_websites_id, id (this is slow)

    fk_websites_id, logDate, id (this is the fastest)

    id, logDate (I also have a non-clustered index on fk_websites_id)

    logDate, id (I also have a non-clustered index on fk_websites_id)

    Along with these combinations, I tried several orderings of the where clause, putting the fk_websites_id before the logDate and so on.
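    For reference, the fastest combination above could be written as T-SQL DDL something like this (a sketch only; the dbo schema and index name are assumptions, the table and column names come from the post):

    ```sql
    -- Sketch of the fastest combination tried: cluster on
    -- (fk_websites_id, logDate, id). Including id makes the key unique.
    CREATE CLUSTERED INDEX CIX_tbl_logs
        ON dbo.tbl_logs (fk_websites_id, logDate, id);
    ```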

    So I need some advice here. What would give the best results for a clustered index? What's the golden rule for creating a clustered index on a log table?

    Thanks for any idea

    Stephane

  • I think the best strategy is to create the clustered index on a column (or columns) that gives you a unique value per row.

    You should go for indexes on the columns that are frequently used in the SELECT column list and the columns that are used in the where clause.

    The columns on which aggregate functions are applied should be part of the INCLUDE clause.
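    As a sketch of the INCLUDE suggestion (the index name and the choice of included columns are illustrative assumptions, not from the thread):

    ```sql
    -- Key columns serve the WHERE clause; INCLUDE carries columns that are
    -- only selected or aggregated, so they are stored at the leaf level
    -- without widening the index key.
    CREATE NONCLUSTERED INDEX IX_tbl_logs_website_date
        ON dbo.tbl_logs (fk_websites_id, logDate)
        INCLUDE (field_1, field_2);
    ```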

    This strategy is recommended in many books, and there are also many questions in MCITP-442 dumps regarding these types of scenarios.

    Atif Sheikh

    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sometimes, winning is not an issue but trying.
    You can check my BLOG here

  • I may be missing something, but this seems odd to me. You monitor approximately 1,000 websites, and have been doing so for about 3 years, with the normal traffic to the table being 1 entry per day, but somehow you have 100-million rows. 1000 websites * 3 years * 365 days per year = just over 1-million entries. What am I missing?

    The reason I ask is because insert volume matters a lot in defining a clustered index.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • You're right, it might be confusing!

    It's not an entry per day but an entry for each web page visited. So in a single day, we could have 100 000 or 500 000 log entries.

    Thanks

    Stephane

  • Just remember that the clustered index also stores the rest of the data with the table. You basically need to cluster on the most common access path to your data. I wouldn't recommend going for unique values necessarily. Rather, I'd focus on making it a compound index based probably on the ID and the date. It may turn out to be unique, but that's not the focus.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • 
    If you want to post a couple of execution plans, it would be easier to make choices.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • stephane3d (8/20/2008)


    What's the golden rule of creating such clustered index for a log table?

    It depends. < Ducks and runs >

    Seriously, there aren't any hard-and-fast rules about where indexes should go. It depends on the table structure, on the data and on the queries. Also, with the clustered index, insert and update patterns matter a lot.

    The usual recommendations for a clustered index are narrow, unique, non-changing, ascending values (like date inserted or an identity column). There are times, though, when breaking one or more of those makes sense.

    Take a look at the most frequently run queries on the table, and consider what indexes would support them. Clusters can be good for range queries or for supporting an order by.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Gut feel on this one is that (fk_websites_id, logDate) may be a good choice, though without seeing exec plans I can't say for sure.

    Reason is that you say they are both always in the condition. They're fairly narrow. They're probably not unique, but they shouldn't be far from unique if the logDate is a full datetime. The fk_websites_id needs to come first since you are doing an equality on that and a range on the logDate.
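    A minimal sketch of this suggestion in T-SQL (appending id to keep the key unique is an assumption; without it, SQL Server would add a hidden uniquifier to duplicate keys):

    ```sql
    -- Equality column first, range column second; id appended so the key
    -- is unique and no internal uniquifier is needed.
    CREATE UNIQUE CLUSTERED INDEX CIX_tbl_logs
        ON dbo.tbl_logs (fk_websites_id, logDate, id);
    ```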

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I'd go with the suggestion of fk_websites_id and logDate, but also keep in mind that you'll want to test the fill factor on that, since you won't be getting fully sequential inserts, and the table/index might fragment too rapidly if you set the fill factor too high.

    Maybe start with 80%, and refine from there based on tests and monitoring.
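    The fill-factor suggestion could be applied like this (a sketch; the index name is an assumption):

    ```sql
    -- Rebuild the clustered index with an 80% fill factor as a starting
    -- point, then adjust based on observed fragmentation over time.
    ALTER INDEX CIX_tbl_logs ON dbo.tbl_logs
        REBUILD WITH (FILLFACTOR = 80);
    ```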

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
