What is the proper way of indexing a table that contains a large amount of data

  • Anuj Rathi (2/8/2013)


    "If you have a WHERE clause that includes expressions connected by two or more AND operators, SQL Server will evaluate them from left to right in the order they are written. This assumes that no parenthesis have been used to change the order of execution. Because of this, you may want to consider one of the following when using AND:

    Locate the least likely true AND expression first. This way, if the AND expression is false, the clause will end immediately, saving time.

    If both parts of an AND expression are equally likely to be false, put the least complex AND expression first. This way, if it is false, less work will have to be done to evaluate the expression.

    You may want to consider using Query Analyzer or Management Studio to look at the execution plans of your queries to see which is best for your situation. [2000, 2005, 2008] Updated 2-4-2009 "

    That's a load of complete hogwash.

    Operators are not evaluated from left to right, and if an expression is false the evaluation does not end there.
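
    You can check this for yourself. A minimal sketch (the table and values are made up purely for illustration): compare the actual execution plans of the two queries below, and you should find the optimizer produces the same plan regardless of the order in which the ANDed expressions are written.

        -- Hypothetical table, purely for illustration.
        CREATE TABLE dbo.DemoTable (int_col int, char_col varchar(10));
        GO

        -- Same predicates, opposite written order. Compare the actual
        -- execution plans; they should be identical.
        SELECT int_col, char_col FROM dbo.DemoTable
        WHERE int_col = 42 AND char_col = 'abc';

        SELECT int_col, char_col FROM dbo.DemoTable
        WHERE char_col = 'abc' AND int_col = 42;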

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Once again, I have a situation where my way is blocked.

    As I said earlier in this thread, I have to implement indexing on a group of tables. Before raising my question, I just want to explain my situation again.

    In my table (as I explained earlier), there are two bit-type columns:

    1. The first one is the record status: whether the record is active or not.

    2. The second one is the record state. We don't delete any record physically; we just change the state flag from 1 to 0.

    3. There are also a few foreign key columns containing int IDs, which reference other base tables: say, the record_cat, record_subcat1 & record_subcat2 columns.

    record_cat will also be included in most of the queries.

    Now, my question is:

    I will have to include both the status & state columns in each & every query, but these columns will match a very large number of rows, because approximately 70% of the records are active. The approximate percentage of rows each column returns is:

    record_state: 85%

    record_status: 78%

    record_cat: 60%

    record_subcat1: 40%

    record_subcat2: 20%

    So, if I create a non-clustered index like (record_state, record_status, record_cat, record_subcat1, record_subcat2),

    then the first two columns will match a large number of rows, and it becomes the responsibility of record_cat, record_subcat1 and record_subcat2 to filter the data down.

    But if I consider (record_subcat2, record_subcat1, record_cat, record_status, record_state), this sequence is also a covering index, just in the opposite scenario: record_subcat2 filters out most of the records first, and so on. (Both candidates are sketched below.)
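
    For clarity, the two candidate indexes would look something like this (a sketch only; the table name dbo.Records is assumed for illustration):

        -- Candidate 1: least selective columns leading.
        CREATE NONCLUSTERED INDEX IX_Records_StateFirst
            ON dbo.Records (record_state, record_status, record_cat, record_subcat1, record_subcat2);

        -- Candidate 2: most selective columns leading.
        CREATE NONCLUSTERED INDEX IX_Records_SubcatFirst
            ON dbo.Records (record_subcat2, record_subcat1, record_cat, record_status, record_state);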

    But if I consider your recommendation (most selective column first), then record_state & record_status should be first.

    Please give me some advice.

  • But state and status are not the most selective; they are the least selective.

    Selectivity is how unique the data is. The most selective column would be the column with the most unique values in the table, among the columns you are considering for that index. Meaning, you don't always put the single most unique column in the table as the first column, but the most unique column of the ones you are currently choosing from for this index (god I hope that's clear).

    So no, you probably don't want those bit columns as the first columns of your index. But you probably will want to have them in the index regularly. Whether they should be part of the key or just an included column, only testing will tell (I lean towards part of the key, but I'd test both).
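
    If it helps, you can measure selectivity directly. A quick sketch (the table name dbo.Records is assumed from your earlier posts): the closer a column's distinct count is to the total row count, the more selective that column is.

        SELECT
            COUNT(*)                       AS total_rows,
            COUNT(DISTINCT record_state)   AS distinct_state,   -- bit: at most 2 values
            COUNT(DISTINCT record_status)  AS distinct_status,  -- bit: at most 2 values
            COUNT(DISTINCT record_cat)     AS distinct_cat,
            COUNT(DISTINCT record_subcat1) AS distinct_subcat1,
            COUNT(DISTINCT record_subcat2) AS distinct_subcat2
        FROM dbo.Records;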

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Selectivity should not be the primary indicator for choosing index columns. Putting the most selective column first would be stupid if nothing filters by it. The decision as to column order in indexes should be driven by which columns the queries filter on. Selectivity is a secondary consideration.

    I've linked this blog post again; it's discussed in there or in the posts it links to.

    http://sqlinthewild.co.za/index.php/2011/11/11/sql-university-advanced-indexing-indexing-strategies/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (2/8/2013)


    Selectivity should not be the primary indicator for choosing index columns. Putting the most selective column first would be stupid if nothing filters by it. The decision as to column order in indexes should be driven by which columns the queries filter on. Selectivity is a secondary consideration.

    I've linked this blog post again; it's discussed in there or in the posts it links to.

    http://sqlinthewild.co.za/index.php/2011/11/11/sql-university-advanced-indexing-indexing-strategies/

    Exactly. It's only a guideline, a starting point, the first test, if it makes sense.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Yes, this is clear now.

    Thank you for this good explanation.

    I am afraid I am asking yet another question, but before asking it, I have to explain my case.

    record_cat is a foreign key for record_subcat1, & record_subcat1 is a foreign key for record_subcat2.

    So, as per your explanation, record_subcat2 is the most selective column in our index, &

    record_subcat2, record_subcat1, record_cat, ..........

    is the preferred sequence.

    This will work fine if we search records in subcategory2. But if we search records in subcategory1 (category will also be there),

    subcategory2 will not be present in the WHERE clause, and the above index will become useless, because an index seek depends on the first column in the index column sequence.

    Second case:

    In the absence of subcategory2, if we consider those records where subcategory2 is NULL, like below

    subcategory2 is null and subcategory1 = x and category = y (on the basis of the above index sequence)

    then in this case, this query will become non-SARGable.
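
    To make the two cases concrete, here is a sketch (the index, table, and column names are illustrative; whether each query can actually seek should be confirmed in the execution plan):

        -- Illustrative index, keyed with the most selective column leading.
        CREATE NONCLUSTERED INDEX IX_Records_Subcat2Leading
            ON dbo.Records (record_subcat2, record_subcat1, record_cat);

        DECLARE @x int = 1, @y int = 1;

        -- Case 1: the leading column is absent from the WHERE clause,
        -- so the index cannot be seeked on record_subcat2.
        SELECT record_cat FROM dbo.Records
        WHERE record_subcat1 = @x AND record_cat = @y;

        -- Case 2: the leading column is filtered with IS NULL.
        SELECT record_cat FROM dbo.Records
        WHERE record_subcat2 IS NULL AND record_subcat1 = @x AND record_cat = @y;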

    My main question is: how many indexes (at a minimum) should I create, & what should the sequence of columns be?

    I have tested all the situations, & they are similar to the above.

    The tuning advisor suggests the status & state columns as the very first columns for every query I tested, & for the rest of the columns it suggests a different index according to the sequence of the columns in the WHERE clause (a few situations I explained above).

  • There just aren't very many hard and fast rules for how to put this together. Yes, if a given column in an index key isn't used, that might make it less likely for that index to be used for other queries. But you have to test to be sure.

    As to having the exact same columns in the leading edge of multiple indexes, I would be extremely hesitant there. It can lead to good indexes not being recognized and used by the queries that need them.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Anuj Rathi (2/8/2013)


    record_cat is a foreign key for record_subcat1, & record_subcat1 is a foreign key for record_subcat2.

    So, as per your explanation, record_subcat2 is the most selective column in our index, &

    record_subcat2, record_subcat1, record_cat, ..........

    is the preferred sequence.

    All of which is completely, absolutely, totally irrelevant.

    What queries filter on those columns, and which columns are filtered on most? That is what you need for indexing. Putting an index on the most selective column, or ordering one by selectivity, is a total waste of time if the queries are not going to use it. Selectivity is something you consider, not the only thing that determines index column ordering.

    http://sqlinthewild.co.za/index.php/2011/11/11/sql-university-advanced-indexing-indexing-strategies/

    and the posts that it links to.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thank you, Fritchey and Gail.

    Now I am quite satisfied.

    Finally, this is my conclusion: I will check which columns take part in most of the queries, which column(s) are mostly responsible for populating the output, and which columns filter the data to make the output as narrow as possible (across all, or most, queries), & then I will try to build a few unified/homogeneous indexes on the basis of those considerations.

    Am I going in the right direction?

  • Sounds mostly reasonable, yes.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • To change directions a bit, I am facing these same issues in the system I am working on. There are a number of places in the UI where the users can perform a "catch all" query. The original code simply times out. Some of these have as many as 90 different parameters that may or may not be used.

    The first thing I did was to create a logging table that stored the values of all of the parameters, with an identity and a created time stamp.

    At the beginning of each proc, I insert the values of all of the parameters into this table. I was able to capture exactly what the users were searching on, and then do some analysis.
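
    The idea looks something like this (a minimal sketch; the table, procedure, and parameter names are made up, and the real table has one column per search parameter):

        -- One column per search parameter, plus an identity and a timestamp.
        CREATE TABLE dbo.SearchParameterLog (
            log_id     int IDENTITY(1,1) PRIMARY KEY,
            created_at datetime2 NOT NULL DEFAULT SYSDATETIME(),
            param_a    int NULL,
            param_b    varchar(50) NULL
        );
        GO

        CREATE PROCEDURE dbo.SearchSomething
            @param_a int = NULL,
            @param_b varchar(50) = NULL
        AS
        BEGIN
            -- Capture exactly what the users searched on, for later analysis.
            INSERT INTO dbo.SearchParameterLog (param_a, param_b)
            VALUES (@param_a, @param_b);

            -- ... the actual search logic follows ...
        END;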

    One of the searches indicated that there were 7 values used in 70% of the searches. The proc tests whether all 7 are supplied, and I wrote that query, and the proper indexes, to make this path fast. No dynamic SQL!

    If any of the other parameters were used, I used dynamic SQL and was able to tune those cases more easily.
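
    The pattern is roughly this (a sketch under the assumptions above; the dbo.Orders table and the three parameters stand in for the real 7-plus-many):

        CREATE PROCEDURE dbo.SearchOrders
            @customer_id int = NULL,
            @order_date  date = NULL,
            @status_code int = NULL
        AS
        BEGIN
            -- Fast path: the commonly supplied parameters are all present,
            -- so use a static, hand-tuned query backed by proper indexes.
            IF @customer_id IS NOT NULL AND @order_date IS NOT NULL AND @status_code IS NOT NULL
            BEGIN
                SELECT order_id
                FROM dbo.Orders
                WHERE customer_id = @customer_id
                  AND order_date  = @order_date
                  AND status_code = @status_code;
                RETURN;
            END;

            -- Catch-all path: build only the predicates actually supplied.
            DECLARE @sql nvarchar(max) = N'SELECT order_id FROM dbo.Orders WHERE 1 = 1';
            IF @customer_id IS NOT NULL SET @sql = @sql + N' AND customer_id = @customer_id';
            IF @order_date  IS NOT NULL SET @sql = @sql + N' AND order_date = @order_date';
            IF @status_code IS NOT NULL SET @sql = @sql + N' AND status_code = @status_code';

            EXEC sp_executesql @sql,
                N'@customer_id int, @order_date date, @status_code int',
                @customer_id = @customer_id,
                @order_date  = @order_date,
                @status_code = @status_code;
        END;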

    This was the easy one!

    The hard one contained 92 different parameters. By capturing the values of the parameters being used, I saw that the searches were similar according to customer location. One location did one search all the time; another did something else.

    It turned into 4 different "base" searches.

    In this case, there were 4 separate procedures written, all using dynamic SQL, that covered these 4 bases. Again, I was able to divide and conquer because I captured the actual parameters being used in the searches.

    The primary thing this showed me was that there were some parameters that were rarely used, which allowed me to completely ignore them in my efforts.

    Now, MOST of the searches are very fast, and the users are happy. The "odd" searches take time to complete, but the users understand this.

    I do monitor the parameter usage to see if it may be changing.

    Good luck.

    Michael L John
    If you assassinate a DBA, would you pull a trigger?
    To properly post on a forum:
    http://www.sqlservercentral.com/articles/61537/

  • Thanks, John,

    This is really a very good & simple approach to capture all the search patterns. I hadn't thought about such a logging table. I will definitely follow this approach.

    One more thing I have noticed today.

    Our ex-DBA created all columns using abbreviated prefixes, e.g. catID for CategoryID and catvalue for CategoryName, & then made a view FOR EACH TABLE in which he defined the full column names (i.e. CategoryID, CategoryName) for the table columns.

    His point was: "I don't want to disclose the ACTUAL column names of any table to the UI developers, & hence we will not use actual column names in the UI application. This will reduce SQL injection. So, for SELECT operations, we will use the VIEW, not the table directly."

    But my opinion is different. We should create views only for those complex queries which we use in multiple places, because if we create a VIEW (which contains only one table, just to map the column names) for each table, this will create extra overhead & processing while executing the query.

    As for SQL injection, we have used stored procedures, & we created a separate login/user & GRANTED ONLY EXECUTE permission on a specific SCHEMA which contains only the UI-application-related stored procedures.
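
    For reference, that permission setup looks something like this (a sketch; the login, user, and schema names are made up):

        -- Login and database user dedicated to the UI application.
        CREATE LOGIN ui_app_login WITH PASSWORD = 'UseAStrongPasswordHere!1';
        GO
        CREATE USER ui_app_user FOR LOGIN ui_app_login;
        GO
        -- Schema holding only the UI-facing stored procedures.
        CREATE SCHEMA app_procs AUTHORIZATION dbo;
        GO
        -- The app user can execute procedures in this schema, and nothing else.
        GRANT EXECUTE ON SCHEMA::app_procs TO ui_app_user;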

    Then why are we adding the extra load of mapping column names?
