which index to use to limit rows in a big table?

  • Hi,

    I have a following table:

    CREATE TABLE test_pk (test_id INT PRIMARY KEY, next_id INT, status TINYINT, field1, field2 etc.)

    The table represents a linked list with each record pointing to the next record in the same table via next_id field. Some records have next_id IS NULL, these are so-called active records that I use in most of my database queries. So I need a way to set up an index to help me limit the query results to active rows only.

    Another important parameter that is used in many queries is status field.

    The table has about 1.5 million rows. There are about 150'000 rows with next_id IS NULL. Average row size is ca 500 bytes.

    A non-clustered index is defined on field next_id. Also, for testing purpose, I added two compound indexes: "next_id, status" and "status, next_id".

    Now to the problem: when a query is run that uses filter by next_id and status fields, the SQL server does not use non-clustered indexes. Instead, clustered index is used, which is essentially a table scan and will create large amount of logical reads.

    SELECT * FROM test_pk WHERE next_id IS NULL AND status = 1

    Clustered index scan: PK_test_pk

    (4537 row(s) affected)

    Table 'test_pk'. Scan count 1, logical reads 83839, physical reads 0

    SELECT * FROM test_pk WITH (INDEX=IX_test_pk_next_id_status) WHERE next_id IS NULL AND status = 1

    Index seek: IX_test_pk_next_id_status

    Cluster index seek: PK_test_pk

    Nested loops

    (4537 row(s) affected)

    Table 'test_pk'. Scan count 1, logical reads 13915, physical reads 0

    Due to index hint, the second query reads 6 times less data and runs 2-3 times faster. But is it possible to achieve the same (or better) result without index hint?

    TIA,

    Denis

    SQL2005 SP2

  • Do you have to do select *? Selecting all the fields essentially makes creating covering indexes impossible.

    Try rebuilding both the NC and the clustered indexes and see if that makes a difference.

    My guess is that the query optimiser decided that doing all the lookups into the clustered index was more expensive than just scanning the cluster. Lookups into the cluster are random IOs and a random IO is around 8 times as expensive as as sequential IO.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • You can add status to the NC index. The optimizer chooses a table scan because there are so many rows with NULL values for the SARG. I bet your status column isn't very specific either. Even adding the status field to the index may not help the optimizer 'know' that only 4K+ rows out of the 150K match the secondary filter (status =1 )

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • first i would try to drop and recreate the index. i've seen instances where it's not fragmented, but it won't use an index seek for some reason until you rebuild the index. could be statistics.

    otherwise you may have to use the index hint because there is a threshold after which SQL will use a scan instead of a seek.

  • Forcing the index though gets you a horrible index seek/bookmark lookup plan in those cases when the status value being filtered is a large fraction of the 150K null values in the other column.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thanks to everyone,

    GilaMonster, I cannot use convering index because next_id and status are only used for finding records, but to display what I've found I need all fields.

    I did DBCC DBREINDEX(test_pk), but that didn't change much, non-clustered indexes are still not chosen by SQL.

    TheSQLGuru, you are right about status, it is not very selective. But the interesting thing is, if I do the same SELECT for status = 0 (which is used in the table, but no active rows have such status), then SQL is *still* not using NC index and prefers his huge PK scan. Same query with index hint yields just 4 logical reads and is of course lightning fast.

    But, when I do same SELECT for status = 99, which is not used at all, then finally query planner realizes it has a NC index and utilizes it.

    My guess is that since server has statistics for "compound" indexes, it is possible to determine whether those indexes are "good to use" for given parameter values. The optimizer should understand that condition (next_id IS NULL AND status=0) has zero hits in compound indexes, and this should be a good reason to utilize the index.

    Trying to figure out why it doesn't happen, I added a new column is_active, filled it with 1 for rows with next_id IS NULL, and created an index on this new column.

    SELECT * FROM test_pk WHERE is_active = 1 AND status = 0

    Index seek: IX_test_pk_is_active SEEK is_active = 1

    Index seek: IX_test_pk_status SEEK status = 0

    Merged Join

    Index seek: PK_test_pk SEEK test_id

    Nested Loops

    (0 row(s) affected)

    Table 'test_pk'. Scan count 1, logical reads 350

    This looks much better, but the next select for status = 1 will still prefer the cluster index scan 🙁 even if I add compound indexes (is_active, status) and (status, is_active).

    One more thing that really depressed me, is that somehow, adding compound indexes screwed the good query plan for status = 0 and it is now also index scan. Removing compounds restores good plan. Looks like nonsense to me.

    BTW I agree that index hints can kill your query if not used with caution, that's why I always try to avoid them at all costs. My queries often have many dynamic parameters, so I'd rather leave choosing indexes to query optimizer.

    Denis

  • Wapper (10/3/2007)


    GilaMonster, I cannot use convering index because next_id and status are only used for finding records, but to display what I've found I need all fields.

    You might still be able to cover the query, with the INCLUDE feature or SQL 2005.

    How many fields are in the table, and are all of them really necessary to display?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Can you run?

    UPDATE STATISTICS tblname WITH FULLSCAN,ALL


    * Noel

  • GilaMonster, the table has a lot of columns, and not all of them are needed every time. But no matter if I do select * or only include one column not covered by index, the cost is always same 80k+ pages read.

    Thanks for pointing out the INCLUDE option, it is no doubt useful feature but I'd like to maintain SQL2000 compatibility on script level for this particular database.

    noeld, thanks, did that, nothing has changed after that.

    Denis

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply