index scan where seek expected

Question

index scan where seek expected

jontris

Right there with Babe

Points: 714
More actions
August 28, 2018 at 5:30 am

#376776

On a SQL 2016 instance I have a large table with dozens of columns including:
COLA INT
COLB BIT
and nonclustered index:
(COLA ASC, COLB ASC)
I observe the following:
select * from table where COLA = n and COLB = 1            /* index scan */
select * from table where COLA = n and COLB = 0            /* index seek */
select * from table where COLA = n /* index seek */
select COLA, COLB from table where COLA = n and COLB = 1     /* index seek */
In other words it behaves as expected in every case except when there is the combination of 'select *' for the column list and 'COLB = 1' as second predicate. The seeks are instant but the scan takes minutes, so this is a problem. I haven't attempted to isolate the problem further which could take forever, but am I missing something obvious?

Viewing 15 posts - 1 through 15 (of 37 total)

You must be logged in to reply to this topic. Login to reply

Jonathan AC Roberts SSCoach Points: 18263 More actions · Answer 1

jontris - Tuesday, August 28, 2018 5:30 AM
On a SQL 2016 instance I have a large table with dozens of columns including:
COLA INT
COLB BIT
and nonclustered index:
(COLA ASC, COLB ASC)
I observe the following:
select * from table where COLA = n and COLB = 1            /* index scan */
select * from table where COLA = n and COLB = 0            /* index seek */
select * from table where COLA = n /* index seek */
select COLA, COLB from table where COLA = n and COLB = 1     /* index seek */
In other words it behaves as expected in every case except when there is the combination of 'select *' for the column list and 'COLB = 1' as second predicate. The seeks are instant but the scan takes minutes, so this is a problem. I haven't attempted to isolate the problem further which could take forever, but am I missing something obvious?

Have you got statistics on the table?
https://docs.microsoft.com/en-us/sql/t-sql/statements/create-statistics-transact-sql?view=sql-server-2017

Grant Fritchey SSC Guru Points: 398694 More actions · Answer 2

jontris - Tuesday, August 28, 2018 5:30 AM
On a SQL 2016 instance I have a large table with dozens of columns including:
COLA INT
COLB BIT
and nonclustered index:
(COLA ASC, COLB ASC)
I observe the following:
select * from table where COLA = n and COLB = 1            /* index scan */
select * from table where COLA = n and COLB = 0            /* index seek */
select * from table where COLA = n /* index seek */
select COLA, COLB from table where COLA = n and COLB = 1     /* index seek */
In other words it behaves as expected in every case except when there is the combination of 'select *' for the column list and 'COLB = 1' as second predicate. The seeks are instant but the scan takes minutes, so this is a problem. I haven't attempted to isolate the problem further which could take forever, but am I missing something obvious?

Focus on the estimated rows returned. That will tell you why the optimizer thought a scan was superior. If you're running the latest CU on 2016, you should also see the statistics that were used to generate the plan. You can use those to check what's going on, per Johnathan Roberts post above.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

jontris Right there with Babe Points: 714 More actions · Answer 3

Thanks for the suggestions. Stats were out of date but updating didn't make any difference. Rebuilding the index did, so yes I missed something obvious. Doesn't explain why the index was used and worked perfectly well in most cases but not in the particular case described, but hey, the optimiser is a quirky and imperfect beast. I should know that by now.

Grant Fritchey SSC Guru Points: 398694 More actions · Answer 4

jontris - Tuesday, August 28, 2018 7:07 AM
Thanks for the suggestions. Stats were out of date but updating didn't make any difference. Rebuilding the index did, so yes I missed something obvious. Doesn't explain why the index was used and worked perfectly well in most cases but not in the particular case described, but hey, the optimiser is a quirky and imperfect beast. I should know that by now.

So...

Rebuilding an index, by itself, will not affect execution plans in any way. What happened is probably, you updated the stats using sp_updatestats, or you used UPDATE STATISTICS, but specified something other than FULLSCAN. You get a sampled statistics update. You rebuilt the index. After the index is rebuilt, it also updates the statistics, but it does so with a FULLSCAN. It's that FULLSCAN that gave you what you need. You should know this because, you can directly control the statistics so you arrive at the FULLSCAN as you desire rather than rebuilding indexes (which, I now consider a dubious thing to do, except for columnstore indexes).

Good to know it was just the statistics being off.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

jontris Right there with Babe Points: 714 More actions · Answer 5

Thanks Grant. I accept what you say about it being the FULLSCAN that made the difference, not the rebuilding per se. I would argue though that in the circumstances described there is no rationale for 'COLA = vala' generating a different plan from 'COLA = vala and COLB = valb', however unreliable the stats. If using the index filters 'COLA = vala' in the most efficient way then it must perforce also filter 'COLA = vala and COLB = colb' in the most most efficient way, as the latter could never return more rows to be looked up, only fewer, or in the worst scenario the same number. At least that's how I see it. A moot point now of course. It doesn't alter the fundamental principle - use current stats.

Jeff Moden SSC Guru Points: 1003865 More actions · Answer 6

While I agree that rebuilding an index will typically not help code and that having proper stats usually will, there is one time where rebuilding an index actually does help.

If the data for a query isn't already in memory, then it must be read into memory before it can be used. SQL Server attempts to guess what's next and will do "Read Aheads" to anticipate the next request for data, which is normally very appropriate. If the index is logically fragmented to a large extent (no pun intended), then it "short-sheets" (Grant, you know what that is because you road the boats) the read aheads requiring more trips to the disk, which is the slowest thing in the whole system if it's a physical disk behind the scenes.

Both rebuilding or reorganizing such an index will solve the read-ahead problem but, and I don't say this lightly because of the last 10 months of extreme testing I've been doing and documenting, don't use reorganize to solve such a problem. I does little to nothing to solve page density problems and it removes critical free space that's needed to help prevent page splits for inserts and some forms of updates.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jonathan AC Roberts SSCoach Points: 18263 More actions · Answer 7

jontris - Tuesday, August 28, 2018 7:40 AM
Thanks Grant. I accept what you say about it being the FULLSCAN that made the difference, not the rebuilding per se. I would argue though that in the circumstances described there is no rationale for 'COLA = vala' generating a different plan from 'COLA = vala and COLB = valb', however unreliable the stats. If using the index filters 'COLA = vala' in the most efficient way then it must perforce also filter 'COLA = vala and COLB = colb' in the most most efficient way, as the latter could never return more rows to be looked up, only fewer, or in the worst scenario the same number. At least that's how I see it. A moot point now of course. It doesn't alter the fundamental principle - use current stats.

Yes, that would be very strange if you used the same value of "vala".
When you ran the query did you use exactly the same value "vala" in the WHERE COLA = 'vala' for both queries or was it a different value?

Grant Fritchey SSC Guru Points: 398694 More actions · Answer 8

jontris - Tuesday, August 28, 2018 7:40 AM
Thanks Grant. I accept what you say about it being the FULLSCAN that made the difference, not the rebuilding per se. I would argue though that in the circumstances described there is no rationale for 'COLA = vala' generating a different plan from 'COLA = vala and COLB = valb', however unreliable the stats. If using the index filters 'COLA = vala' in the most efficient way then it must perforce also filter 'COLA = vala and COLB = colb' in the most most efficient way, as the latter could never return more rows to be looked up, only fewer, or in the worst scenario the same number. At least that's how I see it. A moot point now of course. It doesn't alter the fundamental principle - use current stats.

It's down to the statistics driving row estimates. You'd have to see how the optimizer saw each data set. Without seeing the plans, I can't tell you what's up, but it's something along those lines. That much I can say for sure. It really doesn't matter what you or I think behavior should be. It's what the optimizer decides based on the row counts.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

jontris Right there with Babe Points: 714 More actions · Answer 9

Yes vala was the same in both queries. Actually, because it returned 0 rows I found a value of vala that returned > 0 rows and tried that in both queries as well. Query plan-wise it made no difference at all. I'm not so concerned about the optimiser getting it wrong sometimes. If it didn't get it right most of the time we wouldn't be using SQL Server. It's more about me excusing myself for missing the obvious. Back to real work now, but thanks again for your time.

ScottPletcher SSC Guru Points: 100949 More actions · Answer 10

If you (almost) always query by COLA or ( COLA, COLB ), then cluster the table on ( COLA, COLB ). Then you'll always get a seek with either of those searchesm, and you don't have to constantly adjust non-clus indexes as stats change, or constantly test stats. [Nonclus indexes are typically not useful for any significant numbers of rows (unless you make them covering indexes, in which case you end up storing table data 2-6x to maintain all the covering mini-tables).]

Btw, forget any myth about (almost) "always clustering by identity". IF you (almost) always query the table that way, cluster it that way. All your queries should speed up without additional work.

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Jeff Moden SSC Guru Points: 1003865 More actions · Answer 11

ScottPletcher - Tuesday, August 28, 2018 9:29 AM
If you (almost) always query by COLA or ( COLA, COLB ), then cluster the table on ( COLA, COLB ). Then you'll always get a seek with either of those searchesm, and you don't have to constantly adjust non-clus indexes as stats change, or constantly test stats. [Nonclus indexes are typically not useful for any significant numbers of rows (unless you make them covering indexes, in which case you end up storing table data 2-6x to maintain all the covering mini-tables).]
Btw, forget any myth about (almost) "always clustering by identity". IF you (almost) always query the table that way, cluster it that way. All your queries should speed up without additional work.

Before you cluster on a column, understand the fragmentation patterns during inserts and expansive updates or you may be in for a nasty surprise when it comes to page splits and future read-aheads, which can cause huge performance issues. Also understand that the leaf level of clustered indexes is usually far wider than a proper non-clustered index and the performance and memory ramifications for that little issue. Clustering is not a panacea of performance. There are, indeed, other serious side affects that you must also take into consideration.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ScottPletcher SSC Guru Points: 100949 More actions · Answer 12

Jeff Moden - Tuesday, August 28, 2018 11:43 AM
ScottPletcher - Tuesday, August 28, 2018 9:29 AM
If you (almost) always query by COLA or ( COLA, COLB ), then cluster the table on ( COLA, COLB ). Then you'll always get a seek with either of those searchesm, and you don't have to constantly adjust non-clus indexes as stats change, or constantly test stats. [Nonclus indexes are typically not useful for any significant numbers of rows (unless you make them covering indexes, in which case you end up storing table data 2-6x to maintain all the covering mini-tables).]
Btw, forget any myth about (almost) "always clustering by identity". IF you (almost) always query the table that way, cluster it that way. All your queries should speed up without additional work.
Before you cluster on a column, understand the fragmentation patterns during inserts and expansive updates or you may be in for a nasty surprise when it comes to page splits and future read-aheads, which can cause huge performance issues. Also understand that the leaf level of clustered indexes is usually far wider than a proper non-clustered index and the performance and memory ramifications for that little issue. Clustering is not a panacea of performance. There are, indeed, other serious side affects that you must also take into consideration.

Unless the values are just extraordinarily random, that's almost never an issue. Nothing's a panacea, but best clustering is the closest thing to a panacea that you have.

Don't forget that when the data is properly clustered, you can then remove all non-clus indexes that start with ( ColA ) or ( ColA, ColB ).

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Jeff Moden SSC Guru Points: 1003865 More actions · Answer 13

ScottPletcher - Tuesday, August 28, 2018 11:54 AM
Jeff Moden - Tuesday, August 28, 2018 11:43 AM
ScottPletcher - Tuesday, August 28, 2018 9:29 AM
If you (almost) always query by COLA or ( COLA, COLB ), then cluster the table on ( COLA, COLB ). Then you'll always get a seek with either of those searchesm, and you don't have to constantly adjust non-clus indexes as stats change, or constantly test stats. [Nonclus indexes are typically not useful for any significant numbers of rows (unless you make them covering indexes, in which case you end up storing table data 2-6x to maintain all the covering mini-tables).]
Btw, forget any myth about (almost) "always clustering by identity". IF you (almost) always query the table that way, cluster it that way. All your queries should speed up without additional work.
Before you cluster on a column, understand the fragmentation patterns during inserts and expansive updates or you may be in for a nasty surprise when it comes to page splits and future read-aheads, which can cause huge performance issues. Also understand that the leaf level of clustered indexes is usually far wider than a proper non-clustered index and the performance and memory ramifications for that little issue. Clustering is not a panacea of performance. There are, indeed, other serious side affects that you must also take into consideration.
Unless the values are just extraordinarily random, that's almost never an issue. Nothing's a panacea, but best clustering is the closest thing to a panacea that you have.
Don't forget that when the data is properly clustered, you can then remove all non-clus indexes that start with ( ColA ) or ( ColA, ColB ).

Contrary to popular belief, extraordinarily random columns would actually be better than anything that has a limited domain, suffers out of order inserts or updates, or is subject to inserts followed by expansive updates. And, no... clustering isn't even close to being a panacea because they don't actually know of all the other things that I speak of.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

ScottPletcher SSC Guru Points: 100949 More actions · Answer 14

Jeff Moden - Tuesday, August 28, 2018 12:15 PM
ScottPletcher - Tuesday, August 28, 2018 11:54 AM
Jeff Moden - Tuesday, August 28, 2018 11:43 AM
ScottPletcher - Tuesday, August 28, 2018 9:29 AM
If you (almost) always query by COLA or ( COLA, COLB ), then cluster the table on ( COLA, COLB ). Then you'll always get a seek with either of those searchesm, and you don't have to constantly adjust non-clus indexes as stats change, or constantly test stats. [Nonclus indexes are typically not useful for any significant numbers of rows (unless you make them covering indexes, in which case you end up storing table data 2-6x to maintain all the covering mini-tables).]
Btw, forget any myth about (almost) "always clustering by identity". IF you (almost) always query the table that way, cluster it that way. All your queries should speed up without additional work.
Before you cluster on a column, understand the fragmentation patterns during inserts and expansive updates or you may be in for a nasty surprise when it comes to page splits and future read-aheads, which can cause huge performance issues. Also understand that the leaf level of clustered indexes is usually far wider than a proper non-clustered index and the performance and memory ramifications for that little issue. Clustering is not a panacea of performance. There are, indeed, other serious side affects that you must also take into consideration.
Unless the values are just extraordinarily random, that's almost never an issue. Nothing's a panacea, but best clustering is the closest thing to a panacea that you have.
Don't forget that when the data is properly clustered, you can then remove all non-clus indexes that start with ( ColA ) or ( ColA, ColB ).
Contrary to popular belief, extraordinarily random columns would actually be better than anything that has a limited domain, suffers out of order inserts or updates, or is subject to inserts followed by expansive updates. And, no... clustering isn't even close to being a panacea because they don't actually know of all the other things that I speak of.

I've got thirty years of experience with rdbms tuning, the last 17 years focused on SQL Server. I've had only a handful of dbs where the proper clustering for a majority of tables was an identity column. There's most often something much better.

Yes, one should check index usage stats, missing index stats, cardinality and other things, but most often, identity is the wrong clustering for best overall performance. All the I/O from the gazillions of covering indexes really adds up.

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.